Slide 1 : 1 J.S.Paresh
K.Shyam Sunder
Natural? : 2 Natural? Natural Language?
Refers to the language spoken by people, e.g. English, Japanese, Swahili, as opposed to artificial languages, like C++, Java, etc.
Natural Language Processing
Applications that deal with natural language in one way or another; a subfield of Artificial Intelligence
Computational Linguistics
Doing linguistics on computers
More on the linguistic side than NLP, but closely related
What is Artificial Intelligence? : 3 What is Artificial Intelligence? The use of computer programs and programming techniques to cast light on the principles of intelligence in general and human thought in particular (Boden)
AI is the study of how to do things which at the moment people do better (Rich & Knight)
AI is the science of making machines do things that would require intelligence if done by men. (Minsky)
Why Natural Language Processing? : 4 Why Natural Language Processing? kJfmmfj mmmvvv nnnffn333
Uj iheale eleee mnster vensi credur
Baboi oi cestnitze
Coovoel2^ ekk; ldsllk lkdf vnnjfj?
Fgmflmllk mlfm kfre xnnn!
Computers Lack Knowledge! : 5 Computers Lack Knowledge! Computers “see” text in English the same way you saw the previous text!
People have no trouble understanding language
Common sense knowledge
Reasoning capacity
Experience
Computers have
No common sense knowledge
No reasoning capacity
Unless we teach them!
Why Natural Language Processing? : 6 Why Natural Language Processing? Huge amounts of data
Internet = at least 8 billion pages
Intranet
Applications for processing large amounts of text require NLP expertise
Classify text into categories
Index and search large texts
Automatic translation
Speech understanding
Understand phone conversations
Information extraction
Extract useful information from resumes
Automatic summarization
Condense 1 book into 1 page
Question answering
Knowledge acquisition
Text generation / dialogue
Where does it fit in the CS taxonomy? : 7 Where does it fit in the CS taxonomy? [Tree diagram; levels from top to bottom]
Computers
Artificial Intelligence, Algorithms, Databases, Networking
Robotics, Search, Natural Language Processing
Information Retrieval, Machine Translation, Language Analysis
Semantics, Parsing
Situating NLP : 8 Situating NLP [Diagram] NLP at the intersection of: computer science, psychology/cognitive science, linguistics, math/statistics, philosophy, communication
Theoretical foundations : 9 Theoretical foundations math: statistics, calculus, algebra, modeling
computational paradigms: connectionist, rule-based, cognitively plausible
linguistics: LFG, HPSG, GB, OT, CG, etc.
architectures: stacks, automata, networks, compilers
Some areas of research : 10 Some areas of research Corpora, tools, resources, standards
Language/grammar engineering
Machine (assisted) translation, tools
Language modeling
Lexicography
Speech
Slide 11 : 11 Linguistics Essentials
The Description of Language : 12 The Description of Language Language = Words and Rules
Dictionary (vocabulary) + Grammar
Dictionary
set of words defined in the language
open (dynamic)
Traditional
paper based
Electronic
machine readable dictionaries; can be obtained from paper-based
Grammar
set of rules which describe what is allowable in a language
Classic Grammars
meant for humans who know the language
definitions and rules are mainly supported by examples
no (or almost no) formal description tools; cannot be programmed
Explicit Grammar (CFG, Dependency Grammars, Link Grammars,...)
formal description can be programmed & tested on data (texts)
Linguistics Levels of Analysis : 13 Linguistics Levels of Analysis Speech
Written language
Phonology: sounds / letters / pronunciation
Morphology: the structure of words
Syntax: how these sequences are structured
Semantics: meaning of the strings
Interaction between levels where each level has an input and an output.
Phonetics/Orthography : 14 Phonetics/Orthography Input:
acoustic signal (phonetics) / text (orthography)
Output:
phonetic alphabet (phonetics) / text (orthography)
Deals with:
Phonetics:
consonant & vowel (& others) formation in the vocal tract
classification of consonants, vowels, ... in relation to frequencies, shape & position of the tongue and various muscles
intonation
Orthography: normalization, punctuation, etc.
Phonology : 15 Phonology Input:
sequence of phones/sounds (in a phonetic alphabet); or “normalized” text (sequence of (surface) letters in one language’s alphabet) [NB: phones vs. phonemes]
Output:
sequence of phonemes (~ (lexical) letters; in an abstract alphabet)
Deals with:
relation between sounds and phonemes (units which might have some function on the upper level)
e.g.: [u] ~ oo (as in book), [æ] ~ a (cat); i ~ y (flies)
Morphology : 16 Morphology Input:
sequence of phonemes (~ (lexical) letters)
Output:
sequence of pairs (lemma, (morphological) tag)
Deals with:
composition of phonemes into word forms and their underlying lemmas (lexical units) + morphological categories (inflection, derivation, compounding)
e.g. quotations ~ quote/V + -ation(der.V->N) + NNS.
...and Beyond : 17 ...and Beyond Input:
sentence structure (tree): annotated nodes (autosemantic lemmas, (morphosyntactic) tags, deep functions)
Output:
logical form, which can be evaluated (true/false)
Deals with:
assignment of objects from the real world to the nodes of the sentence structure
e.g.: (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f) ~
see(Mark-Twain[SSN:...],Tom-Sawyer[SSN:...])
Phonology : 18 Phonology (Surface ↔ Lexical) Correspondence
“symbol-based” (no complex structures)
Ex.: (stem-final change)
lexical: b a b y + s (+ denotes start of ending)
surface: b a b i e s (phonetic-related: bébì0s)
Arabic: (interfixing, inside-stem doubling)
lexical: kTb+uu+CVCCVC (CVCC...vowel/consonant pattern)
surface: kuttub
Phonology Examples : 19 Phonology Examples German (umlaut) (satz ~ sentence)
lexical: s A t z + e (A denotes “umlautable” a)
surface: s ä t z e (phonetic: zæce, vs. zac)
Turkish (vowel harmony)
lexical: e v + l A r (~house)
surface: e v l e r
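The lexical-to-surface correspondences above can be sketched as small rewrite rules. This is a minimal illustration only: the function names, the vowel sets and the consonant-plus-y condition are assumptions made for the example, not a full two-level phonology.

```python
# Illustrative lexical -> surface rules; names and rule coverage are assumptions.
FRONT_VOWELS = set("eiöü")
BACK_VOWELS = set("aıou")


def english_plural(lexical: str) -> str:
    """Stem-final change: consonant + y before the ending s surfaces as ies."""
    stem, ending = lexical.split("+")
    if ending == "s" and len(stem) > 1 and stem.endswith("y") and stem[-2] not in "aeiou":
        return stem[:-1] + "ies"
    return stem + ending


def turkish_harmony(lexical: str) -> str:
    """Vowel harmony: lexical A surfaces as e after a front vowel, otherwise as a."""
    stem, ending = lexical.split("+")
    vowels = [ch for ch in stem if ch in FRONT_VOWELS | BACK_VOWELS]
    surface_vowel = "e" if vowels and vowels[-1] in FRONT_VOWELS else "a"
    return stem + ending.replace("A", surface_vowel)


print(english_plural("baby+s"))   # babies
print(turkish_harmony("ev+lAr"))  # evler
```

Each rule looks only at symbols next to the morpheme boundary, which is what makes this level "symbol-based".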
Morphology: Morphemes & Order : 20 Morphology: Morphemes & Order Scientific study of forms of words
Grouping of phonemes into morphemes
sequence deliverables ~ deliver, able and s (3 units)
could as well be some “ID” numbers:
e.g. deliver ~ 23987, s ~ 12, able ~ 3456
Morpheme Combination
certain combinations/sequencing possible, others not:
deliver+able+s, but not able+deliver+s; noun+s, but not noun+ing
typically fixed (in any given language)
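Grouping letters into morphemes under a fixed ordering can be sketched as a small search over a morpheme lexicon. The three-entry lexicon, the class names and the ordering table below are assumptions chosen to cover the "deliverables" example only.

```python
# Toy morpheme lexicon; the classes and the ordering table are assumptions.
MORPHEME_CLASS = {"deliver": "stem", "able": "derivation", "s": "inflection"}
ALLOWED_NEXT = {
    "START": {"stem"},
    "stem": {"derivation", "inflection"},
    "derivation": {"inflection"},
    "inflection": set(),
}


def segment(word, previous="START"):
    """Return the morphemes covering `word` in an allowed order, or None."""
    if not word:
        return []
    for morpheme, cls in MORPHEME_CLASS.items():
        if word.startswith(morpheme) and cls in ALLOWED_NEXT[previous]:
            rest = segment(word[len(morpheme):], cls)
            if rest is not None:
                return [morpheme] + rest
    return None


print(segment("deliverables"))  # ['deliver', 'able', 's']
print(segment("abledelivers"))  # None
```

The second call fails because a derivational suffix is not allowed to start a word in the ordering table.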
The Dictionary (or Lexicon) : 21 The Dictionary (or Lexicon) Repository of information about words:
Morphological:
description of morphological “behavior”: inflection patterns/classes
Syntactic:
Part of Speech
relations to other words:
subcategorization (or “surface valency frames”)
Semantic:
semantic features
frames
...and any other! (e.g., translation)
(Surface) Syntax : 22 (Surface) Syntax Input:
sequence of pairs (lemma, (morphological) tag)
Output:
sentence structure (tree) with annotated nodes (all lemmas, (morphosyntactic) tags, functions), of various forms
Deals with:
the relation between lemmas & morphological categories and the sentence structure
uses syntactic categories such as Subject, Verb, Object,...
e.g.: I/PP1 see/VB a/DT dog/NN ~
((I/sg)SB ((see/pres)V (a/ind dog/sg)OBJ)VP)S
Issues in Syntax : 23 Issues in Syntax “the dog ate my homework” - Who did what?
1. Identify the part of speech (POS)
Dog = noun ; ate = verb ; homework = noun
English POS tagging: about 95% accuracy
Can be improved!
Part-of-speech tagging for other languages is almost nonexistent
2. Identify collocations
mother in law, hot dog
Compositional versus non-compositional collocates
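For the part-of-speech step, a common baseline is to give each word the tag it received most often in hand-tagged text. The sketch below assumes a two-sentence training set invented for illustration and a default tag of NN for unknown words; it is not the tagger behind the accuracy figure on the slide.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus, invented for illustration.
TAGGED = [
    [("the", "DT"), ("dog", "NN"), ("ate", "VB"), ("my", "PRP$"), ("homework", "NN")],
    [("the", "DT"), ("dog", "NN"), ("chased", "VB"), ("the", "DT"), ("bear", "NN")],
]


def train(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="NN"):
    """Tag each word; unknown words fall back to the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]


model = train(TAGGED)
print(tag("the dog ate my homework".split(), model))
```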
Issues in Syntax : 24 Issues in Syntax Shallow parsing:
“the dog chased the bear”
“the dog” “chased the bear”
subject - predicate
Identify basic structures
NP-[the dog] VP-[chased the bear]
Shallow parsing on new languages
Shallow parsing with little training data
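The subject/predicate split above can be sketched by cutting the sentence at its first verb. The tag lexicon is hand-written for this one sentence, and the single-cut rule is an assumption that only suits simple subject-verb-object sentences.

```python
# Hand-written tag lexicon for the example sentence; an assumption for illustration.
TAGS = {"the": "DT", "dog": "NN", "bear": "NN", "chased": "VB"}


def shallow_parse(sentence):
    """Split a sentence into a subject NP and a predicate VP at the first verb."""
    words = sentence.split()
    tags = [TAGS[w] for w in words]
    verb_at = tags.index("VB")  # position of the first verb
    return [("NP", words[:verb_at]), ("VP", words[verb_at:])]


def show(chunks):
    """Render chunks in the bracket notation used on the slide."""
    return " ".join(f"{label}-[{' '.join(ws)}]" for label, ws in chunks)


print(show(shallow_parse("the dog chased the bear")))
# NP-[the dog] VP-[chased the bear]
```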
Issues in Syntax : 25 Issues in Syntax Full parsing:
John loves Mary
Current precisions: 85-88%
Helps answer (automatically) questions like: Who did what and when?
Meaning (semantics) : 26 Meaning (semantics) Input:
sentence structure (tree) with annotated nodes (lemmas, (morphosyntactic) tags, surface functions)
Output:
sentence structure (tree) with annotated nodes (semantic lemmas, (morpho-syntactic) tags, deep functions)
Deals with:
relation between categories such as “Subject”, “Object” and (deep) categories such as “Agent”, “Effect”; adds other cat’s
e.g. ((I)SB ((was seen)V (by Tom)OBJ)VP)S ~
(I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f)
Issues in Semantics : 27 Issues in Semantics Understand language! How?
“plant” = industrial plant
“plant” = living organism
Words are ambiguous
Importance of semantics?
Machine Translation: wrong translations
Information Retrieval: wrong information
Anaphora Resolution: wrong referents
Why Semantics? : 28 Why Semantics?
English: The sea is home to millions of plants and animals
English → French [commercial MT system]:
Le mer est a la maison de billion des usines et des animaux
French → English:
The sea is at the home for billions of factories and animals
Issues in Semantics : 29 Issues in Semantics How to learn the meaning of words?
From dictionaries:
plant, works, industrial plant -- (buildings for carrying on industrial labor; "they built a large plant to manufacture automobiles")
plant, flora, plant life -- (a living organism lacking the power of locomotion)
They are producing about 1,000 automobiles in the new plant
The sea flora consists of 1,000 different plant species
The plant was close to the farm of animals.
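One way to use such dictionary entries is to pick the sense whose entry shares the most words with the sentence, a simplified form of the Lesk approach. The sketch below builds the two sense signatures from the entries quoted above; the stopword list and the tie rule (return None) are assumptions.

```python
import re

STOPWORDS = {"a", "the", "of", "in", "to", "for", "on", "they", "are", "was", "about"}

# Signatures built from the two dictionary entries quoted on the slide.
SENSES = {
    "industrial plant": "plant works industrial plant buildings for carrying on "
                        "industrial labor they built a large plant to manufacture automobiles",
    "living organism": "plant flora plant life a living organism lacking the power of locomotion",
}


def content_words(text, target):
    """Lower-cased words of `text` without stopwords and without the target word."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words - STOPWORDS - {target}


def disambiguate(sentence, target="plant"):
    """Pick the sense whose signature shares the most words with the sentence."""
    context = content_words(sentence, target)
    scores = {sense: len(context & content_words(sig, target)) for sense, sig in SENSES.items()}
    best = max(scores, key=scores.get)
    tied = list(scores.values()).count(scores[best]) > 1
    return None if tied else best


print(disambiguate("They are producing about 1,000 automobiles in the new plant"))
print(disambiguate("The sea flora consists of 1,000 different plant species"))
print(disambiguate("The plant was close to the farm of animals."))
```

The third sentence shares no content word with either entry, so the sketch returns None: dictionary overlap alone does not resolve it.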
Issues in Semantics : 30 Issues in Semantics Learn from annotated examples:
Assume 100 examples containing “plant” previously tagged by a human
Train a learning algorithm
Precisions in the range 60%-70%-(80%)
How to choose the learning algorithm?
How to obtain the 100 tagged examples?
Issues in Learning Semantics : 31 Issues in Learning Semantics Learning?
Assume a (large) amount of annotated data = training
Assume a new text not annotated = test
Learn from previous experience (training) to classify new data (test)
Decision trees, memory based learning, neural networks
Machine Learning
Which one performs best?
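As one instance of the train/test setup above, memory-based learning stores the annotated examples and labels a new sentence with the tag of its most similar stored example. The four training sentences are invented for illustration, and Jaccard word overlap is an assumed similarity measure.

```python
# Hand-labelled training examples, invented for illustration.
TRAINING = [
    ("they built a large plant to manufacture automobiles", "industrial"),
    ("the plant employs two hundred workers", "industrial"),
    ("the plant needs water and sunlight to grow", "living"),
    ("this plant species grows near the sea", "living"),
]


def features(text):
    """Bag-of-words features: the set of lower-cased tokens."""
    return set(text.lower().split())


def classify(sentence):
    """Memory-based learning: copy the label of the most similar stored example."""
    query = features(sentence)

    def similarity(example):
        stored = features(example[0])
        return len(query & stored) / len(query | stored)  # Jaccard overlap

    return max(TRAINING, key=similarity)[1]


print(classify("they are producing automobiles in the new plant"))
print(classify("the sea flora consists of different plant species"))
```

Decision trees or neural networks would replace only the `classify` step; the split into annotated training data and unseen test data stays the same.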
Issues in Semantics : 32 Issues in Semantics Automatic annotation of data
Active learning
Identify only the hard examples
Co-training
Identify the examples where several techniques agree on the semantic tag
Collecting from Web users
Open Mind Word Expert
Problems faced by Natural Language-Understanding Systems : 33 Problems faced by Natural Language-Understanding Systems
Speech & Text segmentation : 34 Speech & Text segmentation In spoken language, the sounds representing successive letters blend into each other
This makes the conversion of the analog signal to discrete characters very difficult
Regarding text segmentation, some written languages, such as Chinese, Japanese and Thai, do not mark word boundaries
So any significant text parsing requires identifying word boundaries, which is often a non-trivial task
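A simple baseline for finding word boundaries is greedy maximum matching against a lexicon. The sketch below uses English text with the spaces removed as a stand-in for an unsegmented script; the toy lexicon and the single-character fallback are assumptions.

```python
# Toy lexicon; a real system would use a large dictionary or a trained model.
LEXICON = {"the", "dog", "ate", "my", "home", "work", "homework", "a", "at"}


def max_match(text, lexicon=LEXICON):
    """Greedy left-to-right segmentation: always take the longest known word."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:  # no known word starts here: emit one character
            words.append(text[i])
            i += 1
    return words


print(max_match("thedogatemyhomework"))
# ['the', 'dog', 'ate', 'my', 'homework']
```

Greedy matching prefers "homework" over "home" + "work"; it can also choose wrongly when the longest match is not the intended word, which is part of why the task is non-trivial.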
Word sense disambiguation : 35 Word sense disambiguation Word sense disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities.
Sense Inventory usually comes from a dictionary or thesaurus.
Knowledge intensive methods, supervised learning, and (sometimes) bootstrapping approaches
Word sense discrimination is the problem of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory.
Unsupervised techniques
Word sense disambiguation: Computers versus Humans : 36 Word sense disambiguation: Computers versus Humans Polysemy – most words have many possible meanings.
A computer program has no basis for knowing which one is appropriate, even if it is obvious to a human…
Ambiguity is rarely a problem for humans in their day to day communication, except in extreme cases…
Word sense disambiguation: Ambiguity for a Computer : 37 Word sense disambiguation: Ambiguity for a Computer The fisherman jumped off the bank and into the water.
The bank down the street was robbed!
Back in the day, we had an entire bank of computers devoted to this problem.
The bank in that road is entirely too steep and is really dangerous.
The plane took a bank to the left, and then headed off towards the mountains.
Syntactic ambiguity : 38 Syntactic ambiguity There are often multiple possible parse trees for a given sentence.
Choosing the most appropriate one usually requires semantic and contextual information.
Specific problem components here are:
Sentence boundary disambiguation
Imperfect input
Foreign or regional accents etc.
Statistical NLP : 39 Statistical NLP Statistical NLP uses stochastic, probabilistic and statistical methods to resolve some difficulties of NLP
Methods for disambiguation often involve the use of corpora & Markov models.
Technology for statistical NLP comes from machine learning and data mining both of which involve learning from data.
Major Tasks in NLP : 40 Major Tasks in NLP Speech Recognition
Natural Language Generation
Machine Translation
Information Retrieval
Information Extraction
Text Simplification
Automatic summarization
Foreign Language Reading & writing aid
Speech Recognition : 41 Speech Recognition It is the process of converting a speech signal to a sequence of words by means of an algorithm implemented as a computer program.
Applications are :
Voice dialing
Call routing
Simple data entry
Preparation of structured documents
Natural Language generation : 42 Natural Language generation It is the task of generating natural language from a machine representation such as a knowledge base or a logical form.
Ex: Choose randomly among outputs:
– Visitant which came into the place where it will be Japanese has admired that there was Mount Fuji.
Top 10 outputs according to bigram probabilities:
– Visitors who came in Japan admire Mount Fuji.
– Visitors who came in Japan admires Mount Fuji.
– Visitors who arrived in Japan admire Mount Fuji.
– A visitor who came in Japan admire Mount Fuji.
– The visitor who came in Japan admire Mount Fuji.
– Visitors who came in Japan admire Mount Fuji.
– The visitor who came in Japan admires Mount Fuji.
– Mount Fuji is admired by a visitor who came in Japan.
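Ranking candidate outputs by bigram probability can be sketched as follows. The three-sentence training corpus is invented, and add-one smoothing is an assumed choice so that unseen bigrams do not get zero probability; the slide does not say how its probabilities were estimated.

```python
import math
from collections import Counter

# Tiny training corpus, invented for illustration.
CORPUS = [
    "visitors admire mount fuji",
    "a visitor admires mount fuji",
    "visitors who came in japan admire the mountain",
]


def bigrams(sentence):
    """Bigrams of a sentence with start and end markers."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return list(zip(tokens, tokens[1:]))


bigram_counts = Counter(b for s in CORPUS for b in bigrams(s))
context_counts = Counter()
for (prev, _), n in bigram_counts.items():
    context_counts[prev] += n
VOCAB = {w for pair in bigram_counts for w in pair}


def log_prob(sentence):
    """Log probability under a bigram model with add-one smoothing."""
    return sum(
        math.log((bigram_counts[b] + 1) / (context_counts[b[0]] + len(VOCAB)))
        for b in bigrams(sentence)
    )


def rank(candidates):
    """Most probable candidate first."""
    return sorted(candidates, key=log_prob, reverse=True)


candidates = [
    "visitant which came into the place where it will be japanese has admired that there was mount fuji",
    "visitors who came in japan admires mount fuji",
    "visitors who came in japan admire mount fuji",
]
for sentence in rank(candidates):
    print(sentence)
```

The model knows nothing about grammar; it prefers "japan admire" only because that word pair occurred in its training text.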
Machine Translations : 43 Machine Translations Machine Translation (MT) is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another
Issues in Machine Translations : 44 Issues in Machine Translations Text to Text Machine Translations
Speech to Speech Machine Translations
Most of the work has addressed pairs of widely spoken languages like English-French, English-Chinese
How to translate text?
Learn from previously translated data
Need parallel corpora
French-English, Chinese-English have the Hansards
Reasonable translations
Chinese-Hindi – no such tools available today!
Issues in Machine Translations : 45 Issues in Machine Translations How to obtain parallel texts?
From the Web! How?
From Web users! How?
Once we have the texts, how to get most out of them?
Word alignments
Obtain lexicons
Import knowledge from well studied languages
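Obtaining a lexicon from word alignments can be sketched with expectation-maximisation in the style of IBM Model 1. This is a minimal sketch under assumptions: the three-sentence English-French corpus is invented, there is no NULL word, and the iteration count is arbitrary.

```python
from collections import defaultdict

# Tiny English-French parallel corpus, invented for illustration.
PARALLEL = [
    ("the house".split(), "la maison".split()),
    ("the flower".split(), "la fleur".split()),
    ("a flower".split(), "une fleur".split()),
]


def train_model1(corpus, iterations=10):
    """Estimate translation probabilities t[(f, e)] with EM (no NULL word)."""
    french_vocab = {f for _, fs in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(french_vocab))  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for es, fs in corpus:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    share = t[(f, e)] / norm  # expected alignment count
                    count[(f, e)] += share
                    total[e] += share
        t = defaultdict(float, {pair: c / total[pair[1]] for pair, c in count.items()})
    return t


def lexicon(t):
    """For each English word, keep its most probable French translation."""
    best = {}
    for (f, e), p in t.items():
        if p > best.get(e, ("", 0.0))[1]:
            best[e] = (f, p)
    return {e: f for e, (f, _) in best.items()}


print(lexicon(train_model1(PARALLEL)))
```

No alignment is given in the data; the pairing emerges because "la" co-occurs with "the" in two sentence pairs and "fleur" with "flower" in two.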
Information Extraction : 46 Information Extraction It is a type of information retrieval whose goal is to automatically extract structured or semi-structured information from unstructured machine-readable documents.
Its significance is determined by the growing amount of information available in unstructured form, for instance on the Internet.
Issues in Information Extraction : 47 Issues in Information Extraction “There was a group of about 8-9 people close to the entrance on Highway 75”
Who? “8-9 people”
Where? “highway 75”
Extract information
Detect new patterns:
Detect hacking / hidden information / etc.
Gov./mil. put lots of money into IE research
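The Who?/Where? extraction for the example sentence can be sketched with two regular expressions. The patterns are hand-written for this one sentence and are assumptions; they would not generalise without many more patterns or a learned model.

```python
import re

TEXT = "There was a group of about 8-9 people close to the entrance on Highway 75"

# Hand-written patterns for this one example.
WHO = re.compile(r"\b\d+(?:-\d+)?\s+people\b")
WHERE = re.compile(r"\b[Hh]ighway\s+\d+\b")


def extract(text):
    """Fill a small who/where template from free text."""
    who = WHO.search(text)
    where = WHERE.search(text)
    return {
        "who": who.group(0) if who else None,
        "where": where.group(0) if where else None,
    }


print(extract(TEXT))  # {'who': '8-9 people', 'where': 'Highway 75'}
```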
Information Retrieval : 48 Information Retrieval Information Retrieval (IR) is a science of searching
for information in documents,
for documents themselves,
for metadata or
searching within databases (any kind).
Issues in Information Retrieval : 49 Issues in Information Retrieval Index meaning
Search for plant (=living organism) should not retrieve texts with plant (=industrial plant)
But should retrieve documents including “flora” or other related terms
Index parsed relations
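Indexing meaning rather than surface words can be sketched with an inverted index whose keys carry a sense tag. The documents, the per-document sense labels and the related-term table are all hypothetical; in practice the sense labels would come from a word sense disambiguation step.

```python
from collections import defaultdict

DOCS = {
    1: "They are producing automobiles in the new plant",
    2: "The sea flora consists of many species",
    3: "The plant needs water and sunlight",
}
# Hypothetical sense labels for "plant" per document, and related terms per sense.
SENSE_OF_PLANT = {1: "industrial", 3: "living"}
RELATED = {"living": {"flora"}, "industrial": {"factory", "works"}}


def build_index(docs):
    """Inverted index in which 'plant' is stored under a sense-tagged key."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            key = f"plant/{SENSE_OF_PLANT[doc_id]}" if word == "plant" else word
            index[key].add(doc_id)
    return index


def search_plant(index, sense):
    """Documents using 'plant' in the given sense, plus documents with related terms."""
    hits = set(index.get(f"plant/{sense}", set()))
    for term in RELATED[sense]:
        hits |= index.get(term, set())
    return sorted(hits)


index = build_index(DOCS)
print(search_plant(index, "living"))      # [2, 3]
print(search_plant(index, "industrial"))  # [1]
```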
Issues in Information Retrieval : 50 Issues in Information Retrieval Retrieve specific information
Question Answering
“What is the height of Mount Everest?”
About 29,000 feet
Current state-of-the-art 40-50%
Improve precision with the use of more common sense knowledge
Perform domain specific question answering
Issues in Information Retrieval : 51 Issues in Information Retrieval Find information across languages!
Cross Language Information Retrieval
“What is the minimum age requirement for car rental in Italy?”
Search also Italian texts for “eta minima per noleggio macchine”
Integrate large number of languages
Integrate into performant IR engines
Text simplification & Proofreading : 52 Text simplification & Proofreading In NLP, text simplification is an important task because much English text consists of complex compound sentences that are not easily processed for information tasks.
Proofreading traditionally means reading a proof copy of a text in order to detect and correct any errors.
Modern proofreading often requires reading copy at earlier stages as well.
Automatic Summarization : 53 Automatic Summarization It is the creation of a shortened version of a text by a computer program.
As access to data has increased so has interest in automatic summarization. An example of the use of summarization technology is search engines such as Google.
Technologies that can make a coherent summary of any kind of text need to take into account several variables, such as length, writing style and syntax, to make a useful summary.
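A very simple form of automatic summarization is extractive: score each sentence by the frequency of its words and keep the top ones. The sketch below assumes a tiny stopword list and an unnormalised score, so longer sentences are favoured; it ignores the style and syntax variables mentioned above.

```python
import re
from collections import Counter

STOPWORDS = {"the", "is", "and", "a", "of"}


def summarize(text, max_sentences=1):
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z]+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return " ".join(s for s in sentences if s in top)  # keep original order


TEXT = ("NLP systems process text. "
        "NLP systems translate text and summarize text. "
        "The weather is nice.")
print(summarize(TEXT))  # NLP systems translate text and summarize text.
```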
Foreign Language Writing Aid : 54 Foreign Language Writing Aid It is a computer program that assists a non-native language user in their target language.
Assistive operations can be classified into two categories: on-the-fly prompts and post-writing checks.
Assisted aspects of writing include:
Lexical syntax, Lexical semantics, idiomatic expression transfer, etc.
On-line dictionaries can also be considered as a type of foreign language writing aid.
Slide 55 : 55 Language & speech technology has advanced rapidly in recent decades.
Slide 56 : 56 EveR-2 Muse, a robot version of a Korean woman in her twenties (Eve + R for robot), can hold a conversation, sing a song, make eye contact, and express anger, sorrow and joy. According to her creator, however, most Koreans found her homely in comparison to her predecessor.
Achievements of AI/ NLP : 57 Achievements of AI/ NLP Sphinx can recognise continuous speech. Without training for each speaker, it operates in near real time using a vocabulary of 1000 words and has 94% word accuracy.
Deep Thought is an international grand master chess player.
Navlab is a truck that can drive along a road at 55mph in normal traffic.
Carlton and United Breweries use an AI planning system to plan production of their beer.
Natural language interfaces to databases can be obtained on a PC.
Machine Learning methods have been used to build expert systems.
Expert systems are used regularly in finance, medicine, manufacturing, and agriculture
If this dream comes alive… : 58 If this dream comes alive… Even a person with no knowledge of computers could interact with one through colloquial conversation.
Almost all systems will be automated.
Many problems will have found a solution.
No one would need to learn computer languages any more; instead, people could interact with the computer in their own natural (regional) languages.
It would be a matter of jubilation for the world as a whole…
So let's await that wonderful day & work in this direction… : 59 So let's await that wonderful day & work in this direction…
Thank you : 60 Thank you