Text-Mining : The process used when reading detailed text, explanatory documents and financial reports.
What is Text-Mining? : Finding interesting regularities in large textual datasets
Where interesting means: non-trivial, hidden, previously unknown and potentially useful
Finding semantic and abstract information from the surface form of textual data
Which areas are active in Text Processing? : (diagram) Text Processing, Natural Language Processing, Information Retrieval, Knowledge Rep. & Reasoning
Why Text is Tough? (M.Hearst 97) : Abstract concepts are difficult to understand
“Countless” combinations of subtle, abstract relationships among concepts
Many ways to represent similar concepts
E.g. space ship, flying saucer, UFO
Concepts are difficult to visualize
Why Text is Easy? (M.Hearst 97) : Highly redundant
…most text methods count on this property
Just about any simple system can get “good” results for simple tasks:
Pull out “important” phrases
Find “meaningfully” related words
Mentally create some sort of summary from documents
Levels of Text Processing : Word Level
Words Properties
Stop-Words
Stemming
Frequent N-Grams
Thesaurus
Words Properties : Relations among word surface forms and their senses:
Homonymy: same form, but different meaning (e.g. bank: river bank, financial institution)
Polysemy: same form, related meaning (e.g. bank: blood bank, financial institution)
Synonymy: different form, same meaning (e.g. singer, vocalist)
Hyponymy: one word denotes a subclass of another (e.g. breakfast, meal)
Word frequencies in texts have a power-law distribution:
…small number of very frequent words
…big number of low frequency words
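A minimal sketch of how the frequency distribution can be inspected. The sample sentence and the function name `word_frequencies` are illustrative assumptions, not part of the slides.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each lowercased word occurs in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

# Hypothetical sample text, chosen only to show the skewed distribution.
sample = "the cat sat on the mat and the dog sat on the log"
freqs = word_frequencies(sample)

# Words ranked from most to least frequent.
ranked = freqs.most_common()

# Words that occur exactly once (the long tail of rare words).
rare = [w for w, c in freqs.items() if c == 1]
```

Even in this tiny sample one word dominates the counts while most words occur once.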
Stop-words : Stop-words are words that, from a non-linguistic point of view, do not carry information
…they have a mainly functional role
…usually we remove them to help the methods perform better
Natural language dependent – examples:
a, about, above, across, after, again, against, all, almost, alone, along, already, also, ...
Slide 9 : Original text
Information Systems Asia Web - provides research, IS-related commercial materials, interaction, and even research sponsorship by interested corporations with a focus on Asia Pacific region.
Survey of Information Retrieval - guide to IR, with an emphasis on web-based projects. Includes a glossary, and pointers to interesting papers.
After the stop-words removal
Information Systems Asia Web provides research IS-related commercial materials interaction research sponsorship interested corporations focus Asia Pacific region
Survey Information Retrieval guide IR emphasis web-based projects Includes glossary pointers interesting papers
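The removal step can be sketched as a simple filter. The stop-word set starts with the words listed on the stop-words slide. The extra entries (an, and, of, on, to, with) and the tokenizer pattern are assumptions added so the example covers the second sentence above.

```python
import re

# First group: from the slide. Second group: assumed additions for this example.
STOP_WORDS = {
    "a", "about", "above", "across", "after", "again", "against", "all",
    "almost", "alone", "along", "already", "also",
    "an", "and", "of", "on", "to", "with",
}

def remove_stop_words(text):
    """Keep word tokens (hyphenated words stay whole) that are not stop-words."""
    tokens = re.findall(r"[A-Za-z][A-Za-z-]*", text)
    return " ".join(t for t in tokens if t.lower() not in STOP_WORDS)

original = ("Survey of Information Retrieval - guide to IR, with an emphasis on "
            "web-based projects. Includes a glossary, and pointers to interesting papers.")
cleaned = remove_stop_words(original)
```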
Stemming (I) : Different forms of the same word are usually problematic for text analysis, because they have different spelling and similar meaning (e.g. learns, learned, learning,…)
Stemming is a process of transforming a word into its stem (normalized form)
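A toy suffix-stripping sketch of the idea. The suffix list and the minimum stem length are assumptions made for illustration. Practical stemmers use far more elaborate rule sets.

```python
# Assumed suffixes, checked longest first.
SUFFIXES = ("ing", "ed", "s")

def stem(word, min_stem_len=3):
    """Strip the first matching suffix if a stem of reasonable length remains."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem_len:
            return w[:-len(suffix)]
    return w

# All three forms from the slide map to the same stem.
stems = {stem(w) for w in ["learns", "learned", "learning"]}
```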
Phrases in the form of frequent N-Grams : A simple way of generating phrases is to use frequent n-grams:
An N-Gram is a sequence of n consecutive words (e.g. “machine learning” is a 2-gram)
“Frequent n-grams” are the ones which appear in the observed documents MinFreq or more times
N-grams are interesting because of the simple and efficient dynamic programming algorithm:
Given:
Set of documents (each document is a sequence of words),
MinFreq (minimal n-gram frequency),
MaxNGramSize (maximal n-gram length)
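One possible level-wise sketch using these three inputs. The slides do not spell out the algorithm, so the pruning rule (an n-gram is counted only if both of its (n-1)-gram parts were frequent) and the function name are assumptions.

```python
from collections import Counter

def frequent_ngrams(documents, min_freq, max_ngram_size):
    """Return {ngram tuple: count} for n-grams occurring at least min_freq times."""
    frequent = {}
    previous = set()  # frequent (n-1)-grams from the previous level
    for n in range(1, max_ngram_size + 1):
        counts = Counter()
        for words in documents:
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                # Skip candidates whose shorter parts were not frequent.
                if n > 1 and (gram[:-1] not in previous or gram[1:] not in previous):
                    continue
                counts[gram] += 1
        previous = {g for g, c in counts.items() if c >= min_freq}
        if not previous:
            break  # no longer n-gram can be frequent
        for g in previous:
            frequent[g] = counts[g]
    return frequent

# Hypothetical mini-collection, each document already split into words.
docs = [
    "information retrieval group".split(),
    "centre information retrieval".split(),
    "survey information retrieval guide".split(),
]
result = frequent_ngrams(docs, min_freq=2, max_ngram_size=3)
```

Each level reuses the result of the previous one, so long candidates are counted only when their parts already passed MinFreq.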
Slide 12 : Original text on the Yahoo Web page:
1.Top:Reference:Libraries:Library and Information Science:Information Retrieval
2.UK Only
3.Idomeneus - IR & DB repository - These pages mostly contain IR related resources such as test collections, stop lists, stemming algorithms, and links to other IR sites.
4.University of Glasgow - Information Retrieval Group - information on the resources and people in the Glasgow IR group.
5.Centre for Intelligent Information Retrieval (CIIR).
6.Information Systems Asia Web - provides research, IS-related commercial materials, interaction, and even research sponsorship by interested corporations with a focus on Asia Pacific region.
7.Seminar on Cataloging Digital Documents
8.Survey of Information Retrieval - guide to IR, with an emphasis on web-based projects. Includes a glossary, and pointers to interesting papers.
9.University of Dortmund - Information Retrieval Group
Document represented by n-grams:
1."REFERENCE LIBRARIES LIBRARY INFORMATION SCIENCE (#3 LIBRARY INFORMATION SCIENCE) INFORMATION RETRIEVAL (#2 INFORMATION RETRIEVAL)"
2."UK"
3."IR PAGES IR RELATED RESOURCES COLLECTIONS LISTS LINKS IR SITES"
4."UNIVERSITY GLASGOW INFORMATION RETRIEVAL (#2 INFORMATION RETRIEVAL) GROUP INFORMATION RESOURCES (#2 INFORMATION RESOURCES) PEOPLE GLASGOW IR GROUP"
5."CENTRE INFORMATION RETRIEVAL (#2 INFORMATION RETRIEVAL)"
6."INFORMATION SYSTEMS ASIA WEB RESEARCH COMMERCIAL MATERIALS RESEARCH ASIA PACIFIC REGION"
7."CATALOGING DIGITAL DOCUMENTS"
8."INFORMATION RETRIEVAL (#2 INFORMATION RETRIEVAL) GUIDE IR EMPHASIS INCLUDES GLOSSARY INTERESTING"
9."UNIVERSITY INFORMATION RETRIEVAL (#2 INFORMATION RETRIEVAL) GROUP"
Word relationships – lexical relations : A database of word relationships is a well developed and effective resource
…it consists of 4 classes (nouns, verbs, adjectives, and adverbs)
Each class consists of sense entries, each consisting of a set of synonyms, e.g.:
musician, instrumentalist, player
person, individual, someone
life form, organism, being
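A minimal sketch of sense entries as a data structure, using the three example entries above. The list layout and the `synonyms` helper are illustrative assumptions, not the format of any particular lexical database.

```python
# Each sense entry is a set of synonyms (entries taken from the slide).
SENSE_ENTRIES = [
    {"musician", "instrumentalist", "player"},
    {"person", "individual", "someone"},
    {"life form", "organism", "being"},
]

def synonyms(word):
    """Collect all other words that share a sense entry with the given word."""
    found = set()
    for entry in SENSE_ENTRIES:
        if word in entry:
            found |= entry - {word}
    return found
```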
Word relationships : Each word entry is connected with another.
Relations in the database of nouns:
Sentence Level
Document Level : Summarization
Single Document Visualization
Text Segmentation
Summarization : Task: produce a shorter, summary version of an original document.
Two main approaches to the problem:
Knowledge rich – performing semantic analysis, representing the meaning and generating the text satisfying length restriction
Selection based
Selection based summarization : Main phases:
Analyzing the source text
Determining its important points
Slide 18 : Example of selection based approach from MS Word (figure labels: selected units, selection threshold)
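A rough sketch of selecting units with a threshold. The scoring rule (average corpus frequency of a sentence's words) and the relative threshold are assumptions chosen for illustration. The slides do not state how MS Word scores sentences.

```python
import re
from collections import Counter

def select_sentences(text, threshold=0.8):
    """Keep sentences whose score is at least threshold * best score."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words_of = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    # Analyze the source text: word frequencies over the whole document.
    freq = Counter(w for words in words_of for w in words)
    # Determine important points: score each sentence by its average word frequency.
    scores = [sum(freq[w] for w in words) / len(words) if words else 0.0
              for words in words_of]
    best = max(scores, default=0.0)
    return [s for s, score in zip(sentences, scores) if score >= threshold * best]

# Hypothetical three-sentence document.
doc = ("Text mining finds patterns in text. The weather was nice. "
       "Text mining uses text statistics.")
summary = select_sentences(doc)
```

Lowering `threshold` keeps more sentences, which mirrors moving the selection threshold in the figure.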
Visualization of a single document
Why visualization of a single document is hard? : Visualizing big texts is an easier task because of the large amount of information:
...statistics already starts working
...most known approaches are statistics based
Visualization of a single (possibly short) document is a much harder task because:
...we cannot count on statistical properties of the text (lack of data)
...we must rely on syntactical and logical structure of the document
Simple approach : The text is split into sentences.
Anaphora resolution is performed on all sentences
...all ‘he’, ‘she’, ‘they’, ‘him’, ‘his’, ‘her’, etc. references to objects are replaced by their proper names
From all the sentences we extract:
Subject-Predicate-Object triples (SPO)
Text Segmentation : Problem: divide text that has no given structure into segments with similar content
Example applications:
topic tracking in news (spoken news)
identification of topics in large, unstructured text
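The slides do not name a segmentation method. As one assumed heuristic, a boundary can be placed wherever two neighbouring sentences share no content words. The word-length filter used as a crude stop-word substitute is also an assumption.

```python
import re

def content_words(sentence):
    """Lowercased words longer than three letters (crude stop-word filter)."""
    return {w for w in re.findall(r"[a-z]+", sentence.lower()) if len(w) > 3}

def segment(sentences, min_overlap=1):
    """Group consecutive sentences that share at least min_overlap content words."""
    if not sentences:
        return []
    segments = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        shared = content_words(prev) & content_words(cur)
        if len(shared) >= min_overlap:
            segments[-1].append(cur)   # same topic continues
        else:
            segments.append([cur])     # topic boundary
    return segments

# Hypothetical news snippets on two topics.
news = [
    "The election results were announced today.",
    "Voters said the election was fair.",
    "The storm hit the coast overnight.",
    "Coast guards warned about the storm.",
]
parts = segment(news)
```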
Visualization
Why text visualization? : ...to have a top level view of the topics
...to see relationships between the topics
...to understand better what’s going on
...to view highly structured nature of textual contents in a simplified way
How do we extract keywords? : Characteristic keywords for a group of documents are the most highly weighted words in the center of the cluster
...center of the cluster could be understood as an “average document” for specific group of documents
...efficient solution
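A minimal sketch of the "average document" idea. Weighting words by relative term frequency is an assumption made here for simplicity. The slides do not say which weighting scheme is used.

```python
from collections import Counter

def centroid_keywords(documents, top_k=3):
    """Average the per-document word weights and return the top_k heaviest words."""
    centroid = Counter()
    for words in documents:
        if not words:
            continue
        counts = Counter(words)
        for word, count in counts.items():
            # Relative frequency in this document, averaged over the cluster.
            centroid[word] += count / len(words) / len(documents)
    ranked = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]

# Hypothetical cluster of three tokenized documents.
cluster = [
    ["text", "mining", "text"],
    ["text", "retrieval"],
    ["mining", "text", "data", "text"],
]
keywords = centroid_keywords(cluster, top_k=2)
```

The computation is a single pass over the cluster, which fits the "efficient solution" remark.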
Information Extraction : The mental process that one goes through when reading and analyzing documents of any kind. This is a process that we, to a varying degree, do automatically once we have the mental processes down.
What is “Information Extraction” : As a task: Filling in the slots from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Soft..
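A toy sketch of slot filling on excerpts of the news text above. The three hand-written regular expressions are assumptions tailored to these sentences. They only illustrate the NAME / TITLE / ORGANIZATION slots and are not a general extraction method.

```python
import re

# Abbreviated excerpts of the news text (shortened for the example).
TEXT = (
    'For years, Microsoft Corporation CEO Bill Gates railed against the economic '
    'philosophy of open-source software. '
    '"We love the concept of shared source," said Bill Veghte, a Microsoft VP. '
    'Richard Stallman, founder of the Free Software Foundation, countered saying'
)

# One hypothetical pattern per phrasing found in the text.
PATTERNS = [
    re.compile(r"(?P<org>[A-Z]\w+)(?: Corporation)? (?P<title>CEO) "
               r"(?P<name>[A-Z]\w+ [A-Z]\w+)"),
    re.compile(r"(?P<name>[A-Z]\w+ [A-Z]\w+), a (?P<org>[A-Z]\w+) (?P<title>VP)"),
    re.compile(r"(?P<name>[A-Z]\w+ [A-Z]\w+), (?P<title>founder) of the "
               r"(?P<org>(?:[A-Z]\w+ ?)+)"),
]

def fill_slots(text):
    """Return (name, title, organization) rows for every pattern match."""
    rows = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            rows.append((m.group("name"), m.group("title"), m.group("org").strip()))
    return rows

rows = fill_slots(TEXT)
```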
What is “Information Extraction” : As a family of techniques:
Information Extraction = segmentation + classification + clustering + association
Segments found in the news text above (aka “named entity extraction”):
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
Levels of Text Processing : Question-Answering
Question Answering : QA systems return short and accurate replies to well-formed natural language questions such as:
What is the height of Mount Everest?
After which animal are the Canary Islands named?
How many liters are there in a gallon?
QA systems can be classified into levels of sophistication:
Slot-filling – easy questions
Question Answering Example : Example question and answer:
Q: What is the color of grass?
A: Green.
…the answer may come from the document saying: “grass is green” without mentioning “color”
hypernym hierarchy:
green, chromatic color, color, visual property, property
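A minimal sketch of how the hypernym hierarchy can connect the question to the sentence "grass is green". The dictionaries and function names are illustrative assumptions. Only the hypernym chain itself comes from the slide.

```python
# Hypernym chain from the slide: each word points to its more general term.
HYPERNYM = {
    "green": "chromatic color",
    "chromatic color": "color",
    "color": "visual property",
    "visual property": "property",
}

# Fact assumed to be extracted from the document sentence "grass is green".
FACTS = {"grass": ["green"]}

def hypernym_chain(word):
    """Return the word followed by all of its increasingly general hypernyms."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def answer(attribute, subject):
    """Find a known value of the subject that is a kind of the asked attribute."""
    for value in FACTS.get(subject, []):
        if attribute in hypernym_chain(value):
            return value
    return None

reply = answer("color", "grass")
```

The word "color" never appears in the fact. The match is made because "color" is among the hypernyms of "green".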