
TEXT Mining


Description
When writing, one has to consider a couple of factors: who will be reading the document and how they will read it.

Presentation Transcript

Text-Mining : The process used when reading detailed text, explanatory documents and financial reports.

What is Text-Mining? :
- Finding interesting regularities in large textual data, where interesting means: non-trivial, hidden, previously unknown and potentially useful
- Finding semantic and abstract information from the surface form of textual data

Which areas are active in Text Processing? :
- Natural Language Processing
- Information Retrieval
- Text Processing
- Knowledge Representation & Reasoning

Why Text is Tough? (M. Hearst 97) :
- Abstract concepts are difficult to understand
- “Countless” combinations of subtle, abstract relationships among concepts
- Many ways to represent similar concepts, e.g. space ship, flying saucer, UFO
- Concepts are difficult to visualize

Why Text is Easy? (M. Hearst 97) :
- Highly redundant
  …most methods count on this property
- Just about any simple system can get “good” results for simple tasks:
  - Pull out “important” phrases
  - Find “meaningfully” related words
  - Mentally create some sort of summary from documents

Levels of Text Processing : Word Level
- Words Properties
- Stop-Words
- Stemming
- Frequent N-Grams
- Thesaurus

Words Properties : Relations among word surface forms and their senses:
- Homonymy: same form, but different meaning (e.g. bank: river bank, financial institution)
- Polysemy: same form, related meaning (e.g. bank: blood bank, financial institution)
- Synonymy: different form, same meaning (e.g. singer, vocalist)
- Hyponymy: one word denotes a subclass of another (e.g. breakfast, meal)
Word frequencies in texts follow a power-law distribution:
…a small number of very frequent words
…a large number of low-frequency words

Stop-words : Stop-words are words that, from a non-linguistic view, do not carry information
…they have a mainly functional role
…usually we remove them to help the methods perform better
Stop-word lists depend on the natural language. Examples for English: a, about, above, across, after, again, against, all, almost, alone, along, already, also, ...

Slide 9 :
Original text:
Information Systems Asia Web - provides research, IS-related commercial materials, interaction, and even research sponsorship by interested corporations with a focus on Asia Pacific region.
Survey of Information Retrieval - guide to IR, with an emphasis on web-based projects. Includes a glossary, and pointers to interesting papers.
After the stop-words removal:
Information Systems Asia Web provides research IS-related commercial materials interaction research sponsorship interested corporations focus Asia Pacific region
Survey Information Retrieval guide IR emphasis web-based projects Includes glossary pointers interesting papers
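The removal step shown on this slide can be sketched in a few lines of Python. This is only an illustration: the stop-word list below is a tiny hand-picked set (an assumption, real lists are much longer), and the function name `remove_stop_words` is hypothetical.

```python
# A tiny, illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "by", "even", "of", "on", "the", "to", "with"}


def remove_stop_words(text):
    """Return the tokens of `text` that are not stop-words."""
    # Split on whitespace, then trim punctuation from both ends of each token.
    tokens = [token.strip(".,;:-") for token in text.split()]
    # Drop empty tokens (e.g. a lone "-") and anything in the stop-word list.
    return [t for t in tokens if t and t.lower() not in STOP_WORDS]


sample = ("Survey of Information Retrieval - guide to IR, "
          "with an emphasis on web-based projects.")
print(" ".join(remove_stop_words(sample)))
# prints: Survey Information Retrieval guide IR emphasis web-based projects
```

The matching is case-insensitive, and hyphenated words such as "web-based" are kept whole because punctuation is only trimmed at the ends of a token.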

Stemming (I) : Different forms of the same word are usually problematic for text analysis, because they have different spellings but similar meaning (e.g. learns, learned, learning, …). Stemming is the process of transforming a word into its stem (normalized form).
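To make the idea concrete, here is a deliberately naive suffix-stripping sketch in Python. It is not the algorithm from the presentation and not a real stemmer such as Porter's; the suffix list and the minimum stem length of 3 are assumptions chosen only so that the slide's example words collapse to one form.

```python
# Suffixes are tried in this order; the list is an illustrative assumption.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip one common English suffix, keeping at least 3 characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


print([naive_stem(w) for w in ["learns", "learned", "learning"]])
# prints: ['learn', 'learn', 'learn']
```

A rule set this small will over- and under-stem many words, which is why practical systems rely on carefully designed rule-based or dictionary-based stemmers.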

Phrases in the form of frequent N-Grams : A simple way of generating phrases is to use frequent n-grams:
- An N-Gram is a sequence of n consecutive words (e.g. “machine learning” is a 2-gram)
- “Frequent n-grams” are the ones which appear at least MinFreq times across all observed documents
N-grams are interesting because of the simple and efficient dynamic programming algorithm:
Given: Set of documents (each document is a sequence of words), MinFreq (minimal n-gram frequency), MaxNGramSize (maximal n-gram length)
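The slide names the inputs but not the steps of the algorithm, so the following Python sketch is one plausible reading rather than the presenter's exact method. It builds n-grams level by level and only counts an n-gram if both of its (n-1)-gram parts were already frequent. The names `frequent_ngrams`, `min_freq` and `max_ngram_size` mirror the slide's MinFreq and MaxNGramSize but are otherwise hypothetical.

```python
from collections import Counter


def frequent_ngrams(documents, min_freq, max_ngram_size):
    """Return {ngram_tuple: count} for n-grams seen at least min_freq times."""
    frequent = {}
    previous = None  # frequent (n-1)-grams found in the previous pass
    for n in range(1, max_ngram_size + 1):
        counts = Counter()
        for words in documents:
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                # An n-gram can only be frequent if both of its (n-1)-gram
                # parts are frequent, so the others are skipped uncounted.
                if previous is not None and (
                    gram[:-1] not in previous or gram[1:] not in previous
                ):
                    continue
                counts[gram] += 1
        previous = {g: c for g, c in counts.items() if c >= min_freq}
        if not previous:
            break  # no frequent n-grams of this size, so none of larger size
        frequent.update(previous)
    return frequent


docs = [
    "information retrieval group".split(),
    "survey of information retrieval".split(),
    "centre for intelligent information retrieval".split(),
]
result = frequent_ngrams(docs, min_freq=2, max_ngram_size=3)
print(result)
```

On these three toy documents only "information", "retrieval" and the 2-gram "information retrieval" reach the threshold, each with a count of 3.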

Slide 12 :
Original text on the Yahoo Web page:
1. Top:Reference:Libraries:Library and Information Science:Information Retrieval
2. UK Only
3. Idomeneus - IR & DB repository - These pages mostly contain IR related resources such as test collections, stop lists, stemming algorithms, and links to other IR sites.
4. University of Glasgow - Information Retrieval Group - information on the resources and people in the Glasgow IR group.
5. Centre for Intelligent Information Retrieval (CIIR).
6. Information Systems Asia Web - provides research, IS-related commercial materials, interaction, and even research sponsorship by interested corporations with a focus on Asia Pacific region.
7. Seminar on Cataloging Digital Documents
8. Survey of Information Retrieval - guide to IR, with an emphasis on web-based projects. Includes a glossary, and pointers to interesting papers.
9. University of Dortmund - Information Retrieval Group
Document represented by n-grams:
1. "REFERENCE LIBRARIES LIBRARY INFORMATION SCIENCE (#3 LIBRARY INFORMATION SCIENCE) INFORMATION RETRIEVAL (#2 INFORMATION RETRIEVAL)"
2. "UK"
3. "IR PAGES IR RELATED RESOURCES COLLECTIONS LISTS LINKS IR SITES"
4. "UNIVERSITY GLASGOW INFORMATION RETRIEVAL (#2 INFORMATION RETRIEVAL) GROUP INFORMATION RESOURCES (#2 INFORMATION RESOURCES) PEOPLE GLASGOW IR GROUP"
5. "CENTRE INFORMATION RETRIEVAL (#2 INFORMATION RETRIEVAL)"
6. "INFORMATION SYSTEMS ASIA WEB RESEARCH COMMERCIAL MATERIALS RESEARCH ASIA PACIFIC REGION"
7. "CATALOGING DIGITAL DOCUMENTS"
8. "INFORMATION RETRIEVAL (#2 INFORMATION RETRIEVAL) GUIDE IR EMPHASIS INCLUDES GLOSSARY INTERESTING"
9. "UNIVERSITY INFORMATION RETRIEVAL (#2 INFORMATION RETRIEVAL) GROUP"

Word relationships – lexical relations : Word relationships is a well-developed and effective …
…it consists of 4 classes (nouns, verbs, adjectives, and adverbs)
Each class consists of sense entries, each made up of a set of synonyms, e.g.:
- musician, instrumentalist, player
- person, individual, someone
- life form, organism, being

Word relationships : Each word entry is connected with another. Relations in the database of nouns:

Levels of Text Processing : Sentence Level and Document Level
- Sentence Level
- Document Level
  - Summarization
  - Single Document Visualization
  - Text Segmentation

Summarization : Task: produce a shorter, summary version of an original document. Two main approaches to the problem:
- Knowledge rich: performing semantic analysis, representing the meaning and generating text that satisfies the length restriction
- Selection based

Selection based summarization : Main phases:
- Analyzing the source text
- Determining its important points

Slide 18 : [Figure: example of the selection based approach from MS Word, showing the selected units and the selection threshold]
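The two phases listed above (analyzing the text and determining its important points) together with a selection threshold can be illustrated with a small Python sketch. The scoring rule used here, the average corpus frequency of a sentence's words, is an assumption made for the example; the slides do not say how units are scored, and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lower-case a sentence and return its alphabetic words."""
    return re.findall(r"[a-z]+", sentence.lower())


def summarize(text, threshold=0.5):
    """Keep sentences scoring at least `threshold` times the best score."""
    # Phase 1: analyze the source text (sentence split, word frequencies).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in tokenize(s))
    # Phase 2: score each sentence by the average frequency of its words.
    scores = []
    for s in sentences:
        words = tokenize(s)
        scores.append(sum(freq[w] for w in words) / len(words) if words else 0.0)
    best = max(scores, default=0.0)
    # Selection: keep the units whose score clears the threshold.
    return [s for s, sc in zip(sentences, scores)
            if best > 0 and sc >= threshold * best]


text = "Text mining finds patterns in text. Text mining uses statistics. Cats sleep."
print(summarize(text, threshold=0.8))
# prints: ['Text mining finds patterns in text.', 'Text mining uses statistics.']
```

Raising the threshold shortens the summary and lowering it keeps more of the original, which is the role the selection threshold plays in the figure.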

Visualization of a single document

Why is visualization of a single document hard? :
Visualizing a big text collection is an easier task because of the large amount of information:
...statistics already start working
...most known approaches are statistics based
Visualization of a single (possibly short) document is a much harder task because:
...we cannot count on statistical properties of the text (lack of data)
...we must rely on the syntactical and logical structure of the document

Simple approach :
- The text is split into sentences.
- Anaphora resolution is performed on all sentences
  ...all ‘he’, ‘she’, ‘they’, ‘him’, ‘his’, ‘her’, etc. references to objects are replaced by their proper names
- From all the sentences we extract Subject-Predicate-Object (SPO) triples

Text Segmentation

Text Segmentation : Problem: divide text that has no given structure into segments with similar content
Example applications:
- topic tracking in news (spoken news)
- identification of topics in large, unstructured text

Visualization

Why text visualization? :
...to have a top level view of the topics
...to see relationships between the topics
...to understand better what’s going on
...to view the highly structured nature of textual contents in a simplified way

How do we extract keywords? : Characteristic keywords for a group of documents are the most highly weighted words in the center of the cluster
...the center of the cluster can be understood as an “average document” for a specific group of documents
...efficient solution
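A rough Python sketch of the "average document" idea follows. The slides do not say how words are weighted, so the TF-IDF weighting used here is an assumption, as are the toy corpus and the function names `tfidf_vectors` and `centroid_keywords`.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {word: tf * idf} dictionary per tokenized document."""
    n_docs = len(documents)
    doc_freq = Counter(w for doc in documents for w in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append({w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf})
    return vectors


def centroid_keywords(vectors, top_k=3):
    """Average the vectors of a cluster and return its top-weighted words."""
    centroid = Counter()
    for vec in vectors:
        for word, weight in vec.items():
            centroid[word] += weight / len(vectors)
    return [word for word, _ in centroid.most_common(top_k)]


corpus = [
    "stemming reduces words to stems".split(),
    "stemming and stop words help retrieval".split(),
    "football match ends in draw".split(),
    "football fans watch the match".split(),
    "stock prices fall again".split(),
    "market and stock news".split(),
]
vectors = tfidf_vectors(corpus)
cluster = vectors[:2]  # assume the first two documents form one cluster
keywords = centroid_keywords(cluster, top_k=2)
print(keywords)
```

In this toy corpus the two words shared by both cluster documents, "stemming" and "words", end up with the highest centroid weights. The computation is a single pass over the cluster's vectors, which fits the slide's remark that this is an efficient solution.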

Information Extraction : The mental process that one goes through when reading and analyzing documents of any kind. It is a process that, to a varying degree, we perform automatically once we have the mental processes down.


What is “Information Extraction” : As a task: filling in the slots from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
Slots filled by IE:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
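As a toy illustration of slot filling, the Python sketch below uses three hand-written regular expressions, one per phrasing found in the news excerpt. The patterns, the short title list and the function name `extract_slots` are all assumptions made for this example; real information extraction systems are far more general, and the excerpt is abridged here.

```python
import re

# Building blocks (illustrative assumptions, tuned to the excerpt only).
NAME = r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"
TITLE = r"(?P<title>CEO|VP|founder)"

PATTERNS = [
    # "Microsoft Corporation CEO Bill Gates"
    re.compile(r"(?P<org>[A-Z][a-z]+)(?: Corporation)? " + TITLE + " " + NAME),
    # "Bill Veghte, a Microsoft VP"
    re.compile(NAME + r", an? (?P<org>[A-Z][a-z]+) " + TITLE),
    # "Richard Stallman, founder of the Free Software Foundation"
    re.compile(NAME + ", " + TITLE
               + r" of the (?P<org>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
]


def extract_slots(text):
    """Return (name, title, organization) tuples found by the patterns."""
    rows = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            rows.append((match.group("name"), match.group("title"),
                         match.group("org")))
    return rows


text = ('For years, Microsoft Corporation CEO Bill Gates railed against '
        'open-source software. "We love the concept of shared source," '
        'said Bill Veghte, a Microsoft VP. Richard Stallman, founder of '
        'the Free Software Foundation, countered saying...')
for row in extract_slots(text):
    print(row)
```

Each tuple corresponds to one row of the NAME / TITLE / ORGANIZATION table. Hand-written patterns like these break as soon as the wording changes, which is one reason the field treats extraction as a family of techniques.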

What is “Information Extraction” : As a family of techniques: Information Extraction = segmentation + classification + clustering + association
Applied to the same news excerpt, segmentation and classification (aka “named entity extraction”) pick out these text segments:
- Microsoft Corporation, CEO, Bill Gates
- Microsoft, Gates, Microsoft
- Bill Veghte, Microsoft, VP
- Richard Stallman, founder, Free Software Foundation


Levels of Text Processing : Question-Answering

Question Answering : QA systems return short and accurate replies to well-formed natural language questions such as:
- What is the height of Mount Everest?
- After which animal are the Canary Islands named?
- How many liters are there in a gallon?
QA systems can be classified into levels of sophistication:
- Slot-filling: easy questions

Question Answering Example : Example question and answer:
Q: What is the color of grass?
A: Green.
…the answer may come from a document saying “grass is green” without mentioning “color”
…hypernym hierarchy: green, chromatic color, color, visual property, property
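The role of the hypernym hierarchy can be shown with a few lines of Python. The chain is copied from the slide; the fact store, the function names `is_a` and `answer`, and the way the question is reduced to a (property, subject) pair are assumptions made purely for illustration.

```python
# Hypernym chain from the slide: each term maps to its more general term.
HYPERNYM = {
    "green": "chromatic color",
    "chromatic color": "color",
    "color": "visual property",
    "visual property": "property",
}


def is_a(word, category):
    """Return True if `category` is `word` itself or one of its hypernyms."""
    while word is not None:
        if word == category:
            return True
        word = HYPERNYM.get(word)
    return False


def answer(asked_property, subject, facts):
    """Find a stated value for `subject` that is a kind of `asked_property`."""
    for fact_subject, value in facts:
        if fact_subject == subject and is_a(value, asked_property):
            return value
    return None


# The fact extracted from a document saying "grass is green".
facts = [("grass", "green")]
print(answer("color", "grass", facts))
# prints: green
```

The document never mentions "color", yet the question is answered because walking up the hypernym chain from "green" reaches "color".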

ESSI-EDU
Communicating Effectively and Professionally