Getting Out of the Tool Box: Text and Data Mining

Description

Getting Out of the Tool Box: Text and Data Mining for the Humanities. Matthew G. Kirschenbaum, University of Maryland. MLA, Philadelphia, 2004.


Presentation Transcript

Getting Out of the Tool Box: Text and Data Mining for the Humanities :
Matthew G. Kirschenbaum
University of Maryland
MLA, Philadelphia 2004

Cow Tools : “Inevitably I began to think about cows, and what if they, too, were discovered as toolmakers. What would they make? . . . I imagined, and subsequently drew, a cow standing next to her workbench, proudly displaying her handiwork (hoofiwork?). The ‘cow tools’ were supposed to be just meaningless artifacts—only the cow, or a cowthropologist, is supposed to know what they’re used for.” --Gary Larson

The Humanist As Tool-Using Primate :
Mechanical: Hinman Collator; Magnifying Glass; Camera, Light Box
Inscriptive: Pencil, Typewriter; Textual Editions, Concordances, Bibliographies
Indexical: Note Cards, Vertical Files, Card Catalog
Archival: Library

Various Domains of Digital Tools :
Text Analysis and Linguistic Corpora: Index Thomisticus, TUSTEP, TACT, Collate, TAPOR
Word Processing and Desktop Publishing
Email, Newsgroups, MUDs/MOOs, Blogs
Electronic Editions and Archives: Descriptive Markup and/or High-Resolution Images
Computer Modeling: CAD and Virtual Reality
Games and Speculative Computing: IVANHOE, Temporal Modeling

Pattern Recognition I

Pattern Recognition II : “He understood how much it meant to him, the roll and flip of data on a screen. He studied the figural diagrams that brought organic patterns into play, birdwing and chambered shell. It was shallow thinking to maintain that numbers and charts were the cold compression of unruly human energies, every sort of yearning and midnight sweat reduced to lucid units in the financial markets. In fact, data itself was soulful and glowing, a dynamic aspect of the life process. This was the eloquence of alphabets and numeric systems, now fully realized in electronic form, in the zero-oneness of the world, the digital imperative that defined every breath of the planet’s living billions. Here was the heave of the biosphere. Our bodies and oceans were here, knowable and whole.” --Don DeLillo, Cosmopolis (2003)

Pattern Recognition III

Data Mining I :
Data mining is not a search
It’s not an advanced search
It’s not a really good or a really fast or a really big search
Look, forget searching!

Data Mining II :
“The semi-automated discovery of trends and patterns across very large datasets” (Hearst 3).
Archetype: Don Swanson’s association of magnesium deficiency with migraine headaches by mining biomedical literature in the 1980s
“A new, potentially plausible medical hypothesis was derived from a combination of text fragments and the explorer’s medical expertise” (Hearst 6).
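One way to picture the Swanson example is as a search for indirect links: two terms that never appear in the same article but share intermediate terms. The sketch below is only an illustration of that idea, not Swanson's actual procedure; the article records, the term lists, and the function names `cooccurrence` and `indirect_links` are all invented for this example.

```python
from collections import defaultdict
from itertools import combinations

# Toy "literature": each set holds the key terms of one article.
# These records are invented for illustration, not real citation data.
ARTICLES = [
    {"migraine", "spreading depression"},
    {"migraine", "platelet aggregation"},
    {"magnesium", "spreading depression"},
    {"magnesium", "platelet aggregation"},
    {"migraine", "stress"},
]

def cooccurrence(articles):
    """Map each term to the set of terms it shares an article with."""
    links = defaultdict(set)
    for terms in articles:
        for a, b in combinations(sorted(terms), 2):
            links[a].add(b)
            links[b].add(a)
    return links

def indirect_links(articles, start):
    """Return {candidate: bridging terms} for terms that never
    co-occur with `start` directly but share a neighbour with it."""
    links = cooccurrence(articles)
    direct = links[start]
    candidates = defaultdict(set)
    for bridge in direct:
        for other in links[bridge]:
            if other != start and other not in direct:
                candidates[other].add(bridge)
    return dict(candidates)

hypotheses = indirect_links(ARTICLES, "migraine")
```

Here `hypotheses` pairs "magnesium" with the two bridging terms it shares with "migraine". The output is a list of candidates only; as the Hearst quotation stresses, judging whether a candidate is plausible still takes the explorer's own expertise.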

Text Mining I :
Differs from data mining in that the source data is typically unstructured, i.e., natural language
Email spam filtering and wire news feeds are typical applications
GATE (General Architecture for Text Engineering) can split sentences, identify part of speech, and perform other pre-processing tasks
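The pre-processing tasks mentioned above can be sketched in a few lines. This is a deliberately naive stand-in written with Python's standard `re` module; it does not use GATE or imitate its API, and the function names and sample text are assumptions made for the example. Part-of-speech tagging is left out because it needs a trained model.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.
    Real splitters also handle abbreviations, quotations, and so on."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Lowercased word tokens; an internal apostrophe stays in the word."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())

# Invented sample text.
sample = "The cow made a tool. Was it a saw? Nobody knows!"
sentences = split_sentences(sample)
tokens = [tokenize(s) for s in sentences]
```

After this step the unstructured string has become a list of sentences, each a list of tokens, which is the kind of structure later mining stages can count, cluster, or classify.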

Text Mining II

Visualization to Communicate Results : “The use of computer-supported, interactive, visual representations of abstract data to amplify cognition.” The literature also frequently mentions the vast quantities of data with which researchers must contend, pattern recognition as a primary heuristic, and the potential for generating multiple and dynamic visual states. Since the rise of high-performance graphical workstations in the mid-1980s, visualization has become increasingly commonplace in most major scientific research fields, for example astronomy, biology, chemistry, economics, engineering, environmental sciences and geology, geography, meteorology, physics, and mathematics.

NORA Project :
$600,000 over two years from the Andrew W. Mellon Foundation
Directed by John Unsworth, Dean of the Graduate School of Library and Information Science, UIUC
Involves individual developers and research centers at Georgia, Maryland, and Virginia
Brings together humanists, library and information specialists, and computer scientists

NORA’s Initial Content Domain :
About 5 GB of 18th- and 19th-century British and American literature
The Library at the University of North Carolina at Chapel Hill will contribute over 1,000 texts, mostly from the 19th century.
The Library at the University of Virginia will contribute 600 to 1,200 texts from the Early American Fiction project.
The Institute for Advanced Technology in the Humanities at the University of Virginia will contribute about 6,000 texts from its projects.
The University of California at Davis will contribute about 120 texts of 19th-century British women poets.
The University of Michigan will provide 175 volumes of American verse, plus literary materials from other collections, such as the Making of America journals, which include titles such as the Southern Literary Messenger, Ladies Repository, Appleton's, and Vanity Fair.
Indiana University will provide over 1,100 literary texts from the Wright American Fiction collection, the Victorian Women Writers Project, and a Swinburne project.
Brown University's Women Writers Project will contribute 40 literary texts from the 19th century.
The Perseus project will contribute its 19th-century literary texts, including several works by Charles Dickens.

NORA’s Guiding Principles :
Web-deliverable services
Domain-specific visualizations
Visualizations as interface
Realize the potential of joining over ten years of digital library and archive building with state-of-the-art text analysis tools
Text mining on structured textual data, specifically XML/SGML
Tamarind as data repository (same principle as a search engine’s indices)
User-centered design
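The comparison to a search engine's indices can be made concrete with a small inverted index built over XML. This is a sketch of the general principle only and makes no claim about how Tamarind is actually designed; the two sample documents, their element names, and the `build_index` function are invented for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two tiny invented documents; real encoded texts are far richer.
DOCS = {
    "poem1": "<text><title>Night</title>"
             "<l>The moon is bright</l><l>The night is long</l></text>",
    "poem2": "<text><title>Morning</title><l>The sun is bright</l></text>",
}

def build_index(docs):
    """Map each word to the set of (document id, element tag) pairs
    in which it occurs, so lookups never rescan the source texts."""
    index = defaultdict(set)
    for doc_id, xml in docs.items():
        root = ET.fromstring(xml)
        for elem in root.iter():
            # Only the text directly inside each element is indexed here.
            for word in re.findall(r"[a-z]+", (elem.text or "").lower()):
                index[word].add((doc_id, elem.tag))
    return index

index = build_index(DOCS)
```

Because the markup is kept in the index, a query can distinguish "night" in a title from "night" in a verse line, which is the advantage of mining structured rather than plain text.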

D2K/T2K (NCSA ALG) :
General purpose data mining architecture
Offers users a visual programming environment for building data flows (“itineraries”) out of specific data mining applications (“modules”); visualizations as output
Modules are fully programmable and extensible via a standard API
T2K: library of D2K modules for text analysis (document clustering and classification)
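The module and itinerary idea can be approximated as a chain of small functions, each consuming the previous one's output. The sketch below only illustrates that data-flow pattern; it does not use or reproduce the D2K API, and every name in it (`lowercase`, `tokenize`, `count_terms`, `run_itinerary`) is hypothetical.

```python
from collections import Counter

# Each "module" is a plain function with one input and one output.
def lowercase(text):
    return text.lower()

def tokenize(text):
    return text.split()

def count_terms(tokens):
    return Counter(tokens)

def run_itinerary(modules, data):
    """Pass the data through each module in order, like a data flow."""
    for module in modules:
        data = module(data)
    return data

# An "itinerary" is an ordered list of modules.
itinerary = [lowercase, tokenize, count_terms]
result = run_itinerary(itinerary, "The cow saw the saw")
```

Swapping a module in or out changes the analysis without touching the rest of the flow, which is the appeal of building itineraries from reusable parts.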

NORA’s Audiences and Objectives :
Scholarly research: assist in the discovery of new knowledge
  Remember that the final stage of a data mining operation is evaluation and subsequent investigation by a human authority, so we’re not talking about having a computer prove T.S. Eliot was a major influence on Shakespeare.
Classroom: new pedagogies, new literacies; introduce the next generation of humanists to digital tools
  Give students a textual landscape to play in; what do they discover? What do they learn?
  Alternative to the term paper: apply tools of literary criticism and scholarship.

NORA as a Prime Instance of Graphesis : “Graphesis is concerned with the study of visual epistemology as a dynamic, subjective process. It takes as its objects of study the history of visual forms, graphical expressions, and the concepts they embody within a social, cultural history. It seeks to expose and describe the principles for structuring knowledge through graphical form. It examines imaging technologies as instruments whose inscriptional characteristics register informationally, and also seeks to discover the ways various typologies of form have structured systems of graphical communication, artificial vision, and computational modeling of information in graphical display. Finally, graphesis is concerned with the creation of methods of interpretation that are generative and iterative, capable of producing new knowledge through the aesthetic provocation of graphical expressions.” --Johanna Drucker

Cow Tools Redux : “The . . . mistake I made was making one of the tools resemble a crude hacksaw.” --Gary Larson

Final Thoughts :
We will get out of the toolbox when we stop designing “cow tools”: tools that rely on superficial metaphors for their functionality, for example a “Lightbox”; getting out of the tool “box” means taking advantage of the computer’s native capacity for interaction, iteration, emergence, and pattern recognition
Text mining and visualization ask the humanities to acknowledge a shift from documentary to algorithmic forms of evidence
Visualization (graphesis) will be an essential scholarly genre for the 21st century

Interested in Working with NORA? :
Contact me at mgk@umd.edu
Watch for news and announcements
Check for our Web site at (probably) www.noraproject.org
