Getting Out of the Tool Box: Text and Data Mining for the Humanities
Matthew G. Kirschenbaum
University of Maryland
MLA, Philadelphia 2004
Cow Tools
“Inevitably I began to think about cows, and what if they, too, were discovered as toolmakers. What would they make? . . . I imagined, and subsequently drew, a cow standing next to her workbench, proudly displaying her handiwork (hoofiwork?). The ‘cow tools’ were supposed to be just meaningless artifacts—only the cow, or a cowthropologist, is supposed to know what they’re used for.”
–Gary Larson
The Humanist As Tool Using Primate
Mechanical
Hinman Collator
Magnifying Glass
Camera, Light Box
Inscriptive
Pencil, Typewriter
Textual
Editions, Concordances, Bibliographies
Indexical
Note Cards, Vertical Files, Card Catalog
Archival
Library
Various Domains of Digital Tools
Text Analysis and Linguistic Corpora
Index Thomisticus, TUSTEP, TACT, Collate, TAPOR
Word Processing and Desktop Publishing
Email, Newsgroups, MUDs/MOOs, Blogs
Electronic Editions and Archives
Descriptive Markup and/or High-Resolution Images
Computer Modeling
CAD and Virtual Reality
Games and Speculative Computing
IVANHOE, Temporal Modeling
Pattern Recognition I
Pattern Recognition II
“He understood how much it meant to him, the roll and flip of data on a screen. He studied the figural diagrams that brought organic patterns into play, birdwing and chambered shell. It was shallow thinking to maintain that numbers and charts were the cold compression of unruly human energies, every sort of yearning and midnight sweat reduced to lucid units in the financial markets. In fact, data itself was soulful and glowing, a dynamic aspect of the life process. This was the eloquence of alphabets and numeric systems, now fully realized in electronic form, in the zero-oneness of the world, the digital imperative that defined every breath of the planet’s living billions. Here was the heave of the biosphere. Our bodies and oceans were here, knowable and whole.”
--Don DeLillo, Cosmopolis (2003)
Pattern Recognition III
Data Mining I
Data mining is not a search
It’s not an advanced search
It’s not a really good or a really fast or a really big search
Look, forget searching!
Data Mining II
“The semi-automated discovery of trends and patterns across very large datasets” (Hearst 3).
Archetype: Don Swanson’s association of magnesium deficiency with migraine headaches by mining bio-medical literature in the 1980s
“A new, potentially plausible medical hypothesis was derived from a combination of text fragments and the explorer’s medical expertise” (Hearst 6).
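The kind of discovery described on this slide can be illustrated with a toy sketch. It assumes that each article has already been reduced to a set of terms, and it looks for terms that never appear alongside a starting term but are connected to it through shared intermediate terms. The records, the function names, and the ranking rule are all invented for illustration; this is not Swanson's procedure or data, and the output is only a list of candidates for an expert to evaluate.

```python
from collections import defaultdict
from itertools import combinations

# Toy "literature": each record is the set of terms one article mentions.
# These records are invented for illustration only.
ARTICLES = [
    {"migraine", "spreading depression"},
    {"migraine", "platelet aggregation"},
    {"magnesium", "spreading depression"},
    {"magnesium", "platelet aggregation"},
    {"migraine", "stress"},
    {"caffeine", "stress"},
]


def cooccurrence(articles):
    """Map each term to the set of terms it shares an article with."""
    links = defaultdict(set)
    for terms in articles:
        for a, b in combinations(sorted(terms), 2):
            links[a].add(b)
            links[b].add(a)
    return links


def hidden_links(articles, start):
    """Rank terms never seen with `start` by the number of bridging terms."""
    links = cooccurrence(articles)
    candidates = defaultdict(set)
    for bridge in links[start]:
        for other in links[bridge]:
            if other != start and other not in links[start]:
                candidates[other].add(bridge)
    # Most bridges first; ties broken alphabetically.
    return sorted(candidates.items(), key=lambda kv: (-len(kv[1]), kv[0]))


if __name__ == "__main__":
    for term, bridges in hidden_links(ARTICLES, "migraine"):
        print(term, "via", sorted(bridges))
```

The ranking is deliberately crude. It only surfaces connections; deciding whether a connection is a plausible hypothesis remains the human explorer's job.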
Text Mining I
Differs from data mining in that the source data is typically unstructured, i.e., natural language
Email spam filtering and wire news feeds are typical applications
GATE (General Architecture for Text Engineering) can split sentences, identify part of speech, and perform other pre-processing tasks
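As a rough idea of what such pre-processing involves, here is a minimal sketch of sentence splitting and tokenization using only Python's standard library. It is not GATE and does not use GATE's API; the function names are hypothetical, and the rules are far cruder than a real splitter (an abbreviation such as "Mr." would be treated as a sentence end). Part-of-speech tagging is omitted because it requires a trained model.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase word tokens, keeping internal apostrophes (e.g. "don't")."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", sentence.lower())


def term_counts(text):
    """Count tokens across all sentences of a text."""
    counts = Counter()
    for sentence in split_sentences(text):
        counts.update(tokenize(sentence))
    return counts


if __name__ == "__main__":
    sample = "Call me Ishmael. Some years ago I went to sea! Why?"
    print(split_sentences(sample))
    print(term_counts(sample).most_common(3))
```

Counts like these are the usual raw material for later steps such as classification or clustering.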
Text Mining II
Visualization to Communicate Results
“The use of computer-supported, interactive, visual representations of abstract data to amplify cognition.”
Often in the literature there is also mention of vast quantities of data with which the researchers must contend, pattern recognition as a primary heuristic, and the potential for generating multiple and dynamic visual states.
Since the rise of high-performance graphical workstations in the mid-1980s, visualization has been increasingly commonplace in most major scientific research fields: astronomy, biology, chemistry, economics, engineering, environmental sciences and geology, geography, meteorology, physics, and mathematics, for example.
NORA Project
$600,000 over two years from the Andrew W. Mellon Foundation
Directed by John Unsworth, Dean of the Graduate School of Library and Information Science, UIUC
Involves individual developers and research centers at Georgia, Maryland, and Virginia
Brings together humanists, library and information specialists, and computer scientists
NORA’s Initial Content Domain
About 5 GB of 18th and 19th Century British and American Literature
The Library at the University of North Carolina at Chapel Hill will contribute over 1,000 texts, mostly from the 19th century.
The Library at the University of Virginia will contribute 600 to 1,200 texts from the Early American Fiction project.
The Institute for Advanced Technology in the Humanities at the University of Virginia will contribute about 6,000 texts from its projects.
The University of California at Davis will contribute about 120 texts of 19th-century British women poets.
The University of Michigan will provide 175 volumes of American verse, plus literary materials from other collections, such as the Making of America journals, which include titles such as the Southern Literary Messenger, Ladies Repository, Appleton's, and Vanity Fair.
Indiana University will provide over 1,100 literary texts from the Wright American Fiction collection, the Victorian Women Writers Project, and a Swinburne project.
Brown University's Women Writers Project will contribute 40 literary texts from the 19th century.
The Perseus project will contribute its 19th-century literary texts, including several works by Charles Dickens.
NORA’s Guiding Principles
Web-deliverable services
Domain-specific visualizations
Visualizations as interface
Realize the potential of joining over ten years of digital library and archive building with state-of-the-art text analysis tools
Text mining on structured textual data, specifically XML/SGML
Tamarind as data repository (same principle as a search engine’s indices)
User-centered design
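Two of the principles above, mining structured XML and keeping a repository that works on the same principle as a search engine's indices, can be sketched together. The example below parses a tiny XML sample and builds an inverted index that records which element each word occurs in, so a query can distinguish a word in a verse line from the same word in a title or a prose paragraph. The sample markup, the element names, and the function names are assumptions made for illustration; this does not reflect Tamarind's actual design.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample; element names loosely imitate literary markup.
SAMPLE = """<corpus>
  <text id="poem1"><title>Sea Song</title>
    <l>The sea is wide</l><l>The night is long</l></text>
  <text id="novel1"><title>A Tale</title>
    <p>The night was dark and the sea was calm.</p></text>
</corpus>"""


def build_index(xml_string):
    """Return {word: {text_id: set of element tags the word occurs in}}."""
    index = defaultdict(lambda: defaultdict(set))
    root = ET.fromstring(xml_string)
    for text in root.iter("text"):
        text_id = text.get("id")
        for elem in text.iter():
            # Only each element's leading text is indexed in this sketch.
            for word in re.findall(r"[a-z]+", (elem.text or "").lower()):
                index[word][text_id].add(elem.tag)
    return index


def texts_with(index, word, tag):
    """IDs of texts where `word` occurs inside an element named `tag`."""
    hits = index.get(word, {})
    return sorted(tid for tid, tags in hits.items() if tag in tags)


if __name__ == "__main__":
    idx = build_index(SAMPLE)
    print(texts_with(idx, "sea", "l"))   # texts with "sea" in a verse line
    print(texts_with(idx, "sea", "p"))   # texts with "sea" in a paragraph
```

Because the index keeps the markup context, the structure that archive builders encoded by hand stays available to later analysis rather than being flattened into plain text.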
D2K/T2K (NCSA ALG)
General purpose data mining architecture
Offers users a visual programming environment for building data flows (“itineraries”) out of specific data mining applications (“modules”); visualizations as output
Modules are fully programmable and extensible via a standard API
T2K: library of D2K modules for text analysis (document clustering and classification)
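The idea of an itinerary built from modules can be pictured as a chain of small functions, each consuming the previous one's output, with a visualization as the last step. The sketch below is a plain Python illustration of that data-flow idea only. It is not the D2K or T2K API, and every name in it (`run_itinerary`, `top_terms`, `bar_chart`, and so on) is hypothetical.

```python
from collections import Counter
from functools import reduce


def run_itinerary(modules, data):
    """Pass data through each module in order; each output feeds the next."""
    return reduce(lambda value, module: module(value), modules, data)


def tokenize(text):
    """Module: split a text into lowercase whitespace-delimited tokens."""
    return text.lower().split()


def count_terms(tokens):
    """Module: count how often each token occurs."""
    return Counter(tokens)


def top_terms(n):
    """Module factory: keep the n most frequent terms (ties alphabetical)."""
    def module(counts):
        return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    return module


def bar_chart(pairs):
    """Stand-in for a visualization module: one text bar per term."""
    return "\n".join(f"{term:<6}{'#' * count}" for term, count in pairs)


# An "itinerary" here is simply an ordered list of modules.
itinerary = [tokenize, count_terms, top_terms(2), bar_chart]

if __name__ == "__main__":
    print(run_itinerary(itinerary, "the sea the sky the sea"))
```

Swapping one module for another (a different tokenizer, a clustering step, a different chart) changes the analysis without touching the rest of the chain, which is the appeal of this kind of architecture.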
NORA’s Audiences and Objectives
Scholarly research: assist in the discovery of new knowledge
Remember that the final stage of a data mining operation is evaluation and subsequent investigation by a human authority—so we’re not talking about having a computer prove T.S. Eliot was a major influence on Shakespeare.
Classroom—new pedagogies, new literacies; introduce next generation of humanists to digital tools
Give students a textual landscape to play in; what do they discover? What do they learn? Alternative to term paper—apply tools of literary criticism and scholarship.
NORA as a Prime Instance of Graphesis
“Graphesis is concerned with the study of visual epistemology as a dynamic, subjective process. It takes as its objects of study the history of visual forms, graphical expressions, and the concepts they embody within a social, cultural history. It seeks to expose and describe the principles for structuring knowledge through graphical form. It examines imaging technologies as instruments whose inscriptional characteristics register informationally, and also seeks to discover the ways various typologies of form have structured systems of graphical communication, artificial vision, and computational modeling of information in graphical display. Finally, graphesis is concerned with the creation of methods of interpretation that are generative and iterative, capable of producing new knowledge through the aesthetic provocation of graphical expressions.”
--Johanna Drucker
Cow Tools Redux
“The . . . mistake I made was making one of the tools resemble a crude hacksaw.”
–Gary Larson
Final Thoughts
We will get out of the toolbox when we stop designing “cow tools”: tools that rely on superficial metaphors for their functionality, such as a “Lightbox.” Getting out of the tool “box” means taking advantage of the computer’s native capacity for interaction, iteration, emergence, and pattern recognition.
Text mining and visualization ask the humanities to acknowledge a shift from documentary to algorithmic forms of evidence
Visualization (graphesis) will be an essential scholarly genre for the 21st century
Interested in Working with NORA?
Contact me at mgk@umd.edu
Watch for news and announcements
Check for our Web site at (probably) www.noraproject.org