1. Text Mining 1
Running head: TEXT MINING
Text Mining
Mark Sharp
Rutgers University, School of Communication, Information and Library Studies
Abstract
The general idea of text mining – getting small "nuggets" of desired information out of
"mountains" of textual data without having to read it all – is nearly as old as information retrieval
(IR) itself. Currently text mining is enjoying a surge of interest fueled by the popularity of the
Internet, the success of bioinformatics, and a rebirth of computational linguistics. It can be
viewed as one of a class of nontraditional IR strategies which attempt to treat entire text
collections holistically, avoid the bias of human queries, objectify the IR process with principled
algorithms, and "let the data speak for itself." These strategies share many techniques such as
semantic parsing and statistical clustering, and the boundaries between them are fuzzy.
Therefore in this paper several related concepts are briefly reviewed in addition to text mining
proper, including data mining, machine learning, natural language processing, text
summarization, template mining, theme finding, text categorization, clustering, filtering, text
visualization, and text compression. Current text mining systems per se appear to be fairly
primitive, but to have the following goals which may serve as a useful definition to distinguish
text mining from other IR concepts: (1) to operate on large, natural language text collections; (2)
to use principled algorithms more than heuristics and manual filtering; (3) to extract
phenomenological units of information (e.g., patterns) rather than or in addition to documents;
(4) to discover new knowledge. Interest in text mining for biomedical research purposes is
especially pervasive and can be viewed as a major new frontier in bioinformatics. Text mining
systems designed for use with science and technology text databases such as MEDLINE
currently seem to have an undue emphasis on expert human filtering which contradicts goal (2).
Whether this represents premature surrender to difficulty or a necessary temporary expedient
remains to be seen.
Text Mining
Why Text Mining?
It has become a cliché to describe information space and the challenge of navigating it in
dramatic, even histrionic terms ("explosion," "avalanche," "flood," and the like), especially with
regard to scientific, technical, and scholarly literature. We moderns may like to think we are the
first to face this problem, but scientists have always complained about keeping up with their
literature (Saracevic, 2001). The promise of better science through better information technology
has been a major theme in information science since Vannevar Bush (1945) proposed his famous
Memex machine to deal with the "growing mountain of research."
Text mining is data mining applied to textual data. Text is "unstructured, amorphous, and
difficult to deal with" but also "the most common vehicle for formal exchange of information."
Therefore, the "motivation for trying to extract information from it is compelling – even if success
is only partial …. Whereas data mining belongs in the corporate world because that's where most
databases are, text mining promises to move machine learning technology out of the companies
and into the home" as an increasingly necessary Internet adjunct (Witten & Frank, 2000) – i.e., as
"web data mining" (Hearst, 1997). Laender, Ribeiro-Neto, da Silva, and Teixeira (2001) provide a
current review of web data extraction tools.
Text mining is one of a class of what I will call "nontraditional information retrieval (IR)
strategies." The goal of these strategies is to reduce the effort required of users to obtain useful
information from large computerized text data sources. Traditional IR often simultaneously
retrieves both "too little" information and "too much" text (Humphreys, Demetriou, & Gaizauskas,
2000). The nontraditional strategies represent a "broader definition of IR" and the view that "a
truly useful system must go beyond simple retrieval" (Liddy, 2000). I see them as treating the
entire database or collection more holistically, recognizing that the selectivity of anthropogenic
queries has a downside or bias which can be counterproductive to obtaining the best information,
and attempting to "objectify" the IR process with principled algorithms.1 I like to think that they
try to "let the data speak for itself."
When I started to research this paper I made a list of all the IR concepts (traditional and
non-) that were explicitly related to text mining by the first wave of authorities I identified. It was
a daunting list (Table 1), but I thought it would be possible to rule them all either "in" or "out" and
thus define their boundaries and hierarchical relationships to text mining. However, it soon
became clear that the boundaries were fuzzy, the hierarchy was a mass of convoluted loops, and
even seemingly outlandish claims to text mining relevance had, on closer inspection, a grain of
truth.2 Therefore I decided to try to cover them all instead of focusing on text mining proper,
whatever that turned out to be. Fortunately, time and literature resource limitations intervened to
significantly curtail this plan. Hopefully the result will serve as a sensible compromise.
History of Text Mining
H. P. Luhn (1958), in a seminal paper on automatic abstracting, noted "the resolving power
of significant words" in primary text. Lauren B. Doyle (1961) also captured the spirit of text
mining and related methods when he said that "natural characterization and organization of
information can come from analysis of frequencies and distributions of words in libraries"
("libraries" representing what we would now more generally call collections or corpora). Text
mining per se may be new, but the dream of training a computer to extract information from
"mountains" of textual data is nearly as old as IR itself.
1
E.g., "'Objectivity' [means] the results solely depend on the outcome of the linguistic processing algorithms and
statistical calculations" (Dorre, Gerstl, & Seiffert, 1999). I recognize that such computational exotica, stripped of their
mathematical mystique, "can be regarded as a form of transformed cognitive structure" (Ingwersen & Willett, 1995)
and are therefore ultimately just as human and arbitrary as the traditional methods. But I also believe that there can be
degrees of objectivity (operationally defined as general validity or utility) and that in general abstract computational
approaches will tend to be more objective.
2
There is one website, however, that goes too far. Greenfield (2001) lists virtually every text processing and
database technology I have ever heard of under the title "Text Mining." As a kind of rite of passage into the subject,
Patrick Perrin asked me to look at it and tell him if all of that was really text mining, so apparently it's somewhat
notorious in the field.
Don R. Swanson (1988) articulated the idea that the scientific literature should be regarded
as a natural phenomenon worthy of "exploration, correlation, and synthesis." He contrasted
scientists' attitudes toward information usage with those of intelligence analysts.
'To the working scientist or engineer, time spent gathering information or writing reports is
often regarded as a wasteful encroachment on time that would otherwise be spent
producing results that he believes to be new' [Weinberg et al, 1963] …. The intelligence
analyst, by contrast, is much more intimate with the available base of recorded information.
New knowledge, or finished intelligence, is seen as emerging from large numbers of
individually unimportant but carefully hoarded fragments that were not necessarily
recognized as related to one another at the time they were acquired. Use of stored data is
intensively interactive; "information retrieval" is an inadequate and even misleading
metaphor. The analyst is continually interacting with units of stored data as though they
were pieces selected from a thousand scrambled jigsaw puzzles. Relevant patterns, not
relevant documents, are sought.
Swanson called upon scientists to be more like intelligence analysts; to "take seriously the idea that
new knowledge is to be gained from the library as well as the laboratory [and] to develop attitudes
toward information indistinguishable from attitudes toward research itself."
Not content to lecture scientists from a theoretical pedestal, by the time these words were
published Swanson had already put the idea into practice by developing a system to discover
meaningful new knowledge in the biomedical literature (see references in Swanson & Smalheiser,
1999). Software now called ARROWSMITH and freely available on the web
(http://kiwi.uchicago.edu) helps by finding common keywords and phrases in "complementary and
noninteractive" sets of articles or "literatures" and juxtaposing representative citations likely to
reveal interesting co-occurrences. Two literatures are "complementary if together they can reveal
useful information not apparent in the two sets considered separately" – e.g., one may reveal a
natural relationship between A and B, and the other a relationship between B and C, so that
together they suggest a relationship between A and C. The two literatures are "noninteractive" if
their articles do not cross-cite and are not co-cited elsewhere in the literature. Swanson has
discovered at least three biomedically important relationships using this system: between fish oil
and Raynaud's syndrome, magnesium and migraines and epilepsy, and arginine and somatomedin
C (Lindsay & Gordon, 1999). Most recently he has used it to identify several dozen viruses as
potential bioweapons (Swanson, Smalheiser, & Bookstein, 2001).
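The A-B-C reasoning behind complementary literatures can be sketched in a few lines of Python. This is only an illustration of the inference pattern, not ARROWSMITH itself: the function name `propose_links`, its parameters, and the toy relations are hypothetical, and the `known_ac` set is a crude stand-in for the "noninteractive" test, which in Swanson's description rests on citation data.

```python
def propose_links(relations_ab, relations_bc, known_ac=frozenset()):
    """Suggest A-C pairs that share at least one intermediate B term.

    relations_ab: iterable of (a, b) pairs reported in one literature
    relations_bc: iterable of (b, c) pairs reported in the other
    known_ac:     (a, c) pairs already linked directly, which are skipped
    """
    # Index the first literature by its B terms.
    a_terms_by_b = {}
    for a, b in relations_ab:
        a_terms_by_b.setdefault(b, set()).add(a)

    # Walk the second literature and join on the shared B term.
    proposals = {}
    for b, c in relations_bc:
        for a in a_terms_by_b.get(b, ()):
            if (a, c) not in known_ac:
                proposals.setdefault((a, c), set()).add(b)
    return proposals


# Illustrative toy relations (hypothetical, not extracted from real records).
literature_ab = [("fish oil", "blood viscosity"),
                 ("fish oil", "platelet aggregation")]
literature_bc = [("blood viscosity", "Raynaud's syndrome"),
                 ("platelet aggregation", "Raynaud's syndrome")]

hypotheses = propose_links(literature_ab, literature_bc)
```

Each proposed A-C pair carries the set of B terms that support it, so a human reader can judge whether the bridge is plausible.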
Swanson's system remains far from fully automated, it is highly medical domain-specific,
and to my knowledge Swanson has never referred to it as text mining. But I believe it meets the
criteria at least partially (see below), and Swanson has been recognized as an early pioneer by self-
described text mining practitioners Marti Hearst (1999) and Ronald Kostoff (1999). I would like
to go further and propose that, because of the ideas he expressed in his 1988 JASIS paper,
Swanson is the father of modern text mining.
What is Text Mining?
Text mining per se is new and is still defining itself. It "has the peculiar distinction of
having a name and a fair amount of hype but as yet almost no practitioners" (Hearst, 1999), and
most of the information about it on the web is "misleading" (Perrin, 2001). The mining metaphor
"implies extracting precious nuggets of ore from otherwise worthless rock" (Hearst, 1999), "gold
hidden in … mountains of textual data" (Dorre, Gerstl, & Seiffert, 1999), or the idea that "the
computer rediscovers information that was encoded in the text by its author" (IBM, 1998b).
Hearst (1997, 1999) has argued for a narrow definition of text mining which distinguishes
it from "information access" (traditional IR). Traditional IR is concerned primarily with the
retrieval of documents (perhaps it should be called "DR"!) relevant to a user's information need,
but getting the desired information out of the documents is left entirely up to the user. According
to Hearst, data mining (of which text mining is a subtype, see below) not only deals directly with
the information, it tries to discover or derive new information from the data (text) which was
previously unknown even to the author(s) of the data (text[s]). She says "data mining is
opportunistic, whereas information access is goal-driven" and that IR tricks such as clustering,
finding terms for query expansion, and co-citation analysis are not text mining, although they can
aid it by improving the target dataset. Thus, IR can be viewed as a complementary technique
supporting text mining, rather than its broader term.
Text mining always involves (a) getting some texts relevant to the domain of interest
(traditional IR); (b) representing the content of the text in some medium useful for processing
(natural language processing, statistical modeling, etc.); and (c) doing something with the
representation (finding associations, dominant themes, etc.) (Perrin, 2001).
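The three steps (a) through (c) can be made concrete with a deliberately naive Python sketch. The function names, the keyword-match retrieval, the bag-of-words representation, and the sample sentences are all assumptions chosen for brevity; real systems would substitute a proper IR engine and richer representations at each stage.

```python
import re
from collections import Counter


def retrieve(collection, domain_terms):
    """Step (a): keep texts that mention any domain term (a crude stand-in for IR)."""
    terms = {t.lower() for t in domain_terms}
    return [doc for doc in collection
            if terms & set(re.findall(r"[a-z]+", doc.lower()))]


def represent(doc):
    """Step (b): represent a text as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", doc.lower()))


def dominant_terms(bags, stopwords, k=3):
    """Step (c): report the k most frequent non-stopword terms across the set."""
    total = Counter()
    for bag in bags:
        total.update(bag)
    for word in stopwords:
        total.pop(word, None)
    return [word for word, _ in total.most_common(k)]


# Invented sample collection, for illustration only.
collection = [
    "Magnesium deficiency may trigger migraine attacks.",
    "Stock prices fell sharply on Monday.",
    "Migraine patients often show low magnesium levels.",
]
docs = retrieve(collection, ["migraine"])
bags = [represent(d) for d in docs]
themes = dominant_terms(bags, stopwords={"may", "often", "show", "low"}, k=2)
```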
IBM is marketing a product named "Intelligent Miner for Text" (IBM, 1998a,b; Dorre et
al, 1999). It is a set of tools which "can be seen as information extractors which enrich
documents with information about their contents" in the form of structured metadata. "Features"
are classes of data which can be extracted, such as the language of the text, proper names, dates,
currency amounts, abbreviations, and "multiword terms" (significant phrases). The feature
extraction component is "fully automatic – the vocabulary is not predefined." It may operate on
single documents or on collections of documents. Word counts are based on normalization to
canonical forms (e.g., surgeries, surgical, and surgically might all be normalized to surgery).
The phrase extractor "uses a set of simple heuristics… based on a dictionary containing part-of-
speech information for English words [and] simple pattern matching to find expressions having
the noun phrase structures characteristic of technical terms. This process is much faster than
alternative approaches." There is also a clustering tool, a classification tool, and a search engine/
web crawler. The clustering similarity measure is based on "lexical affinities" – correlated
groups of words which appear frequently within a short distance of each other and which can be
used to label the clusters.
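The idea of lexical affinities can be illustrated with a short Python sketch that counts word pairs occurring within a small window of each other. The IBM documentation quoted here does not give the actual similarity measure, so the function name `lexical_affinities`, the window size, and the decision to drop stopwords before measuring distance are all assumptions.

```python
import re
from collections import Counter


def lexical_affinities(texts, window=5, stopwords=frozenset()):
    """Count unordered word pairs that co-occur within `window` tokens.

    Stopwords are removed first, so distance is measured in content words.
    """
    pairs = Counter()
    for text in texts:
        tokens = [t for t in re.findall(r"[a-z]+", text.lower())
                  if t not in stopwords]
        for i, left in enumerate(tokens):
            # Pair each token with the next `window` tokens only.
            for right in tokens[i + 1:i + 1 + window]:
                if left != right:
                    pairs[tuple(sorted((left, right)))] += 1
    return pairs


# Invented sample sentences, for illustration only.
texts = [
    "The heart valve was replaced.",
    "A damaged heart valve causes murmurs.",
    "Valve surgery repaired the heart.",
]
affinities = lexical_affinities(texts, window=2, stopwords={"the", "a", "was"})
```

With a window of two content words, "heart" and "valve" are paired in the first two sentences but not in the third, where they are three words apart. The most frequent pairs are natural candidates for cluster labels.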
Lindsay and Gordon (1999) and Kostoff (1999) have extended Swanson's approach
without calling it text mining, but Kostoff's other work explicitly uses that label and so he serves
as a kind of bridge. Swanson's system is essentially as follows: MEDLINE searches are done on
two subjects (say, magnesium and migraines) and the results (titles or abstracts) are dumped into
ARROWSMITH, which generates a list of all significant words and phrases common to the two
result sets, and uses this information to "juxtapose pairs of text passages for the user to consider
as possibly complementary" (Swanson & Smalheiser, 1999). Lindsay and Gordon (1999) added
lexical frequency statistics (tf*idf) to rank the common words and phrases by probable
discriminatory value, but their system, like Swanson's, still requires "human filters" at several
points.
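A minimal sketch of this workflow, finding the terms common to two result sets and ranking them by a tf*idf weight, might look as follows. Neither ARROWSMITH's word lists nor Lindsay and Gordon's exact statistics are specified here, so the particular formula (combined term frequency times log(N/df)), the function names, and the invented sample sentences are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def shared_terms_ranked(lit_a, lit_c, background):
    """Rank terms common to literatures A and C by a tf*idf weight.

    tf  = occurrences of the term in A and C combined
    idf = log(N / df), with df counted over A + C + background documents
    """
    docs = [set(tokenize(d)) for d in lit_a + lit_c + background]
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in doc)

    tf_a = Counter(t for d in lit_a for t in tokenize(d))
    tf_c = Counter(t for d in lit_c for t in tokenize(d))
    shared = set(tf_a) & set(tf_c)

    scores = {t: (tf_a[t] + tf_c[t]) * math.log(n_docs / df[t]) for t in shared}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Invented sample "literatures", for illustration only.
magnesium = ["Magnesium inhibits platelet aggregation in the blood.",
             "Magnesium affects vascular tone in the brain."]
migraine = ["Platelet aggregation is elevated in the migraine patient.",
            "Migraine involves abnormal vascular tone."]
background = ["The study was conducted in the clinic.",
              "Results in the trial were mixed."]

ranked = shared_terms_ranked(magnesium, migraine, background)
top_terms = [term for term, _ in ranked[:4]]
```

In this toy example the function words "in" and "the" are shared by both sets but fall to the bottom of the ranking because they appear in nearly every document. The ranking only orders the candidates; deciding which shared terms are meaningful remains a human filtering step.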
Kostoff and co-workers have published several papers on the Web describing various text
mining systems and applications. Losiewicz, Oard, and Kostoff (2000) describe a "TDM [text
data mining] architecture that unifies information retrieval from text collections, information
extraction from individual texts, knowledge discovery in databases, knowledge management in
organizations, and visualization of data and information." What they mean by "unifies" is
unclear, but this statement clearly betokens a broad view of text mining, almost as a synonym for
the entire family of nontraditional IR strategies. The "TDM architecture" they describe includes
subsystems for data collection (source selection and text retrieval), data warehousing
(information extraction and data storage), and data exploitation (data mining and presentation).
It thus appears to be a system for extracting and analyzing metadata. The authors discuss
linguistic analysis and numerous exotic pattern-finding techniques, but these appear to be long-
range goals. Current work focuses on the more pedestrian challenges of relevance feedback
("simulated nucleation"), bibliometrics, and phrase extraction and statistics. The system is "time
and labor intensive" by the authors' own admission, "requires the close involvement of technical
domain expert(s)" at every level of processing, and aims for a "main output [consisting of]
technical experts who have had their horizon and perspectives broadened substantially through
participation in the data mining process. The data mining tools, techniques and tangible products
are of secondary importance…"
Kostoff, Toothman, Eberhart, and Humenik (2000) connect text mining to "database
tomography," a system for phrase extraction and proximity analysis. The authors capture the
spirit of text mining when they say "techniques that identify, select, gather, cull, and interpret
large amounts of technological information semi-autonomously can expand greatly the
capabilities of human beings…" The idea of "tomography" also evokes text visualization, an
important nontraditional IR strategy related to text mining (see below). The authors cite
unpublished studies showing that in "real-world text mining applications" there is a "strong de-
coupling of the text mining research performer from the text mining user. The performer tended
to focus on exotic automated techniques, to the relative exclusion of the components of judgment
necessary for user credibility and acceptance." Users tended to favor simpler techniques, even if
it meant "reading copious numbers of articles." Database tomography aims to couple text mining
research and technology more closely with the user through "heavy involvement of topical
domain experts (either users or their proxies)" in the development of "strategic database maps"
on the "front end." "The authors believe that this is the proper use of automated techniques for
text mining: to augment and amplify the capabilities of the expert by providing insights to the
database structure and contents, not to replace the experts by a combination of machines and
non-experts."
Kostoff and DeMarco (2001) define science and technology text mining as "the
extraction of information from technical literature." It has three components: information
retrieval (gathering relevant documents), information processing, and information integration.
"Information processing is the extraction of patterns from the retrieved records" by bibliometrics,
computational linguistics, and clustering. "Information integration is the synergistic combination
of the information processing computer output with the [human] reading of the retrieved relevant
records. The information processing output serves as a framework for the analysis, and the
insights from reading the records enhance the skeleton structure to provide a logical integrated
product." Again, "substantial manual labor" is noted, and technical details are not given, leaving
doubt as to what kind of and how much "computational linguistics" and "clustering" were
actually implemented. This work was also published under the title "Citation mining: Integrating
text mining and bibliometrics for research user profiling" by Kostoff, del Rio, Humenik, Garcia,
and Ramirez (2001).
In all of Kostoff's articles, there is a disturbingly high ratio of shifting, florid, technical
jargon and speculation to actual accomplishment. He seems to be re-inventing several well
established techniques such as relevance feedback, co-citation analysis, and phrase extraction,
giving them flashy new names, and failing to cite prior work by others. It is often unclear where
the boundary is between the computer and human filtering, particularly in Kostoff's phrase
extraction process. Given the authors' constant emphasis on the importance of human judgment
it seems likely that they have not automated the phrase selection process at all, and therefore
have not added anything to classical word proximity analysis for phrase identification.
Unrestricted human filtering or intervention in what are supposed to be algorithmic processes is,
in some sense, a form of "fudging" or "cheating." It is antithetical to the goals of standardizing
and objectifying the IR process, and it is hard to see how it contributes anything progressive to
text mining research. This is not to disagree with Kostoff about the importance of domain
expertise and user credibility and acceptance, only to caution against using such concerns as a
fig leaf for excessively primitive IR technology.
Based on the foregoing, I propose the following criteria for a true text mining system.
The keywords are highlighted.
• It must operate on large, natural language text collections.
• It must use principled algorithms more than heuristics and manual filtering.
• It must extract phenomenological units of information (e.g., patterns) rather than or in
addition to documents.
• It must discover new knowledge.
It is to be expected that different systems will meet these criteria to different extents.
Currently Swanson's and Kostoff's systems are on shaky ground on at least the first two, possibly
three. Perhaps text mining, by these criteria, is still more dream than reality. So let's look at
some related concepts.
Data Mining
It seems fairly noncontroversial that text mining is a subdiscipline of the broader and
slightly older field of data mining, the subdiscipline which deals with textual data. An
intermediate evolutionary lexical form, in fact, is "text data mining" (Hearst, 1999; Losiewicz et al,
2000). The mining metaphor implying "extracting precious nuggets of ore from otherwise
worthless rock" is actually more appropriate for text mining than for data mining, which tends to
deal with trends and patterns across whole databases (Hearst, 1999).
Data mining is considered a synonym for "knowledge discovery in databases" (KDD) by
some writers (e.g., Hearst, 1999) and as a narrower term by others (e.g., Liddy, 2000). The most
cited definition of KDD is that given by Fayyad, Piatetsky-Shapiro, and Smyth (1996, cited by Qin,
2000, and Hearst, 1997): the nontrivial process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in data. "Information archaeology" is a synonym for both data
mining and KDD, according to Hearst (1999). Two unusually practical, down-to-earth books on
data mining are Witten and Frank (2000) and Han and Kamber (2001) (Perrin, 2001).
Data mining usually deals with structured data, but text is usually fairly unstructured. The
crux of the text mining problem, then, can be viewed as imposing structure on text to make it
amenable to the analytic techniques of data mining. This is often conceptualized as extracting
metadata from text (Losiewicz et al, 2000).
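As a small illustration of imposing structure on text, the following Python sketch pulls a few feature types (dates, currency amounts, and abbreviations) out of free text into a structured record. The regular expressions are simplistic assumptions written for this example, the function name `extract_metadata` is hypothetical, and the sample sentence is invented; none of this reflects how any particular commercial tool works.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Dates written like "March 3, 1998".
DATE = re.compile(r"\b(?:" + MONTHS + r") \d{1,2}, \d{4}\b")
# Dollar amounts like "$1,200.50" (may pick up a trailing comma in some texts).
MONEY = re.compile(r"\$\d[\d,]*(?:\.\d+)?")
# Runs of two or more capital letters, treated as abbreviations.
ABBREVIATION = re.compile(r"\b[A-Z]{2,}\b")


def extract_metadata(text):
    """Return a structured record of simple features found in free text."""
    return {
        "dates": DATE.findall(text),
        "amounts": MONEY.findall(text),
        "abbreviations": ABBREVIATION.findall(text),
    }


# Invented sample sentence, for illustration only.
record = extract_metadata("On March 3, 1998, IBM paid $1,200.50 for the license.")
```

Once text has been reduced to records of this kind, each field can be stored as a column and analyzed with ordinary data mining techniques.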
Machine Learning
Data mining is based on a variety of computational techniques, some of which fall under
the rubric of machine learning. Examples are decision trees, neural networks, and association rules
(clustering). In this context, machine learning involves "the acquisition of structural descriptions
from examples [which] can be used for prediction, explanation, and understanding." When the
description can be used to classify the examples, all three are enabled, unlike purely statistical
modeling which only supports prediction. By some views, however, machine learning is little
more than practical statistics as it evolved in the field of computer science; i.e., with an emphasis
on searching "through a space of possible concept descriptions for one that fits the data" (Witten &
Frank, 2000).
From a broader artificial intelligence (AI) perspective, machine learning is one of the four
capabilities needed for an AI system such as a robot to pass the "Turing test" – that is, to appear
logical, rational, and intelligent to an intelligent human interrogator. In this context machine
learning involves the ability "to adapt to new circumstances and to detect and extrapolate patterns"
(Russell & Norvig, 1995).
From a biomedical research perspective, Mjolsness and DeCoste (2001) define machine
learning as "the study of computer algorithms capable of learning to improve their performance of
a task on the basis of their own previous experience" primarily through pattern recognition and
statistical inference. They see a legitimate future role for it in "every element of scientific method,
from hypothesis generation to model construction to decisive experimentation." Text mining
could help with the "high data volumes" involved in literature searching. However, most work to
date has focused on experimental data reduction such as visualization of high-dimensional vector
data resulting from gene expression microarray studies (see footnote 6, p. 25).
Natural Language Processing
Natural language processing (NLP) or understanding (NLU) is the branch of linguistics
which deals with computational models of language. A brief history is given by Bates (1995).
Its motivations are both scientific (to better understand language) and practical (to build
intelligent computer systems). NLP has several levels of analysis: phonological (speech),
morphological (word structure), syntactic (grammar), semantic (meaning of multiword
structures, especially sentences), pragmatic (sentence interpretation), discourse (meaning of
multi-sentence structures), and world (how general knowledge affects language usage) (Allen,
1995). When applied to IR, NLP could in principle combine the computational (Boolean, vector
space, and probabilistic) models' practicality with the cognitive model's willingness to wrestle
with meaning. NLP can differentiate how words are used such as by sentence parsing and part-
of-speech tagging, and thereby might add discriminatory power to statistical text analysis.
Clearly, NLP could be a powerful tool for text mining. Interest in it for that purpose is
widespread but the jury remains out.
Rau (1988) described an early NLP system named SCISOR which was developed by
General Electric. Limited applicability to "constrained domains" was emphasized; SCISOR was
programmed to deal only with information on corporate mergers. Input (news stories, etc.) was
described as being converted to "conceptual format" permitting natural language interrogation
(i.e., question answering) and summarization. SCISOR employed a parallel strategy of top-down
(expectation-driven conceptual analysis) and bottom-up (partial linguistic analysis) parsing.
Parsing is the identification of subjects, verbs, objects, phrases, modifiers, etc., within sentences.
Computerized parsing of free text "is an extremely difficult and challenging problem," according
to Rau. The two parsers in SCISOR interacted with a domain-specific knowledge base
containing grammatical and lexical information. The double parsing strategy of SCISOR
allowed flexibility to perform in-depth analysis when complete grammatical and lexical
knowledge is available, and superficial analysis when unknown words and syntax are
encountered, giving the system robustness. The top-down parser could also be used for text
skimming (looking for particular pieces of information).
However, semantic analysis "is very expensive and furthermore depends on a lot of
domain-dependent knowledge that has to be constructed manually or obtained from other sources"
(IBM, 1998a). Early NLP's image also suffered from the poor performance of phrase-based
indexing in comparison with stemmed single words in the Cranfield and SMART tests (Salton,
1992). Interest in NLP revived when request-oriented (as opposed to document-oriented) IR came
of age and it was realized that the limitations of the linguistic techniques did not prevent them from
being effective within restricted subject domains (Ingwersen & Willett, 1995). Unlike its more
successful sibling field of speech recognition, NLP has the severe disadvantages of diffuse goals
and lack of robust machine learning algorithms (Bates, 1995). There seems to be wide consensus
that NLP is still not competitive with statistical approaches to traditional IR, but that it may be
practical and even critical for applications such as phrase extraction and text summarization. Even
Salton, the godfather of statistical IR, said, "In the absence of deep linguistic analysis methods that
are applicable to unrestricted subject areas, it is not possible to build intellectually satisfactory text
summaries" (Salton, Allan, Buckley, & Singhal, 1994).
Liz Liddy (2000, 2001) has become a prominent advocate for NLP in text mining. Her
definition of the goal of text mining, in fact, is "capturing semantic information" as tabular
metadata amenable to statistical data mining techniques. In her work, NLP includes stemming
(morphological level), part-of-speech tagging (syntactic level), phrase and proper name
extraction (semantic level), and disambiguation (discourse level). Goals include automating text
mark-up for hypertext linkages in digital libraries, and machine learning algorithms for text
classification (see below).
A "reverse flow" of purely statistical methods to NLP has been going on since about
1990 and has made "substantial contributions" (Kantor, 2001), increasing interest in hybrid
approaches (Marcus, 1995; Losee, 2001a; Perrin, 2001). Statistical enrichment has been shown
to significantly improve the accuracy of proper name classification, part-of-speech tagging, word
sense disambiguation, and parsing under certain conditions (Marcus, 1995), and tagging and
disambiguation improve probabilistic document retrieval ranking discrimination by some parts of
speech (Losee, 2001a). Ultimately, lexical statistics are a reflection of term dependencies which
in turn reflect natural languages' relation to "naturally occurring dependencies in the physical
world" (Losee, 2001b). However, higher-level NLP proved far inferior to "shallow" tricks like
stemming and query expansion in improving the performance of an advanced IR system under
rigorous test conditions (Perez-Carballo & Strzalkowski, 2000).
Computational linguistics is used as a synonym for NLP by some writers and as a
narrower term by others. According to Hearst (1999), it is the branch of NLP which deals with
finding statistical patterns in large text collections to inform algorithms for NLP techniques such
as part-of-speech tagging, word sense disambiguation, and bilingual dictionary creation; i.e.,
computational linguistics is a form of text mining. Thus, to Hearst and Liddy, text mining
subserves NLP, rather than the reverse. Both Hearst and Liddy refer often to metadata as being
the bridge between NLP and statistics. They both envision text mining as a component of a full-
featured information access system which also includes source detection, content retrieval, and
analytical aids such as text visualization (see below).
A major problem in text analysis is "dangling anaphors" – pronouns and demonstratives
(this, that, the latter, etc.) which refer back to other sentences (Johnson, Paice, Black, & Neal,
1993). Therefore a good job for NLP would be to detect anaphors and search backwards to
resolve their referents. In the language of logic, this might be called identifying the point in the
text where each significant new proposition begins. In 1993, that was beyond available text
processing capabilities, so the authors had to exclude anaphoric sentences from further analysis
regardless of their information content.
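The exclusion strategy can be approximated with a simple cue-word filter, sketched below in Python. The rules actually used by Johnson et al are not given here, so the cue list, the three-word lookahead, and the function names are assumptions, and the sample text is invented.

```python
import re

# Hypothetical cue words; a real system would need a much more careful list.
ANAPHOR_CUES = {"this", "that", "these", "those", "it", "they", "such"}
ANAPHOR_PHRASES = ("the latter", "the former")


def is_anaphoric(sentence, lookahead=3):
    """Flag a sentence whose opening words suggest a backward reference."""
    words = re.findall(r"[a-z]+", sentence.lower())
    if any(word in ANAPHOR_CUES for word in words[:lookahead]):
        return True
    joined = " ".join(words)
    return any(phrase in joined for phrase in ANAPHOR_PHRASES)


def drop_anaphoric(text):
    """Split text into sentences and keep only the non-anaphoric ones."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s and not is_anaphoric(s)]


# Invented sample text, for illustration only.
sample = ("Magnesium blocks calcium channels. This effect reduces vascular spasm. "
          "Migraine patients were studied. However, they showed no change.")
kept = drop_anaphoric(sample)
```

A filter this blunt also discards sentences in which "this" points forward or at the document itself (as in "In this paper we show that..."), which illustrates why detection alone, without resolution, costs information.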
In summary, all this activity and interest raise hopes, but NLP still "has not delivered the
goods" (Saracevic, 2001) and so the jury remains out.
Text Summarization
An obvious example of text mining would be to find previously unknown natural
correlations by looking at co-occurrences of themes in a corpus of texts. Before one can do that,
of course, one must identify the themes. A theme being a form of summary, automated theme-
finding is a form of automatic text summarization (or automatic abstracting), a proud old IR
tradition.
Johnson, Paice, Black, and Neal (1993) trace the history of automatic abstract generation
from Luhn (1958), who proposed extracting sentences based on their computed word content
weights, and Baxendale (1958, cited by Johnson et al., 1993), who drew attention to the
importance of the first and last sentences of paragraphs. Edmundson (1969, cited by Johnson et
al., 1993) found that both of these methods were inferior to extraction on the basis of cues (bonus
words and stigma words). Paice (1981, cited by Johnson et al., 1993) sharpened Edmundson's
idea of cues to "indicator constructs" such as "In this paper we show that…"
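The extraction heuristics just described can be illustrated with a minimal sketch. It scores each sentence by the average corpus frequency of its content words (in the spirit of Luhn's word weights) and adds a bonus when the sentence contains an indicator phrase (in the spirit of Paice's indicator constructs). The stop-word list, the indicator phrases, the bonus value, and all function names are assumptions made for illustration; none of them is taken from the cited systems.

```python
import re
from collections import Counter

# Illustrative assumptions: a tiny stop-word list and two indicator phrases.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "that",
              "we", "this", "are", "it"}
INDICATORS = ("in this paper we show", "we conclude that")


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lower-cased alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOP_WORDS]


def score_sentences(text, indicator_bonus=2.0):
    """Return (score, sentence) pairs in original sentence order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))
    scored = []
    for s in sentences:
        words = content_words(s)
        # Average frequency of the sentence's content words.
        score = sum(freq[w] for w in words) / len(words) if words else 0.0
        # Cue bonus for sentences containing an indicator phrase.
        if any(phrase in s.lower() for phrase in INDICATORS):
            score += indicator_bonus
        scored.append((score, s))
    return scored


def extract(text, n=1):
    """Pick the n best-scoring sentences, reported in original order."""
    scored = score_sentences(text)
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:n]
    chosen = {s for _, s in top}
    return [s for _, s in scored if s in chosen]
```

Calling `extract(text, n)` on a document string returns its `n` highest-scoring sentences. Averaging rather than summing the word weights is a design choice made here so that long sentences are not favored automatically.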
Johnson et al. (1993) built an NLP-based auto-abstracting system which selected non-
anaphoric, indicator-containing sentences and ran them through a bottom-up parser, dictionary-
based part-of-speech tagger (noun, verb, etc.) and morphology-based tagger (-ly = adverb, etc.).
Each word was then indexed by its sentence number, position within the sentence, part of speech,
verb tense if applicable, and whether it was plural or singular. The result was then "cleaned
up" by a set of corrective heuristics and a grammar-based tag disambiguator.³ A global parser
then identified noun phrases based on definitive cues such as being separated by a preposition
(e.g., the primary factor in public health), and then parsed the sentence. The resulting sample
abstract was "far from perfect" as the authors admitted, but it was a plausible condensation down
to 22% of the original text size. Since 22% is an inadequate degree of data reduction for most
text summarization needs, the next step might be to take a page from statistical IR and develop
ways of ranking the selected sentences.
Template mining
SCISOR's (Rau, 1988) text summarization capabilities were based on filling in values
specified by domain-dependent, manually formulated "scripts" – e.g., company A offered B
dollars per share in a takeover bid for company C on date D. The values were extracted from
raw text by parsing and stored in relational data tables. Then summaries of the parsed data
values could be written by a natural language generator. This seems to be a form of template
mining, where the script or metadata table field structure constitutes the template.
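A minimal sketch of the script-filling idea follows, using the takeover example quoted above. A regular expression stands in for SCISOR's parser, which is a deliberate simplification and not how SCISOR worked; the field names, the pattern, and the output sentence are all assumptions made for illustration. The pattern also assumes that the bidder's name is not preceded by another capitalized word.

```python
import re

# Hypothetical template for the takeover "script"; the regular expression
# replaces real parsing purely for illustration.
TAKEOVER_TEMPLATE = re.compile(
    r"(?P<bidder>[A-Z][\w&.]*(?: [A-Z][\w&.]*)*) offered "
    r"\$?(?P<price>\d+(?:\.\d+)?) dollars per share "
    r"in a takeover bid for (?P<target>[A-Z][\w&.]*(?: [A-Z][\w&.]*)*) "
    r"on (?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})"
)


def fill_template(sentence):
    """Return a dict of template fields, or None if the sentence does not match."""
    match = TAKEOVER_TEMPLATE.search(sentence)
    if match is None:
        return None
    record = match.groupdict()
    record["price"] = float(record["price"])
    return record


def summarize(record):
    """Generate a one-line summary from a filled template."""
    return ("{bidder} bid {price:.2f} dollars per share "
            "for {target} on {date}.").format(**record)
```

Each record returned by `fill_template` corresponds to one row of a relational table whose columns are the template fields, and `summarize` plays the role of a (very crude) natural language generator.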
Chowdhury (1999) describes template mining as a form of information extraction using
NLP "to extract data directly from the text if either the data and/or text surrounding the data form
recognizable patterns. When text matches a template, the system extracts data according to the
instructions associated with that template." Chowdhury traces its history from the mid-1960s
Linguistic String Project at New York University, where "fact retrieval" was conducted against
template data mined from natural language text, up to its current (1999) use in the AltaVista and
Ask Jeeves web search engines. He cites some of the same work I reviewed under NLP and
below (the Rau, Paice, and Gaizauskas groups), perhaps implying that template mining is a
general term for NLP-based metadata approaches to text mining. He also cites Croft (1995) in
reference to the U.S. Advanced Research Projects Agency (ARPA) initiative in this area, the
Message Understanding Conferences (MUCs).
³ An example of a sentence with intractable tag ambiguity would be "Rice flies like sand," which could refer to the
behavior of grain or insects (Allen, 1995, p. 13). Such a sentence would require higher (pragmatic and discourse)
levels of analysis to disambiguate.
To facilitate template mining, Chowdhury recommends "standardization in the
presentation and layout of information within digital documents" through the use of templates for
document creation. But this is contrary to the spirit of text mining, which is to liberate both the
creators and the users of text from as much tedium and artificiality as possible. Like Kostoff's
unrestricted reliance on human filters, it represents a form of surrender in the face of difficulty –
hopefully premature!
Theme Finding
Salton, Allan, Buckley, and Singhal (1994) looked at how traditional IR models can be
applied to theme generation and text summarization. The authors derived the notion of passage
retrieval from the problem of ranking vector matches when the vectors are of different lengths,
e.g. very short queries against long documents, or clustering documents of different sizes. One
solution is to decompose the documents into subunits of roughly equal size, called "passages." A
common passage unit is a paragraph.
The passages may be converted to normalized vectors and compared. Those with
similarities above a certain threshold (which may be chosen to deliver a desired degree of
abstraction) are considered connected. If the documents are plotted as arcs on the circumference
of a circle and their component passages connected by straight lines in accordance with their
vector similarities, the resulting starburst pattern can convey themes within and between
documents. These themes can be focused by expressing each triangle of passage similarities
as a centroid and doing similarity calculations on the centroids.
One may want to compute an estimate of the "most important" passages for the purpose
of selective text traversal ("skimming") or text summarization. Such passages might be
identified as (a) having a large number of above-threshold similarity connections, (b) strategic
position (e.g., the first paragraph in each section), or (c) high similarity to some reference node.
The last criterion (c) is called "depth first" selection. In practice, all three of these criteria can be
combined; e.g., start with some desired passage (as in "more like this"), go to the most similar
sectional heading passage, then go to its strongest link, then select the other densely connected
nodes in that cluster in chronological order. For text summarization, repetition can be edited out
on the basis of similarities between sentences or other subunits which are "too high."
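The passage-linking procedure can be sketched as follows. Each passage becomes a raw term-frequency vector, pairs whose cosine similarity lies above a threshold are linked, and passages are ranked by their number of links, which corresponds to criterion (a). The use of unweighted term frequencies and the function names are assumptions made for illustration; Salton et al. used their own term-weighting schemes.

```python
import math
import re
from collections import Counter


def vectorize(passage):
    """Raw term-frequency vector (no idf weighting; a simplification)."""
    return Counter(re.findall(r"[a-z]+", passage.lower()))


def cosine(u, v):
    """Cosine similarity, i.e. the dot product of length-normalized vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_links(passages, threshold):
    """Set of index pairs (i, j), i < j, whose similarity exceeds the threshold."""
    vectors = [vectorize(p) for p in passages]
    links = set()
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > threshold:
                links.add((i, j))
    return links


def rank_by_connections(passages, threshold):
    """Passage indices ordered by number of links, ties broken by position."""
    degree = Counter()
    for i, j in similarity_links(passages, threshold):
        degree[i] += 1
        degree[j] += 1
    return sorted(range(len(passages)), key=lambda k: (-degree[k], k))
```

Raising `threshold` prunes links and yields a sparser, more abstract map; lowering it has the opposite effect.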
Text Categorization
Text categorization should not be considered a form of text mining because it is a
"boiling down" of document content to "pre-defined labels" which "does not lead to discovery of
new information" since "presumably the person who wrote the document knew what it was
about," according to Hearst (1999). Presumably she would also rule out text summarization and
auto-indexing for the same reason. She makes exceptions, however, for cases where the goal of
categorization is to find "unexpected patterns" or "new events" because these "tell us something
about the world, outside of the text collection itself" and therefore qualify as new information.
I would argue, however, that it is not so easy to predict where "new information" will
come from, that novelty is in the eye of the beholder, and that any form of text data reduction is a
form of separating "precious nuggets" from "worthless rock" according to the human
idiosyncrasies of whoever is doing the separating, be it a traditional library cataloguer/indexer or
a vector space modeler. This is not to say that cataloguing, indexing, and other IR tools are all
text mining, but just to highlight the fuzziness of the boundaries between them.
Clustering
Clustering can be used to classify texts or passages in natural categories that arise from
statistical, lexical, and semantic analysis rather than the arbitrarily pre-determined categories of
traditional manual indexing systems. In the context of text mining, it is the derivation of the
categories which is of interest, since this is a form of theme finding and therefore text
summarization. Once the texts are clustered on the basis of common themes, it may also be useful
to correlate their divergent themes, à la Swanson. Texts may also be clustered on the basis of
length, cost, date, etc. (IBM, 1998b), or bibliographic data such as author, institution, or country of
origin (Kostoff, 1999). Computational aspects of clustering are reviewed by Witten and Frank
(2000, Section 6.6).
Filtering
E-mail filtering is often mentioned as an example of text mining (e.g., Witten and Frank,
2000). The relevance of related techniques such as name recognition, theme finding, and text
categorization is obvious, and it is even possible to imagine software which modifies its own
filtering criteria by discovering new patterns in the whole e-mail stream. However, I was unable
to find reports of any actual work on such a system.
Belkin and Croft (1992) built a model of information filtering (IF) based on Belkin's
famous anomalous states of knowledge (ASK) model of IR. In a side-by-side comparison, the
two (IF and IR) appear strikingly similar, the biggest difference being the "stable, long-term…
regular information interests" of IF compared to the "periodic… information need or ASK" of
IR. Extending the side-by-side modeling to Bayesian inference networks, the authors arrive at
another striking comparison: the IF network looks exactly like an upside-down IR network! That
is, in IR multiple documents are percolating down to a single user, while in IF each single
incoming document is percolating down to multiple users. However, the authors reject this
analogy for reasons not entirely clear to me.⁴
Text Visualization
Text visualization shares text mining's goals of using computational transformations to
reduce the cognitive effort of dealing with large text corpora, highlight patterns across
documents, and help discover new knowledge. Text mining implies homing in on "precious
nuggets" whereas text visualization seems to be concerned with the "big picture," but in practice
both may be regarded as elements of a holistic approach to multi-text corpora. The text mining
systems of Hearst, Kostoff, and Liddy all have explicit text visualization components.
Wise (1999) developed a text visualization paradigm for intelligence analysis named
Spatial Paradigm for Information Retrieval and Exploration (SPIRE) "to find a means of
‘visualizing text’ in order to reduce information processing load and to improve productivity" by
representing large numbers of documents to permit "rapid retrieval, categorization, abstraction,
and comparison, without the requirement to read them all." The theory behind SPIRE was that
humans’ most highly evolved perceptual abilities are those involved in interpreting "visual
features of the natural world." Therefore the goal was to represent text as natural, ecological
images from our early hominid past which require no "prolonged training to appreciate and use"
such as star fields or landscapes (Figure 1). This transformation was accomplished using
standard vector space algorithms and involved clustering and text summarization. SPIRE is an
excellent example of how a cognitive theory can be helpful in inspiring IR innovation and
guiding system development, despite its apparent lack of commercial success.⁵
⁴ They seem to feel that "P(oj|pi)", the probability that the incoming document will satisfy the information need
given a user's filtering profile, is poorly understood compared to the conventional Bayesian need-query-document
relationships, but I'm not sure the latter are so well-understood, either.
Text Compression
As mentioned at the beginning, I started this paper by trying to narrow the definition and
scope of text mining by differentiating it from other nontraditional IR strategies (Table 1). One
by one, however, the other strategies refused to be cleanly differentiated, and the foregoing
polyglot review is the result. The only concept I thought I had succeeded in banishing from the
scope of text mining was data compression, which showed up in the title of a single citation in a
literature search performed for me by Melissa Yonteck. Data compression, à la PKZIP, was
surely not related in any meaningful way to text mining, Yonteck and I agreed. Here at last was
something I could confidently rule out.
But on page 334, Witten and Frank (2000), in discussing statistical character-based
models for token classification (names, dates, money amounts, etc.), note that "there is a close
connection with prediction and compression: the number of bits required to compress an item
with respect to a model can be interpreted as the negative logarithm of the probability with which
that item is produced by the model." That is, text compression algorithms might function as
token classifiers in reverse! So I give up. Text mining appears to be related to just about
everything on my original list.
⁵ Cartia, Inc., which was marketing the ThemeScape™ software (Figure 2, downloaded Fall 2000), no longer has
any detectable presence on the Web.
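The link between compression and classification quoted from Witten and Frank can be made concrete with a small sketch. One character model is trained per token class, the ideal code length of a token under each model is computed as the negative base-2 logarithm of its probability, and the token is assigned to the class whose model encodes it in the fewest bits. The unigram character model, the add-one smoothing, the alphabet, and the function names are simplifying assumptions made here; they are not the models Witten and Frank describe.

```python
import math
from collections import Counter

# Assumed alphabet; characters outside it are ignored.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789$.,/- "


def train_char_model(examples):
    """Unigram character model with add-one smoothing over ALPHABET."""
    counts = Counter(ch for ex in examples for ch in ex.lower()
                     if ch in ALPHABET)
    total = sum(counts.values()) + len(ALPHABET)
    return {ch: (counts[ch] + 1) / total for ch in ALPHABET}


def bits_to_encode(token, model):
    """Ideal code length in bits: -log2 P(token | model)."""
    return sum(-math.log2(model[ch]) for ch in token.lower()
               if ch in ALPHABET)


def classify(token, models):
    """Label of the model under which the token compresses best."""
    return min(models, key=lambda label: bits_to_encode(token, models[label]))
```

Given a dictionary such as `{"date": train_char_model([...]), "money": train_char_model([...])}`, `classify` returns the label of the best-compressing model. Minimizing code length is the same as maximizing probability, which is why a compressor can be read as a classifier.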
Biomedical Applications
My interest in text mining is motivated primarily by the belief that it can be fruitfully
applied to biomedical literature, specifically the MEDLINE database, to discover new knowledge.
I see text analysis as a major new frontier in bioinformatics, whose smashing success in the area of
gene sequence analysis is based, after all, on nothing more than algorithms for finding and
comparing patterns in the four-letter language of DNA. Swanson's work has focused on
MEDLINE, and Hearst (1999) has also declared a research interest in "automating the discovery of
the function of newly sequenced genes" by determining which novel genes are "co-expressed with
already understood genes which are known to be involved in disease."
Humphreys, Demetriou, and Gaizauskas (2000) used information extraction, defined as
"extracting information about predefined classes of entities and relationships from natural
language texts and placing this information into a structured representation called a template" [is it
therefore template mining?], to build a database of information about enzymes, metabolic
pathways, and protein structure from full text biomedical research articles. The LaSIE (Large
Scale Information Extraction) system includes modules for datatype recognition (names, dates,
etc.), co-reference resolution (pronouns, anaphors, metonyms, etc.), and different types of template
filling. It does linguistic analysis at all levels up to discourse using lexical knowledge,
morphology, and grammars to identify significant words. The enzyme and metabolic pathway
variant of LaSIE is called (of course) EMPathIE and fills the following template fields: enzyme
name, EC (Enzyme Commission) number, organism, pathway, compounds involved and their roles
(substrate, product, cofactor, etc.), and, interestingly, compounds not involved. Optional fields
include concentration and temperature. The PASTA variant deals with protein structure
information such as which amino acid residues occupy given positions, active and binding sites,
secondary structure, subunits, interactions with other molecules, source organism, and SCOP
category. The prototype has been tested on only six journal papers, so it is far from satisfying the
large text corpus requirement for true text mining, but the authors make no such claim.
The U.S. National Institutes of Health (NIH) have also gotten involved. Tanabe, Scherf,
Smith, Lee, Hunter, and Weinstein (1999) developed a system named MedMiner to help them sort
out the thousands of gene expression correlations resulting from microarray experiments⁶ to
separate "interesting biological stories" from mere epiphenomena and statistical coincidences. The
first module gathers the relevant texts by querying PubMed (MEDLINE) and GeneCards (an
Israeli gene information database) on the expressed genes. [Gene names generally make good
search words because they are different from normal English words, e.g. "JAK3".] The second
module filters the retrieved texts by user-specifiable relevance criteria based on classical proximity
or term frequency scores (NLP criteria being regarded as too computationally expensive). The
third module is a "carefully designed user interface" to facilitate access to the most likely-to-be-
interesting documents.
Despite the name, then, MedMiner is not a true text mining system, but rather a search and
display enhancement to PubMed (which offers only flat Boolean search logic, unranked retrieval,
and no integration with GeneCards, although it is integrated with other gene and protein
databases). Like Kostoff's system, it is designed to deal with highly technical information by
assisting expert users in their traditional IR tasks rather than attempting to automate them
completely. MedMiner is freely available online at http://discover.nci.nih.gov.
⁶ Basically, a square chip coated with an array of known DNA sequences at known locations on the chip is dipped
into a broth containing the expressed messenger RNA (mRNA) from cells under given conditions. The mRNA is
labeled so that when it binds to its complementary DNA on the chip the gene expression pattern is revealed. Gifford
(2001) briefly reviewed the direct application of data visualization to gene expression data not involving any text.
Another NIH group, Rindflesch, Hunter, and Aronson (1999), developed a true NLP
system named ARBITER for mining molecular binding terms from MEDLINE. ARBITER
attempts to identify noun phrases representing molecular entities such as drugs, receptors,
enzymes, toxins, genes, messenger molecules, etc., and their structural features (box, chain,
sequence, subunit, etc.) likely to be involved in binding. ARBITER makes use of MeSH indexing,
the lexical and semantic knowledge bases of the Unified Medical Language System (UMLS) and
GenBank, co-word adjacency to forms of bind, and a variety of linguistic strategies to deal with
acronyms, anaphors, modifiers, coordinated phrases, and nested phrases (e.g., "…a previously
unrecognized coiled-coil domain within the C terminus of the PKD1 gene product, polycystin, and
demonstrate…"). A test on a small sample (116 abstracts containing a form of bind, one month's
worth from MEDLINE) yielded 72% recall and 79% precision of manually marked binding terms.
While terminology extraction might be considered a fairly trivial form of text mining, it is
obviously a logical step toward the mining of binding relationships (A binds B) which would have
enormous potential for knowledge discovery.
Stapley and Benoit (2000) developed a system named “BioBiblioMetrics” (Stapley,
2000) which uses text visualization to suggest functional clusters of genes from the yeast
Saccharomyces cerevisiae. The system uses a subset of MEDLINE records containing the
yeast's name, a lexical knowledge base of all the known, nontrivial yeast genes and their aliases
from the SGD (Saccharomyces Genome Database), and a matrix of gene name pair co-occurrence
statistics. When one does a search on a gene name or function (e.g. "DNA replication"), the co-
occurring genes are displayed in a graph with “nodes” representing genes and edge lengths
between the nodes representing biological proximity (Figure 2). Nodes are hypertext-linked to
sequence databases, and edges to those MEDLINE documents that generated them, creating a
biomedical information “landscape” and inference network. BioBiblioMetrics is freely available
online at http://www.bmm.icnet.uk/~stapleyb/biobib/.
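The co-occurrence matrix at the heart of such a system can be sketched in a few lines. The code below counts, for every pair of genes, the number of abstracts that mention both, using a small alias lexicon to normalize names. The data structures and function names are assumptions made for illustration, and the statistics actually used by BioBiblioMetrics may well differ from raw counts.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstracts, gene_aliases):
    """Count, for each unordered gene pair, the abstracts mentioning both.

    gene_aliases maps a canonical gene name to a set of aliases; it is a
    hypothetical stand-in for a curated gene lexicon.
    """
    pair_counts = Counter()
    for text in abstracts:
        tokens = set(re.findall(r"[A-Z0-9]+", text.upper()))
        present = sorted(
            gene for gene, aliases in gene_aliases.items()
            if tokens & {name.upper() for name in aliases | {gene}}
        )
        # Every pair of genes found in the same abstract co-occurs once.
        for pair in combinations(present, 2):
            pair_counts[pair] += 1
    return pair_counts


def neighbours(gene, pair_counts):
    """Genes co-occurring with `gene`, most frequent first."""
    hits = Counter()
    for (a, b), n in pair_counts.items():
        if gene == a:
            hits[b] += n
        elif gene == b:
            hits[a] += n
    return hits.most_common()
```

A display layer could then draw each gene as a node and make edge length some decreasing function of the count, so that frequently co-mentioned genes sit close together; the choice of that function is left open here.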
Other MEDLINE text mining papers which I did not have a chance to review in full
involve dictionary-controlled natural language processing for extraction of drug-gene relationships
(Rindflesch, Tanabe, Weinstein, & Hunter, 2000); statistical term strength analysis (Wilbur &
Yang, 1996); statistical text classification and a relational machine-learning method (Craven &
Kumlien, 1999); statistical identification of key phrases against an evolutionary protein family
background (Andrade & Valencia, 1997, 1998); pre-specified protein names and a limited set of
action verbs (Blaschke, Andrade, Ouzounis, & Valencia, 1999); and a proprietary information
extraction system (Thomas, Milward, Ouzounis, Pulman, & Carroll, 2000). Futrelle (2001a)
provides online full-text access to many biomedical text mining papers, including those from the
hard-to-get 2000 and 2001 Pacific Symposia on Biocomputing.
Bob Futrelle (2001a,b) has organized a large "bio-NLP" information network and
enunciated a radical vision which includes several of the themes of this paper, such as the
analogy between text and genome analysis, and the long history of information extraction in its
many guises. He sees the challenge as "understanding the nature of biological text, whatever that
turns out to be, linguistic theories not withstanding." He seems to feel that the traditional rules
and grammars of Chomskian linguistics are more hindrance than help.
Frankly, a fresh new approach is needed, fueled by the conviction that language is a
biological phenomenon, not a logical phenomenon. By this we mean that the nature of
language is as messy as the genome. The data and observed phenomena in all their richness
and variety are dominant and cannot subsumed by any elegant theories. This means that in
many ways, biologists have far better hopes of cracking the NLP problem than the
computational linguists, who are focused on mathematics and logic. Even when they look
at data, it is primarily as grist for their math mills.
Futrelle recommends, for example, building visualization tools such as a protein noun phrase
highlighter which could be used to "assemble a large collection of the standard textual
expression forms [and] map these onto the query forms for which they are the answers."
But Futrelle also goes beyond immediate practical needs. Like Wise (1999), he has a
coherent theory based on the biological nature of language.
By this I mean that language is a communicative capability of living organisms that has
evolved from deep biological roots and from social interactions over millions, and
ultimately, billions of years. I claim that language is not logical and mathematical,
because that's not the nature of the organism (us) that exhibits the language capability.
An example of this is found in our vocabularies. A technically skilled adult will have a
vocabulary of over 100,000 words, basically all memorized. The meaning of "bear" or
"ship" does not follow from the characters that make them up. We simply commit them
to memory. Linguists would like us to believe that our natural ability to "parse" is
radically different and can be explained as a rule-based system.
My radical view is that we understand language not by generalization to abstract rules as
much as by retaining examples and generalizing from them as needed. This is quite
within our capacity, given our 100,000 word vocabularies. We also do reason. I would
claim, again in the biological view, that this is done more by "imagined life" than by
logic. Humans have superb abilities to remember events and to build detailed mental
plans for future activities …. So we need to build this type of reasoning into our systems.
The analogy to genomics is clear. The coding of a particular protein by a particular
sequence of DNA bases is just an accident of evolution. Whatever rules now appear to prevail
(such as "zinc fingers" for DNA-binding proteins) can only be derived empirically, by looking
for patterns within the data. Purely logical approaches must wait for a richer knowledge base.
Only now, after the massive effort of half a century of molecular genetic research, sequencing
whole genomes, and building databases and tools such as GenBank, Gene Cards, and Proteome,
can we begin to think about prediction of protein structure and function from sequence data
alone. Biological linguistics now stands at the beginning of a comparably arduous journey.
These considerations put Swanson's, Kostoff's, Tanabe's, and Chowdhury's reliance on
human expertise and manual filtering in a better light. Perhaps they do not represent premature
surrender to difficulty so much as a necessary but hopefully temporary expedient. Perhaps they
are keeping "the human in the loop" (Kantor, 2001) only long enough to "study the human to learn
what to put in the machine" (Saracevic, 2001). This surprising interface between biomedical text
mining and the cognitive tradition in IR would make a worthy topic for another paper.
References
Allen, J. (1995). Natural Language Understanding, Second Edition. Redwood City, CA:
Benjamin/Cummings.
Andrade, M. A., & Valencia, A. (1997). Automatic annotation for biological sequences
by extraction of keywords from MEDLINE abstracts. Development of a prototype system.
Proceedings of the international conference on intelligent systems for molecular biology 5:25-32.
Andrade, M. A., & Valencia, A. (1998). Automatic extraction of keywords from
scientific text: application to the knowledge domain of protein families. Bioinformatics
14(7):600-607.
Bates, M. (1995). Models of natural language understanding. Proceedings of the
National Academy of Sciences, 92, 9977-9982.
Belkin, N. J., & Croft, W. B. (1992). Information filtering and information retrieval:
Two sides of the same coin? Communications of the ACM, 35, 29-38.
Blaschke, C., Andrade, M. A., Ouzounis, C., & Valencia, A. (1999). Automatic extraction
of biological information from scientific text: protein-protein interactions. Proceedings of
the international conference on intelligent systems for molecular biology, pp.60-67.
Bush, V. (1945). As We May Think. Atlantic Monthly, 176 (11), 101-108.
Cartia, Inc. (2000). ThemeScape product suite. Formerly online:
http://www.cartia.com/products/index.html [no longer accessible].
Chowdhury, G. G. (1999). Template mining for information extraction from digital
documents. Library Trends, 48, 182-208.
Craven, M., & Kumlien, J. (1999). Constructing biological knowledge bases by
extracting information from text sources. Proceedings of the International Conference on
Intelligent Systems for Molecular Biology, pp.77-86.
Dorre, J., Gerstl, P., & Seiffert, R. (1999). Text mining: Finding nuggets in mountains of
textual data. KDD-99, Association for Computing Machinery.
Doyle, L. (1961). Semantic road maps for literature searchers. Journal of the
Association for Computing Machinery, 8, 223-239.
Fan, W. (2001). Text mining, web mining, information retrieval and extraction from the
WWW references. Online: http://www-personal.umich.edu/~wfan/text_mining.html
Futrelle, R. P. (2001a). Natural language processing of biology texts. Online:
http://www.ccs.neu.edu/home/futrelle/bionlp/
Futrelle, R. P. (2001b). The past, present and future of biology text understanding.
Presented at the Conference on Biological Research with Information Extraction (BRIE), Tivoli
Gardens, Copenhagen, Denmark, July 26. Online:
http://www.ccs.neu.edu/home/futrelle/brie2001/index.html
Gifford, D. K. (2001). Blazing pathways through genetic mountains. Science, 293,
2049-2051.
Greenfield, L. (2001). Text mining. Online: http://www.dwinfocenter.org/docum.html
Hearst, M. (1997). Distinguishing between web data mining and information access.
Presentation for the Panel on Web Data Mining, KDD 97, August 16, Newport Beach, CA.
Online: http://www.sims.berkeley.edu/~hearst/talks/data-mining-panel/index.htm
Hearst, M. (1999). Untangling text data mining. In Proceedings of ACL'99: the 37th
Annual Meeting of the Association for Computational Linguistics, University of Maryland, June
20-26, 1999 (invited paper). Online:
http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
Hearst, M. (2001). About TextTiling. Online:
http://www.sims.berkeley.edu/~hearst/tiling-about.html
Humphreys, K., Demetriou, G., & Gaizauskas, R. (2000). Bioinformatics applications of
information extraction for scientific journal articles. Journal of Information Science, 26, 75-85.
IBM (1998a). Text analysis tools. Slide #8 of Intelligent Miner for Text Overview.
Online:
http://www-4.ibm.com/software/data/iminer/fortext/presentations/im4t23over/im4t23over8.htm
IBM (1998b). Text mining technology: Turning information into knowledge: A white
paper from IBM. Daniel Tkach (Ed.). Online:
http://www-4.ibm.com/software/data/iminer/fortext/download/whiteweb.pdf
Ingwersen, P., & Willett, P. (1995). An introduction to algorithmic and cognitive
approaches for information retrieval. Libri, 45, 160-177.
Johnson, F. C., Paice, C. D., Black, W. J., & Neal, A. P. (1993). The application of
linguistic processing to automatic abstract generation. Journal of Document and Text
Management, 1, 215-241.
Kantor, P. B. (2001). Lecture K: Natural language concepts. Information Retrieval class,
Rutgers University, School of Communication, Information, and Library Studies, New
Brunswick, NJ.
Kostoff, R. N. (1999). Science and technology innovation. Technovation, 19. Online:
http://www.dtic.mil/dtic/kostoff/Swanson2.txt
Kostoff, R. N., & DeMarco, R. A. (2001). Information extraction from scientific
literature with text mining. Analytical Chemistry (in press). Online:
http://www.onr.navy.mil/sci_tech/special/technowatch/kdocs/anchem2/txt
Kostoff, R. N., del Rio, J. A., Humenik, J. A., Garcia, E. O., & Ramirez, A. M. (2001).
Citation mining: Integrating text mining and bibliometrics for research user profiling. Journal of
the American Society for Information Science, 52, 1148-1156.
Kostoff, R. N., Toothman, D. R., Eberhart, H. J., & Humenik, J. A. (2000). Text mining
using database tomography and bibliometrics: A review. Online:
http://www.onr.navy.mil/sci_tech/special/technowatch/textmine.htm
KRDL (2001). Text mining: transforming raw text into actionable knowledge (white
paper). Kent Ridge Digital Labs. Online: http://textmining.krdl.org.sg/
Laender, A. H. F., Ribeiro-Neto, B., da Silva, A. S., & Teixeira, J. S. (2001). A brief
survey of web data extraction tools. In press.
Liddy, E. D. (2000). Text mining. Bulletin of the American Society for Information
Science, 27. Online: http://www.asis.org/Bulletin/Oct-00/liddy.html
Liddy, E. D. (2001). Data mining, meta-data, and digital libraries. DIMACS Workshop
on Data Analysis and Digital Libraries, May 17, Center for Discrete Mathematics and
Theoretical Computer Science, Rutgers University, New Brunswick, NJ.
Lindsay, R. K., & Gordon, M. D. (1999). Literature-based discovery by lexical statistics.
Journal of the American Society for Information Science, 50, 574-587.
Losee, R. M. (2001a). Natural language processing in support of decision-making:
phrases and part-of-speech tagging. Information Processing and Management, 37, 769-787.
Losee, R. M. (2001b). Term dependence: A basis for Luhn and Zipf models. Journal of
the American Society for Information Science, 52, 1019-1025.
Losiewicz, P., Oard, D. W., & Kostoff, R. N. (2000). Textual data mining to support
science and technology management. Online:
http://www.onr.navy.mil/sci_tech/special/technowatch/textmine.htm
Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of
Research and Development, 2, 159-165.
Marcus, M. (1995). New trends in natural language processing: Statistical natural
language processing. Proceedings of the National Academy of Sciences, 92, 10052-10059.
Mjolsness, E., & DeCoste, D. (2001). Machine learning for science: State of the art and
future prospects. Science, 293, 2051-2055.
Perez-Carballo, J., & Strzalkowski, T. (2000). Natural language information retrieval:
Progress report. Information Processing and Management, 37, 155-178.
Perrin, P. (2001). Personal communication, Molecular Systems research group, Merck &
Co., Inc., Rahway, NJ.
Qin, J. (2000). Working with data: Discovering knowledge through mining and analysis.
Bulletin of the American Society for Information Science, 27. Online:
http://www.asis.org/Bulletin/Oct-00/qin.html
Rau, L. F. (1988). Conceptual information extraction and retrieval from natural language
input. In RIAO 88, pp. 424-437. Paris: Centre des Hautes Etudes Internationales d'Informatique
Documentaire, 1997, General Electric, USA.
Rindflesch, T. C., Hunter, L., & Aronson, A. R. (1999). Mining molecular binding
terminology from biomedical text. Proceedings of the American Medical Informatics
Association Symposium, 1999, 127-131. Online:
http://www.amia.org/pubs/symposia/D005564.PDF
Rindflesch, T. C., Tanabe, L., Weinstein, J. N., & Hunter, L. (2000). EDGAR: extraction
of drugs, genes and relations from the biomedical literature. Pacific Symposium on
Biocomputing, 2000, 517-528.
Russell, S., & Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Upper
Saddle River, NJ: Prentice Hall.
Salton, G. (1992). The state of retrieval systems evaluation. Information Processing and
Management, 28, 441-449.
Salton, G., Allan, J., Buckley, C., & Singhal, A. (1994). Automatic analysis, theme
generation, and summarization of machine-readable texts. Science, 264, 1421-1426.
Saracevic, T. (2001). Personal communication and class discussions, Seminar in
Information Studies, Rutgers University, School of Communication, Information and Library
Studies, New Brunswick, NJ.
SDM (2001). Text mining 2002 [workshop prospectus]. Second SIAM International
Conference on Data Mining, Arlington, VA, April 13, 2002. Online:
http://www.cs.utk.edu/tmw02/
Sneiderman, C. A., Rindflesch, T. C., & Aronson, A. R. (1996). Finding the findings:
identification of findings in medical literature using restricted natural language processing.
Proceedings of the American Medical Informatics Association Annual Fall Symposium, 1996,
239-243.
Stapley, B. J. (2000). BioBiblioMetrics. Online:
http://www.bmm.icnet.uk/~stapleyb/biobib/
Stapley, B. J., & Benoit, G. (2000). Biobibliometrics: information retrieval and
visualization from co-occurrences of gene names in Medline abstracts. Pacific Symposium on
Biocomputing, 2000, 529-540.
Swanson, D. R. (1988). Historical note: Information retrieval and the future of an
illusion. Journal of the American Society for Information Science, 39, 92-98.
Swanson, D. R., & Smalheiser, N. R. (1997). An interactive system for finding
complementary literatures: A stimulus to scientific discovery. Artificial Intelligence, 91,
183-203.
Swanson, D. R., & Smalheiser, N. R. (1999). Implicit text linkages between Medline
records: Using Arrowsmith as an aid to scientific discovery. Library Trends, 48, 48-51.
Swanson, D. R., Smalheiser, N. R., & Bookstein, A. (2001). Information discovery from
complementary literatures: Categorizing viruses as potential weapons. Journal of the American
Society for Information Science and Technology, 52, 797-812.
Tanabe, L., Scherf, U., Smith, L. H., Lee, J. K., Hunter, L., & Weinstein, J. N. (1999).
MedMiner: An Internet text-mining tool for biomedical information, with application to gene
expression profiling. BioTechniques, 27, 1210-1217.
Thomas, J., Milward, D., Ouzounis, C., Pulman, S., & Carroll, M. (2000). Automatic
extraction of protein interactions from scientific abstracts. Pacific Symposium on Biocomputing,
2000, 541-552.
Wilbur, W. J., & Yang, Y. (1996). An analysis of statistical term strength and its use in
the indexing and retrieval of molecular biology texts. Computers in Biology and Medicine,
26, 209-222.
Wise, J. A. (1999). The ecological approach to text visualization. Journal of the
American Society for Information Science, 50, 1224-1233.
Witten, I. H., & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations. San Francisco: Morgan Kaufmann (Academic Press).
Table 1.
Initial List of Information Retrieval (IR) Concepts Related to Text Mining.
IR concept                          Authority (see References)
Artificial intelligence             Fan; Perrin
Bioinformatics                      Futrelle; Perrin
Citation mining                     Kostoff
Computational Linguistics           Fan; Hearst
Conceptual Graphs                   KRDL
Data Abstraction                    Fan
Data Mining                         Fan; Perrin; SDM
Database Tomography                 Kostoff
Document Mining                     Fan
Domain Knowledge                    KRDL
Electronic Commerce                 Fan
Factor Analysis                     SDM
Information Access                  Hearst
Information Extraction              Chowdhury; Fan; Futrelle; Kostoff; Perrin
Information filtering               Fan
Information Integration             Fan
Information Retrieval               Fan; Perrin
Information Visualization/Mapping   Futrelle; Fan; SDM
Intelligent Agents ("bots")         Fan
Knowledge Discovery                 Fan
Knowledge Extraction                Perrin
Knowledge Representation            Perrin
Language Identification             IBM
Machine Learning                    Fan; Futrelle; Perrin
Metadata Generation                 SDM
Natural language processing         Fan; Futrelle; Perrin; Rindflesch; Saracevic
Ontologies/Vocabularies/Lexicons    Futrelle
Phrase Extraction                   Fan
Question Answering                  Futrelle
Resource Discovery                  Fan
Resource Indexing                   Fan
Semantic Modeling                   Perrin; SDM
Semantic Processing                 Rindflesch
Statistical Language Modeling       Fan
Stemming                            SDM
Syntactic Processing                Saracevic
Template Mining                     Chowdhury; KRDL
Text Analysis                       Futrelle; IBM
Text Classification/Categorization  Fan; Hearst (distinct); IBM; SDM
Text Clustering                     Fan; IBM
Text Data Mining                    Hearst; Kostoff
Text Parsing                        SDM
Text Purification                   SDM
Text Segmentation/"TextTiling"      Hearst; SDM
Text Summarization                  Futrelle; IBM; Saracevic; SDM
Text Understanding                  Futrelle; Fan
Web Data Mining                     Hearst
Web Mining                          Fan
Web Utilization Mining              Fan
Figure 1. ThemeScape™ visualization of a collection of 4,314 Y2K debate forum documents
(Cartia, 2000, expired website).
Figure 2. BioBiblioMetrics retrieval from a search on “DNA repair” and “recombination”
(Stapley, 2000).