IR&NLP Coursework P1
Text Analysis Within The Fields Of Information Retrieval and
Natural Language Processing
By Ben Addley
2003695
Academic Year 2004 - 2005
Ben Addley IR&NLP Coursework P1
Abstract
As users and producers of information, we have created a spiral of ever-increasing quantities of stored data. With the amount of directly accessible data available to us today, we need to find new ways to manage and review this overabundance of textual information.
Text analysis, and in the case of this coursework Automatic Text Analysis, is the study of how a computer can be used to identify the context and importance of text without the intervention of a human. This paper introduces the technical and linguistic terminology necessary for an understanding of the role Text Analysis plays in the two core application areas of Information Retrieval (IR) and Natural Language Processing (NLP).
Text analysis is nothing new: understanding what fundamentally links words, sentences, passages and even whole books has been of interest since the Middle Ages. The first advances in purely computer-based text analysis came during the 1950s; however, research into computational linguistics and the “holy grail” of an understanding machine began in the 1940s, right at the birth of modern computing.
The basic building block of text analysis is textual understanding, and in particular the language it is built upon. Text analysis is based around two differing approaches. The first is rule-based, where text is parsed through a series of modular analytical components; each analytical area concentrates on a particular component of natural language (morphemes, words, syntax, semantics etc.) and calls upon a number of pre-defined rules. The other is statistical learning (sometimes referred to as machine or example-based learning), which requires the computer to take in huge amounts of text and, by comparing common compositions of texts, gradually “learn” (through statistical analysis) how language is formed. As well as these two approaches to the underlying technology (rule-based and statistical systems), a “third way” exists in hybrid systems combining the two to achieve more effective results.
This is still an emerging and exciting area of Computer Science, with research being undertaken to improve quality in all areas. There is a real sense that further advances are possible in automating a number of the processes currently undertaken by humans. There is also a business requirement driving research and development. Whether we approve of it or not, we now live in a truly global village, with a need for fast, efficient and above all accurate tools to assist in our international trading environment.
Keywords: Text Analysis, Textual Understanding, Information Retrieval, Natural
Language Processing, Example Based Machine Learning, Rule Based Machine Learning.
Contents

ABSTRACT
DISCUSSION NOTES
QUESTIONS FOR DISCUSSION
1. INTRODUCTION
2. HISTORY OF TEXT ANALYSIS
2.1 THE INNOVATORS
3. THE BUILDING BLOCKS OF TEXTUAL UNDERSTANDING
3.1 RULE BASED APPROACH
3.2 STATISTICAL BASED APPROACH
3.3 AMBIGUITY
4. TEXT ANALYSIS: TECHNOLOGY OVERVIEW
4.1 PRE-PROCESSING
4.2 PRODUCT TYPES
4.2.1 LANGUAGE
4.2.2 CONTENT
5. TEXT ANALYSIS WITHIN THE FIELD OF INFORMATION RETRIEVAL
5.1 DISTILLING THE MEANING OF A DOCUMENT
5.2 TEXT BASED NAVIGATION
5.3 TOPIC STRUCTURE
5.4 CLUSTERING
5.5 AUTOMATIC TEXTUAL SUMMARISATION
5.5.1 EXTRACTION
5.5.2 ABSTRACTION
6. TEXT ANALYSIS WITHIN THE FIELD OF NATURAL LANGUAGE PROCESSING
6.1 MACHINE TRANSLATION
6.2 SPEECH TO TEXT SYSTEMS
6.3 TEXT TO SPEECH SYSTEMS
6.4 CHATTERBOTS (UNDERSTANDER-SYSTEMS)
6.5 ANTI PLAGIARISM TOOLS
7. WHAT THE FUTURE HOLDS (IN MY OPINION)
8. CONCLUSIONS
9. REFERENCES
9.1 WEB REFERENCES
9.1.1 GENERAL WEB RESOURCES
9.1.2 GENERAL BOOK RESOURCES

FIGURE 1: COMPUTERISED UNDERSTANDING OF LANGUAGE, WINOGRAD T. STANFORD AI SERIES
FIGURE 2: EXAMPLE OF TOPIC STRUCTURE (REPORT WIDE)
Discussion Notes
Article from The New York Times December 25, 2003: ‘Get Me Rewrite!’ ‘Hold On, I’ll Pass
You the Computer.’ By Anne Eisenberg.
In the famous sketch from the TV show “Monty Python’s Flying Circus,” the actor John Cleese
had many ways of saying a parrot was dead, among them, “This parrot is no more,” “He’s
expired and gone to meet his maker,” and “His metabolic processes are now history.”
Computers can’t do nearly that well at paraphrasing. English sentences with the same meaning
take so many different forms that it has been difficult to get computers to recognize
paraphrases, much less produce them. Now, using several methods, including statistical
techniques borrowed from gene analysis, two researchers have created a program that can
automatically generate paraphrases of English sentences.
The program gathers text from online news services on specific subjects, learns the
characteristic patterns of sentences in these groupings and then uses those patterns to create
new sentences that give equivalent information in different words. The researchers, Regina
Barzilay, an assistant professor in the department of electrical engineering and computer
science at the Massachusetts Institute of Technology, and Lillian Lee, an associate professor of
computer science at Cornell University, said that while the program would not yield paraphrases
as zany as those in the Monty Python sketch, it is fairly adept at rewording the flat cadences of
news service prose. Give it a sentence like “The surprise bombing injured 20 people, 5 of them
seriously,” Dr. Barzilay said, and it can match it to equivalent patterns in its databank and then
produce a handful of paraphrases. For instance, it might come up with “Twenty people were
wounded in the explosion, among them five in serious condition.”
Programs that can detect or crank out multiple paraphrases for English sentences could one
day have wide use. They might help create summaries of reports or check a document for
repetition or plagiarism. Questions typed into a computer with such a program might in the
future be automatically paraphrased to make it easier for a search engine to find data.
Such programs might even be an aid to writers who want to adapt their prose to the background
of their readers. Dr. Lee said the researchers had thought about using it “as a kind of ‘style dial’” to rewrite documents automatically for different groups - adapting articles on technical subjects
for a children’s encyclopaedia, for example. She cautioned, however, that the work was
preliminary and much more research was needed before it might be available for practical use.
Fernando Pereira, chairman of the computer and information science department at the
University of Pennsylvania, said that the paraphrasing work had given him pause. “It’s a little bit
humbling if you have the idea that we are creative when we write,” he said, only to discover that
one’s special turns of phrase have already been tried by hundreds of other writers and can be
found online.
“The real insight of this work,” he said, “is that if there is a way of saying something, someone
has already said it.”
Questions for Discussion
1. The program outlined in the article above gathers text from online news services,
learns the characteristic patterns of sentences in these groupings and then uses
those patterns to create new sentences that give equivalent information in different
words. Do you think a program like this can only operate within a limited domain like
news reports, or can it be applied to wider subjects?
2. What are your thoughts on Dr Fernando Pereira’s comment: “The real insight of this
work is that if there is a way of saying something, someone has already said it”?
3. What are the potential problems with applications that adapt prose or paraphrase
sentences? Do you think that the context or subtle meaning might be lost if parsed
through software as described in the article?
4. Do you think that tools such as Barzilay and Lee’s are applications to aid humans in
the work they do or should they be used instead of humans? What are the problems
with the latter option and are there any ways in which other areas of NLP could be
applied to assist?
1. Introduction
What is Text Analysis? A very basic question which you would expect to have a very basic answer. Unfortunately, this area, concerned with the textual understanding of a document, is more challenging than that, and a two-line definition simply does not suffice.
In this paper I will introduce the technical and linguistic terminology necessary for an understanding of the role Text Analysis plays in the two core application areas of Information Retrieval (IR) and Natural Language Processing (NLP). I will not cover the complicated and highly technical processes which go into making Text Analysis actually work; what is presented below constitutes a mere introduction and should allow the reader to research areas of interest in more depth.
In its rawest sense, text analysis is key to any IR or NLP process. We as users must know some information about what it is we are trying to retrieve; this comes perhaps from the title of the work, the author’s name, section or chapter headers and, of course, the content. As human beings we are able to ingest and process this data in a number of sophisticated ways, the most important of these being our cognitive ability to understand the context and importance of certain words and phrases. Text analysis, and in the case of this coursework Automatic Text Analysis, is the study of how a computer can be used to identify the context and importance of text without the intervention of a human.
Wouldn’t it be nice not to have to rely on others’ IT literacy when searching for documents via search engines or the web? To be able simply to enter a phrase, question or query and have a computer return not just documents that contain those keywords, but somehow understand what it was you were looking for and return only those documents you were actually seeking?
These techniques, as well as being fundamental to IR, can aid and extend the effectiveness of other areas of research within NLP: they can aid interpretation, textual navigation, translation and speech-based tools, and facilitate word/phrase-spotting programs.
As our world becomes a smaller place with increased and more effective communication
networks, we have to utilise these techniques to add value to our interactions. Asking a natural
question and receiving a natural, sensible and contextual response from a machine is an
important goal in our global community. Text Analysis plays a fundamental role in achieving that
goal but it is a complicated one!
This coursework will attempt to answer some of the basic questions involved in how a computer
achieves this feat of Artificial Intelligence (AI). It will also look at where Text Analysis technology
currently resides within the field of IR & NLP and what the future might hold if we continue to
research and develop in this exciting and challenging area.
A question you may wish to bear in mind whilst reading this document is: does Text Analysis constitute actual understanding of the textual input, or simply an electronic approximation of understanding?
2. History of Text Analysis
Text analysis is nothing new: understanding what fundamentally links words, sentences, passages and even whole books has been of interest since the Middle Ages. The first threads of text analysis can be followed back to medieval biblical scholarship. Intellectuals would try to find parallels between the New and Old Testaments, where passages might be linked according to places, periods and people. This resulted in the first concordances1, a tool still used in computer-based text analysis today.

The first advances in purely computer-based text analysis came during the 1950s; however, research into computational linguistics and the “holy grail” of an understanding machine began in the 1940s, right at the birth of modern computing.
2.1 The Innovators
“In 1949, Warren Weaver proposed that computers might be useful for ‘the solution of world-
wide translation problems’ and the resulting effort, called machine translation, attempted to
simulate with a computer the presumed functions of a human translator” [Avron Barr 1980]. IBM first demonstrated a basic word-for-word translation machine in 1954, but these early attempts at machine translation failed due to the simplistic idea that word-equivalency techniques and sentence re-ordering would suffice for a translation machine.
AI research took on new ideas, the most important of these being textual understanding. The work of Chomsky in the field of linguistic theory, coupled with advancements in programming languages in the 1960s, heralded a surge in AI/NLP work. By the 1970s the approach was to model human language as knowledge-based systems and to understand how language works in order to build up rules that apply to these systems.
In 1972 Terry Winograd “group[ed] natural language programs according to how they represent and use knowledge of their subject matter” [Avron Barr 1980]. He proposed four historical groupings based on this approach. The first programs that attempted to analyse and understand textual input were built on limited domains (BASEBALL, ELIZA etc.). Then followed systems that used semantic memories and indexing to retrieve and understand words or phrases. A third approach followed during the mid to late sixties, called limited logic systems, and finally a fourth group, knowledge-based systems, used first order logic and semantic nets; examples include William Woods’s LUNAR program and Winograd’s SHRDLU system.
More recently we have seen a shift away from the ideas of Winograd and the early pioneers of
textual understanding within NLP. With increased processor speeds and the advent of super-
computing, new methods and theories have been developed in example based machine
learning. Today the arguments rage about the respective merits of the two main approaches,
with new developments in hybrid systems combining the two.
1 In its most basic form it’s an index that includes a line of context against each entry or occurrence of a word.
3. The Building Blocks of Textual Understanding
The basic building block of text analysis is textual understanding and in particular the language
it is built upon. When we discuss the fundamentals of this subject we must also understand the
guiding principles of language makeup.
We are unique among all species of animal in that we don’t just communicate through signals (as other creatures are capable of) but use sophisticated language properties to do so.
Artificial Intelligence (AI) is the branch of Computer Science that undertakes investigation and
development of models and tools to replicate this “behaviour” in machines. AI has been defined
as “the science of making machines do things that would require intelligence if done by men”
[Minsky 1968].
Winograd proposed a series of pre-defined stages that must be adhered to for computerised
understanding of language to occur. These follow closely the traditional linguistic approach to
word formation processes.
3.1 Rule Based Approach
Figure 1 (below) demonstrates this logical rule-based approach to textual understanding: written language is parsed through a series of modular analytical components. The output of one module acts as an input to the next, and so on until the process has been completed. Each analytical area concentrates on a particular component of natural language (morphemes, words, syntax, semantics etc.) and calls upon a number of pre-defined rules (shown in the ellipses) to adjudge whether that component of text fits a particular rule. Once identified, it is passed on to the next module for further analysis; if later (usually at the semantic or pragmatic stage) it proves to be incompatible or incorrect, it is passed back to a previous layer.
The disadvantage of such an approach is that it is language dependent and will need to be re-modelled for each additional language you want to perform textual analysis on. The other main problem is that it requires large amounts of initial human intervention at the programming and rule-development stage, which ties in with the first problem. Such problems have an impact on applications, as we’ll see later in this paper.
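The modular, rule-driven flow described above can be sketched in a few lines of Python. This is a hypothetical toy, not Winograd's actual components: the suffix rules and the tiny part-of-speech lexicon are invented for illustration, and a real system would have far richer rule sets at every stage.

```python
# Toy rule-based pipeline (hypothetical): morphology feeds syntax, as in
# Figure 1. The suffix rules and part-of-speech lexicon are invented.

def morphological(tokens):
    # Rule: strip a known suffix to expose the stem, recording the morpheme.
    suffixes = ("ing", "ed", "s")
    analysed = []
    for token in tokens:
        for suffix in suffixes:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                analysed.append((token[: -len(suffix)], suffix))
                break
        else:
            analysed.append((token, ""))
    return analysed

def syntactic(analysed):
    # Rule: tag each stem using a tiny hand-written lexicon.
    lexicon = {"dog": "NOUN", "bark": "VERB", "the": "DET"}
    return [(stem, suffix, lexicon.get(stem, "UNK")) for stem, suffix in analysed]

def analyse(sentence):
    # Each module's output is the next module's input.
    return syntactic(morphological(sentence.lower().split()))

print(analyse("The dog barks"))
```

In a fuller pipeline, semantic and pragmatic modules would follow in the same chained fashion, and could pass a reading back down when it proves incompatible.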
3.2 Statistical Based Approach
Another name for this approach is machine learning or example-based learning. It requires the computer to parse huge amounts of text and, by comparing common compositions of texts (usually domain specific), gradually “learn” through statistical analysis how syntax and certain bigrams and trigrams of words and phrases within a language are formed. The advantage of this method is that it is language independent and requires little human intervention; in theory, given a large enough corpus in any language, you can teach the system that language in a relatively short period of time. The disadvantage comes from the scale and size of corpus required to contain enough compositions and varied bigrams and trigrams to develop a sufficient understanding of the language.
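The statistical approach can be illustrated with a toy bigram model. The corpus below is only a dozen words where a real system would need millions, but the principle, learning how words follow one another from observed frequencies, is the same.

```python
from collections import Counter

# Toy example-based learning: count bigram frequencies in a (tiny) corpus
# and use them to estimate how likely one word is to follow another.
corpus = "the cat sat on the mat . the cat slept on the mat .".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_next(word, following):
    # Maximum-likelihood estimate of P(following | word) from the counts.
    return bigrams[(word, following)] / unigrams[word]

print(p_next("the", "cat"))  # "the" is followed by "cat" in 2 of its 4 uses
```

Scaling the corpus up is all that separates this sketch from a genuinely useful language model; no per-language rules need to be written.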
Figure 1: Computerised Understanding of Language, Winograd T. Stanford AI Series
3.3 Ambiguity
Ambiguity increases the range of possible interpretations of natural language, and a computer
has to find a way to deal with this. [Inman D. 1997] This is another key issue for Text Analysis
as computers have to make choices on how interpretations of words and phrases are made.
This is an easier problem to overcome with an example based learning approach.
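A crude sketch of how an example-based system might resolve lexical ambiguity follows. The training sentences and sense labels are invented, and the method (counting overlapping context words) is a simplification of real word-sense disambiguation, but it shows why ambiguity is more tractable when you have examples to compare against.

```python
from collections import Counter

# Hypothetical sense inventory and training sentences; real systems learn
# from large sense-tagged corpora rather than three hand-written examples.
STOPWORDS = {"the", "on", "at", "a", "we"}

TRAINING = [
    ("deposit money at the bank", "FINANCE"),
    ("the bank raised interest rates", "FINANCE"),
    ("fishing on the river bank", "RIVER"),
]

def content_words(text, target):
    # Keep only words that carry meaning, excluding the ambiguous word itself.
    return {w for w in text.lower().split() if w not in STOPWORDS and w != target}

def disambiguate(sentence, target="bank"):
    # Score each sense by how many context words it shares with the input.
    scores = Counter()
    context = content_words(sentence, target)
    for example, sense in TRAINING:
        scores[sense] += len(context & content_words(example, target))
    return scores.most_common(1)[0][0]

print(disambiguate("we walked along the river bank"))
```

A rule-based system would need an explicit rule for every such distinction; here the distinction falls out of the examples.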
4. Text Analysis: Technology Overview
Within Text Analysis we have seen the two approaches to the underlying technology (rule-based and statistical systems). There is of course a third way: a hybrid method combining the best elements of the rule-based and statistical methodologies.
4.1 Pre-processing
When undertaking analysis of a target text it is helpful to carry out automatic pre-processing to strip away the words with no semantic meaning. This can aid example-based learning, as unusual or one-off bigrams/trigrams are negated. Another approach is to carry out stemming: this enables prefixes, suffixes and endings (also known as morphemes) to be identified, leaving just the stem (or core meaning) of the word. For example, learn is the stem of learning; by concentrating on the stem, both terms will be identified as having the same core, thereby improving the analysis of the text.
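The two pre-processing steps just described, stop-word removal and stemming, might be sketched as follows. The stop-word and suffix lists are illustrative only; production systems use much fuller, ordered rule sets such as the Porter stemmer's.

```python
# Crude pre-processing sketch: stop-word removal, then suffix stripping.
STOPWORDS = {"the", "of", "a", "is", "and", "to", "in"}
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def stem(word):
    # Strip the first matching suffix, but never below a three-letter stem.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The learning of a language is improving"))
```

Note how learning and learns both reduce to the common stem learn, so later analysis treats them as the same core term.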
4.2 Product Types
There is a rich variety of software that supports the general task of Text Analysis within the different disciplines of Human Computer Interaction (HCI). Due to this variety it is helpful to set out a brief general classification of the main areas. The classification below, taken from Harold Klein’s discussions after the Acapulco ICA conference in 2000, breaks text analysis software down into two broad areas: firstly language and its makeup, and secondly content, which deals with the “what” being communicated:
4.2.1 Language
Dealing with the use of language, of which there are two further sub-categories:
• Linguistic: applications like parsing, lemmatising words2
• Data bank: information retrieval in texts, indexers, concordances, word lists,
KWIC/KWOC (key-word-in-context, key-word-out of-context)
4.2.2 Content
Dealing with the content of human communication, mainly texts.
• Qualitative: looking for regularities and differences in text, exploring the whole text (QDA - qualitative data analysis). A few programs allow the processing of audio and video information also.

2 Lemmatising means grouping related words together under a single headword. A Lemmatiser (tool) allows you to define groups of related words and then apply your groupings to words displayed in the Wordlist.
• Event data: analysis of events in textual data
• Quantitative: analyse the text selectively to test hypotheses and draw statistical
inferences. Output is a data matrix that represents the numerical results of the
coding.
o Category systems: provided by the software developer (instrumental) or by
the researcher (representational). This is selective: only the given search
patterns are searched for in the text and coded. Software packages with
built-in dictionaries are often language restricted; some have limits on the
text unit size and are restricted to processing responses to open-ended
questions rather than analysing mass media texts. The categories can be
thematic or semantic, which can have implications for the definition of text
units and external variables.
o No category system: using co-occurrences of words/strings and/or
concepts, which are displayed as graphs or dendrograms.
o For coding responses to open-ended questions only: these programs
cannot analyse huge amounts of text; they fit rather homogeneous texts
only and are often limited in the size of a text unit.
[Klein 2002]
Obviously the above is just one interpretation of the many approaches taken in the field of Text
Analysis.
5. Text Analysis Within the Field of Information Retrieval
So we now know a little about what text analysis is and the linguistic theory behind the concept. The key question now is: how do we transform this powerful concept into products that can actually help us in our day-to-day lives? Text analysis is already used in many commercial systems, and the first part of this section will investigate where the technology currently lies, what it is used for and who is using it.
5.1 Distilling the Meaning of a Document
Making informed and ultimately correct decisions in our busy working lives often requires analysing large, time-consuming volumes of textual information. Students, researchers and professionals (such as analysts, lawyers and editors) are faced with various text analysis tasks, all requiring the extraction of the core meaning from a document.
In today’s “information rich” environment, huge piles of information build up in traditional repositories held in libraries, businesses and individual PCs, and of course on the ever-ubiquitous World Wide Web. The amount of information being produced and stored is growing at an extraordinary rate, with some predictions stating that, at current growth rates, we will have more information than atoms to store it on! Whether you believe the doomsday-like prophecies or not, it is a fact that human beings are increasingly unable to meet the challenges of this growth. “Mankind is searching for intelligent electronic assistants to help with text analysis projects” [HALLoGRAM Publishing, 2000].
In particular we require help to derive the semantic value of a document in a concise form. Once
achieved we can apply the knowledge to a number of other applications, as we’ll discuss
throughout the rest of this section.
5.2 Text Based Navigation
To understand this application of Text Analysis we first need to understand semantic networks.
“Semantic networks are knowledge representation schemes involving nodes and links between
nodes” [Duke University]. Essentially a conceptual web of linked nodes that point to all other
nodes containing listed objects. “Concepts stored in the semantic network are hyper linked to
those sentences where they have been encountered, and the sentences are in turn hyper linked
to the places in the original text from where they have been retrieved” [Sergei Ananyan,
Alexander Kharlamov 2004].
By using Text Analysis principles to build semantic networks automatically, we can efficiently navigate through stored texts and usefully linked documents, which is a core application within the wider field of Information Retrieval. This can be applied simultaneously to multiple documents, creating a very powerful IR tool. Direct applications can be found in website design and navigation, and I myself have investigated a version of this tool for my BSc final project. I have used Microsoft Indexing Services (part of the MS IIS suite of administrative tools) to index and analyse documents stored within a catalogue. It analyses the textual content of documents, sorting it into categories which can then be searched using a simple query language.
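The hyper-linking idea quoted above, concepts linked to the sentences that mention them, can be sketched with a simple inverted structure. The sentences and concept list below are invented for illustration; a real system would extract the concepts automatically.

```python
from collections import defaultdict

# Each concept points to the sentences containing it, tagged with their
# positions in the original text, so navigation can run in both directions.
def build_network(sentences, concepts):
    network = defaultdict(list)
    for position, sentence in enumerate(sentences):
        words = set(sentence.lower().split())
        for concept in concepts:
            if concept in words:
                network[concept].append((position, sentence))
    return network

document = [
    "Text analysis underpins retrieval",
    "Retrieval systems index documents",
    "Documents are linked by shared concepts",
]
network = build_network(document, {"retrieval", "documents"})
print([position for position, _ in network["retrieval"]])  # sentences 0 and 1
```

Running the same routine over many documents at once gives the simultaneous multi-document navigation described above.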
5.3 Topic Structure
What do we mean when we talk about topic structure? We can use text analysis to identify the most relevant and significant concepts from the semantic network of the text and transform them into a tree-like structure of topics sorted by importance. The limbs of the tree structure represent relations between headings and content in the text. Some of these limbs are strong and form the basis of the structure; others are weaker, often irrelevant or indirect, and need to be replaced with more direct ones. To take this paper as an example: I have an introduction with no nested topics, followed by a series of topics, some with sub-topics and areas of interest. This builds up over the document as a whole to construct a visual structure revealing a hierarchy of themes within the text, which can then be used as a powerful information retrieval method.
Below is a tree like listing of this coursework as viewed through a topic structure. Main headings
hug the left hand side (red line) with secondary (blue) and tertiary (green) topics indented to the
right. Content text is omitted. This is a rather simplistic model but demonstrates the point.
Figure 2: Example of Topic Structure (Report Wide)
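The kind of tree shown in Figure 2 can be derived mechanically from heading depths. The sketch below is a generic outline-to-tree routine, not any particular product's algorithm, and the headings are a shortened version of this paper's own.

```python
# Each heading is nested under the nearest shallower heading above it:
# depth 0 hugs the left-hand side, deeper headings indent to the right.
def build_tree(headings):
    """headings: list of (depth, title); returns a nested (title, children) tree."""
    root = ("document", [])
    stack = [(-1, root)]
    for depth, title in headings:
        node = (title, [])
        while stack[-1][0] >= depth:        # climb back up to the parent level
            stack.pop()
        stack[-1][1][1].append(node)        # attach under the current parent
        stack.append((depth, node))
    return root

outline = [(0, "Introduction"), (0, "History"), (1, "The Innovators"),
           (0, "Building Blocks"), (1, "Rule Based"), (1, "Statistical")]
tree = build_tree(outline)
print([title for title, _ in tree[1]])  # the main (leftmost) topics
```

The resulting nesting mirrors the report-wide structure of Figure 2: main topics at the top level, sub-topics as their children.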
5.4 Clustering
Clustering is built upon the topic-structure technology described above, but goes one stage further. Topic structures sever the links representing weak relations in the text and substitute certain indirect relations with direct ones. With clustering, on the other hand, those links that fall below a pre-defined strength threshold are eliminated altogether. This allows a collection of texts to be broken up into individual groups that more clearly represent a common subject or theme, so documents can be grouped into particular subject areas, facilitating searching, indexing and analysis on that theme as well as on the original topic structure.
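The threshold step just described can be sketched as follows: links weaker than the cut-off are dropped outright, and whatever remains connected forms a cluster. The documents and link strengths below are invented for illustration.

```python
# Links below the threshold are eliminated altogether; the documents that
# remain connected form the clusters (connected components of the graph).
def clusters(links, threshold):
    graph = {}
    for (a, b), strength in links.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if strength >= threshold:           # weak links are dropped outright
            graph[a].add(b)
            graph[b].add(a)
    seen, groups = set(), []
    for node in graph:                      # collect connected components
        if node in seen:
            continue
        group, stack = set(), [node]
        while stack:
            current = stack.pop()
            if current not in group:
                group.add(current)
                stack.extend(graph[current])
        seen |= group
        groups.append(group)
    return groups

links = {("d1", "d2"): 0.9, ("d2", "d3"): 0.2, ("d3", "d4"): 0.8}
print(clusters(links, 0.5))  # two clusters: {d1, d2} and {d3, d4}
```

Raising the threshold breaks the collection into more, tighter groups; lowering it merges them back together.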
5.5 Automatic Textual Summarisation
“The goal of automatic summarisation is to take an information source, extract content from it,
and present the most important content to the user in a condensed form and in a manner
sensitive to the user’s or application’s need”. [Inderjeet Mani 2001]
Textual summarisation techniques have their historical roots in the 1960s; however, with the proliferation of the Internet and the growth in document production, new interest in the technology has developed. Broadly speaking, techniques can be divided into two categories:
5.5.1 Extraction
Extraction techniques, in general, simply copy the information deemed most important into a summary. There are numerous methods for carrying out extraction summarisation; one important and widely used methodology uses algorithms to score individual sentences in the target text. It is both robust and accurate, and based on the number of important semantic concepts in a sentence: the larger the number and the stronger these concepts are, coupled with the relationships they have with each other, the higher the semantic weight (or score) of the sentence. The summarisation tool then sorts and collects only those sentences that fall above a pre-set score or weight, resulting in a truncated piece of text that summarises the original… hopefully accurately!
“The size of the summary is controlled through changing the sentence selection threshold. An
advanced algorithm used for developing an accurate semantic network ensures the high quality
and relevance of the created summary”. [Sergei Ananyan, Alexander Kharlamov 2004] The
same concept can be applied to paragraphs and other units of text within documents.
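A minimal sketch of the sentence-scoring idea, using raw document-wide word frequency as a crude stand-in for the semantic concepts described above (the systems cited build a far richer semantic network; the threshold value here is my own illustrative choice):

```python
import re
from collections import Counter

def summarise(text, threshold=1.2):
    # Split into sentences, then score each by the average
    # document-wide frequency of its words.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    freq = Counter(re.findall(r'\w+', text.lower()))

    def score(sentence):
        toks = re.findall(r'\w+', sentence.lower())
        return sum(freq[t] for t in toks) / len(toks) if toks else 0.0

    # Keep only sentences above the pre-set threshold,
    # preserving their original order.
    return ' '.join(s for s in sentences if score(s) > threshold)

print(summarise("Cats sleep. Cats sleep often. Dogs bark."))
# -> Cats sleep. Cats sleep often.
```

Raising the threshold shrinks the summary, which is exactly the size-control mechanism described in the quotation above.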
5.4.2 Abstraction
Abstraction in its purest sense is the process of distancing ideas from objects. In the context of summarisation it involves paraphrasing sections of the source document. Abstraction can usually condense a text more thoroughly than the extraction process discussed above, but the programs that do this are generally harder to build and implement. An example of abstraction in summarisation would be to “understand” the concept of a sentence or phrase and re-express it more concisely.
So how do we use automatic textual summarisation, and what is it good for? There are numerous practical applications, and many professions use the technology in everyday tasks. Probably the most common is in the news and broadcasting industry, where summarisation is used for newspaper articles and for scientific and technological journals. Another important and widely used application is in search engine technology and IR. Automatic textual summarisation is a “cross over” application, used in both the IR fields and NLP applications.
6. Text Analysis Within the Field of Natural Language Processing
In this section I’ll discuss some of the realised applications and products for Text Analysis technology, along with some current and future advancements. TA is just one part of the wider AI problem that we face within NLP. Core to the issue is how we can develop systems able to interpret the way we, as humans, communicate. It is the basis of many other technologies besides information retrieval: machine translation, for example, is based on the textual analysis of data, as are text-to-speech engines and language spotters, all discussed further in this section.
6.1 Machine Translation
“Machine translation (MT) is the application of computers to the task of translating texts from
one natural language to another. One of the very earliest pursuits in computer science, MT has
proved to be an elusive goal, but today a number of systems are available which produce output
which, if not perfect, is of sufficient quality to be useful in a number of specific domains” [EAMT
June 2004]. MT is probably the best known of the text analysis applications in existence today, but it remains one of the most intangible. Although there have been many advances in the field in recent years, products are still far from perfect and often cannot deal with truly natural language, such as colloquialisms.
The primary role of MT software is twofold:
1. Providing gisting (or rough translation) to non-native speakers of a particular language, enabling them to gain an understanding of a document and whether it is of relevance to them.
2. Forming part of professional transcription and translation workbenches. This application allows users with knowledge of a language to filter and triage large amounts of text. This can free up the professional’s time, give them a head start on the target text and enable them to work more efficiently. There are other associated tools within this area, such as translation memories and glossaries, that also rely on text analysis techniques; however, these fall outside the main scope of the project.
6.2 Speech to Text Systems
I’m sure we’ve all now been exposed to the technological wonder that is the speech-to-text system. Every time we call our bank or our gas supplier we are faced with the prospect of an automated voice asking us to say which service we want. After repeatedly asking for our balance we are invariably put through to the section selling services of one description or another! It’s true that the technology doesn’t yet seem particularly accurate, but big business has identified it as a major efficiency saving, and that is the driver behind it.
There are of course other, more refined applications for the technology: language recognition, speech recognition, biometric tools and speaker verification. There are some links to companies providing these applications in the general web resources section.
6.3 Text to Speech Systems
One of the first aspects of natural language to be modelled was the actual articulation of speech
sounds. “Early models of ‘talking machines’ were essentially devices which mechanically
simulated the operation of the human vocal tract. More modern attempts to create speech
electronically, are generally referred to as speech synthesis”. [Yule G. The Study of Language;
Pg 115]
The concept is to take a text and, using advanced TA techniques, tokenise it into the phonemes that make up the individual words. The acoustic properties of those phonemes are then reproduced electronically as sounds that can be played back. It is a little trickier than this over-simplified explanation suggests, but it demonstrates the idea. There are multiple uses, such as:
• Call centre technology
• Mobile texting to landline phones - where the text message is translated into speech
and transferred automatically to the designated landline connection. A working
product developed by Loquendo (a subsidiary of Italia Telecom)
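The tokenise-and-reproduce pipeline above can be sketched as a dictionary lookup. The phoneme dictionary below is a toy of my own using ARPAbet-style symbols; a real engine would carry a full pronunciation lexicon plus letter-to-sound rules for words it has never seen.

```python
# Illustrative grapheme-to-phoneme lookup. The entries use ARPAbet-style
# symbols, but the dictionary itself is a toy, not a real lexicon.
PHONE_DICT = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def to_phonemes(text):
    # Tokenise the text into words, then map each word to its phoneme
    # sequence; unknown words are flagged for the letter-to-sound
    # fallback that a real engine would provide.
    phones = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        phones.append(PHONE_DICT.get(word, ["<OOV:%s>" % word]))
    return phones

print(to_phonemes("Hello, world!"))
# -> [['HH', 'AH', 'L', 'OW'], ['W', 'ER', 'L', 'D']]
```

A synthesiser would then play back the acoustic realisation of each phoneme in sequence, which is the electronic reproduction step described above.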
6.4 Chatterbots (Understander-Systems)
Things have come a long way since the early days of Eliza in the 1960s and Michael Mauldin of Carnegie Mellon University, who coined the term “Chatterbot” in 1994. Basic Chatterbots such as Eliza, Alice and Brian3 use a process of pattern recognition to analyse text and create an illusion of understanding. Questions posed by the user trigger, through occurrences of keywords or phrases, a particular type of pre-determined response, usually a question incorporating that phrase or keyword in the answer.
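That keyword-triggered mechanism can be sketched in a few lines. The patterns and response templates below are my own toy examples in the spirit of Eliza, not taken from any actual bot:

```python
import re

# Each rule pairs a keyword pattern with a response template; the
# captured phrase is echoed back inside a new question, creating
# the illusion of understanding.
RULES = [
    (re.compile(r"\bi feel (.+)", re.I), "Why do you feel {0}?"),
    (re.compile(r"\bi am (.+)", re.I), "How long have you been {0}?"),
]
DEFAULT = "Please tell me more."

def respond(utterance):
    utterance = utterance.rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(match.group(1))
    # No keyword matched: fall back to a neutral prompt.
    return DEFAULT

print(respond("I feel tired today."))  # -> Why do you feel tired today?
```

No understanding is involved at any point; the program merely recognises a surface pattern and reflects it back, which is exactly why such systems can seem convincing for a few turns and then fall apart.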
6.5 Anti Plagiarism tools
Text Analysis and textual understanding are at the heart of most anti-plagiarism tools. To take an example from LSBU’s own efforts in this area, Thomas Lancaster (my old Java lecturer!) has developed a number of tools, one of which is “Text Analysis Tool (TAT), a system which presents a rolling representation of the stylistic properties of a submission to find areas that are likely to represent extra-corpal plagiarism.” [Lancaster 2002]
This is based around syntactic similarities between two target texts. Some tools use this core technique and provide visual representations of the results (VAST), while others simply present the two offending articles in a way that allows the user to make a decision more easily on the material in front of them (SSS, TRANK). This is an obvious labour-saving tool for teachers and lecturers… and a source of fear and loathing for students the world over!
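As a rough sketch of syntactic-similarity checking (not Lancaster's actual TAT algorithm, whose stylistic metrics are more sophisticated), two submissions can be compared on the word trigrams they share:

```python
def ngrams(text, n=3):
    # The set of word n-grams in the text, lower-cased.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(a, b, n=3):
    # Jaccard overlap of the two texts' n-gram sets; a high score
    # flags a pair of passages worth a human reviewer's attention.
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

original = "the quick brown fox jumps over the lazy dog"
suspect = "the quick brown fox jumps over a sleeping cat"
print(round(overlap(original, suspect), 2))  # -> 0.4
```

The final decision is still left to the human, as with the tools above: the score merely ranks which pairs of documents deserve a side-by-side look.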
3 Visit the AAAI website for an extensive listing of Chatterbots (see reference section)
7. What the Future Holds (in my opinion)
This is still an emerging and exciting area of Computer Science, with research being undertaken to improve quality in all the areas mentioned in the sections above. There is a real sense that a number of advancements are still possible in automating processes currently undertaken by humans. There is also a business requirement driving research and development. Whether we approve of it or not, we now live in a truly global village, with a need for fast, efficient and above all accurate tools to assist in our international electronic trading environment.
Below is a summary of some of the major technological advancements I predict will occur in the near to middle future. Some are logical progressions of the technology currently available from commercial companies. Others are slightly more experimental, in that they may require a radical approach to the way we look at the problems before viable solutions can be developed.
• We should start to see more enterprise solutions with Text Analysis at the core. There is a lot of development in the areas of speech-to-text and translation, the goal being combined technologies transparent to the user, with seamless conversion of foreign speech input to English textual output.
• Convergence of NLP and IR technologies, with more efficient methods to search for information in other languages, understanding the inferences of our queries and returning results in an intelligent way.
• Although English has gradually been accepted as the lingua franca in most industries, there is a real need for translation tools, especially those dedicated to textual translation and the associated fields of retrieving that information. The EU alone dealt with 1,416,817 pages of text translation relying on text analysis in 2003, a rise of 9.4% from 2002 [EU DGT Annual Activity Report 2003], and other large bodies (both governmental and NGOs) are increasing budgets in the area to cope. This will be a major growth area, and as such technological developments will follow.
• Email dialogue systems that take automatic input from mail servers, carry out a Text Analysis process, interrogate associated applications such as diary programs and address books, and output a response in the form of replies to the original emails. An example would be meeting planners trying to arrange for everyone in a special interest group (SIG) to attend a quarterly meeting. One email is sent out by the chair suggesting dates; once it is received, each recipient’s computer carries out the process described above and returns an email to the chair with their availability. This may continue through a number of iterations before everyone is available, but the key point is that the process has been automatic and independent of human intervention.
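The meeting-planner scenario could be sketched as below. Everything here is hypothetical: the ISO date format, the hard-coded diary and the reply wording are all my own assumptions for illustration, not an existing system (a real one would first need the Text Analysis step to find dates expressed in natural language).

```python
import re
from datetime import date

# The recipient's diary of already-booked days. This is an assumption
# for illustration; a real system would query a calendar application.
BOOKED = {date(2005, 3, 14)}

def proposed_dates(email_body):
    # Extract ISO-format dates (YYYY-MM-DD) from the email text.
    return [date(*map(int, m.groups()))
            for m in re.finditer(r"(\d{4})-(\d{2})-(\d{2})", email_body)]

def availability_reply(email_body):
    # Check each proposed date against the diary and draft a reply.
    free = [d for d in proposed_dates(email_body) if d not in BOOKED]
    if free:
        return "I am available on: " + ", ".join(d.isoformat() for d in free)
    return "None of the proposed dates suit me."

body = "Proposed SIG meeting dates: 2005-03-14 or 2005-03-21."
print(availability_reply(body))  # -> I am available on: 2005-03-21
```

Each recipient's machine would run this check and mail the reply back to the chair, closing one iteration of the loop described above.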
There are of course many more examples; however, these seem logical and achievable given the current level of knowledge and application of technology.
8. Conclusions
As users and producers of information, we have created a spiral of ever increasing quantities of
stored data. With the amount of directly accessible data available to us today, we have had to
find new ways in which to manage and review this overabundance of textual information.
In this paper we have started to explore how a computer can be used to manage this problem
through identifying the context and importance of textual input. We’ve looked at the applications
of Text Analysis as part of NLP and within the field of Information Retrieval and explored
possible convergence in these areas. The key areas and technologies presented above are a mere introduction and do not constitute an in-depth view. By showing how they broadly connect with Text Analysis, and by providing further resources (see reference section), I hope the reader will go on to research the areas that directly interest them.
It is obvious to me from the study carried out for this project that the areas of text analysis and textual understanding within AI are far from complete. Research and development continues in universities, government and industry, trying to achieve better, more natural responses to the inputs we give machines. There are many approaches to this task, and we’ve briefly looked at rule-based analysis of text, example-based machine learning, and hybrids of the two.
At the beginning of this coursework report I posed a question; does Text Analysis constitute
actual understanding of the textual input or simply an electronic approximation of
understanding? The answer, in my opinion is that advancements in the technology have come a
long way in the past twenty years, but we have not developed, and are nowhere near
developing machines that understand in the way we understand. The question we should really
ask is: can we ever truly develop understanding in computers? I’ll conclude with the words of a philosopher rather than a scientist.
“It is a very remarkable fact that there are none so depraved and stupid, without
even excepting idiots, that they cannot arrange different words together,
forming of them a statement by which they make known their thoughts; while on
the other hand, there is no animal, however perfect and fortunately
circumstanced it may be, which can do the same…”
René Descartes (1637)
…What would he have made of man’s attempts to program machines to do the
same!
9. References
Eisenberg A. 2003
“Get Me Rewrite!” “Hold On, I’ll Pass You to the Computer.”
Article from The New York Times December 25, 2003
Mani I. 2001.
Automatic Summarisation.
John Benjamins Publishing Company, Amsterdam/Philadelphia.
Minsky M. L. 1968.
Semantic Information Processing
MIT Press
Yule G. 1985
The Study Of Language
Cambridge University Press
9.1 Web References
o Avron Barr
http://www.aaai.org/Library/Magazine/Vol01/01-01/vol01-01.html
Last Updated: Spring 1981
Downloaded: 19, November 2004
A paper entitled “Natural Language Understanding” by Barr of Stanford University.
Although published many years ago it still has a wealth of useful information, especially
good is the history of NLP (page 2)
o Duke University
http://www.duke.edu/~mccann/mwb/15semnet.htm
Last Updated: Unknown
Downloaded: 16, November 2004
o EAMT
http://www.eamt.org/mt.html
Last Updated: 3, June 2004
Downloaded: 20, November 2004
Home page of the European Association for Machine Translation; this site is a good
guide and technical resource in the area of MT.
o European Union Annual Activity Report 2003
http://europa.eu.int/geninfo/query/engine/search/query.pl(AAR 2003 - Report ONLY.doc)
Last Updated: 1, April 2004
Downloaded: 21, November 2004
Report into EU translation activities – very dry but full of useful statistics.
o HALLoGRAM Publishing
http://www.hallogram.com/textanalyst/
Last Updated: Unknown
Downloaded: 7, November 2004
A commercial site promoting a product called TextAnalyst. It has numerous information
and definition pages. Very good for an introduction to the subject.
o Dave Inman
http://www.scism.sbu.ac.uk/inmandw/tutorials/nlp/ambiguity/ambiguity.html
Last Updated: 25, February 1997
Downloaded: 21, November 2004
All pages within the NLP tutorial site are directly relevant to text analysis and the wider
fields of IR and NLP. A very good place to start any research in the area.
o Harold Klein
http://www.textanalysis.info/terms.htm
Last Updated: 19, May 2002
Downloaded: 30, October 2004
A really good general site housing a huge range of information on text analysis. An
excellent place to start any research into the subject.
o Thomas Lancaster
http://www.radford.edu/~sigcse/DC01/participants/lancaster.html
Last Updated: 2002
Downloaded: 20, November 2004
A former PhD student at LSBU, Thomas developed a number of tools in the area of anti-plagiarism. For how this links in with TA in more detail, either follow the above link or type his name and the term “anti plagiarism” into Google.
o Sergei Ananyan, Alexander Kharlamov
http://www.megaputer.com/tech/wp/tm.php3#nav
Last Updated: 2004
Downloaded: 7, November 2004
An “Automated Analysis of Natural Language Texts” white paper offering all-round useful
information.
9.1.1 General Web Resources
These links act as an additional resource on the subject of Text Analysis within Information
Retrieval. Most were used as general background reading for this coursework, some were
quoted above and some were not used at all.
http://www.textanalysis.com/ - VisualText™ - part of Text Analysis International, Inc.
• Incorporated company providing products in Info Extraction and NLP. Very useful for the
FAQ section: http://www.textanalysis.com/FAQs/faqs.html and the two main product
sections that give an overview of current technology capabilities:
http://www.textanalysis.com/Products/products.html
http://www.intext.de/eindex.html - Social Science Consulting (English language)
• A very good site dedicated to all things TA, with strong sections on the history and applications of this technology.
http://www.semantic-knowledge.com/tropes.htm - Semantic Knowledge Product Site
• Another product on the Market, this time from Semantic Knowledge. Tropes offers TA
plus IR technologies as a joined up product. Not much in the way of background reading.
There is some useful info on how the engine works though: http://www.semantic-
knowledge.com/fonction.htm
http://www.megaputer.com/tech/wp/tm.php3 - Megaputer™ White Paper
• White paper from Megaputer™ (a company offering a TextAnalysis product). Good background and introduction to the subject. Very useful section on the history of the subject: http://www.megaputer.com/tech/wp/tm.php3#history and on new opportunities looked at from the business perspective.
http://www.scism.lsbu.ac.uk/inmandw/tutorials/nlp/index.html
• Paper by Dave Inman on the complexities and possibilities of NLP via computers. Not specifically aimed at my coursework area, but it raises very useful questions and adds context to TA within the wider field of IR. The two key links off this page are “can computers understand language?” and “does the structure of language help NLP?”
http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/0.html
• What appears to be a very useful repository on all things NLP from Carnegie Mellon University. The material is fairly dated but good for general background.
http://www.loquendo.it/en/index.htm
• English homepage of Loquendo, an Italian company providing application tools involving
speech to text and vice versa.
9.1.2 General Book Resources
Obviously, as well as the above links, there is a good grounding to be found in the core textbook, FOA: A Cognitive Perspective on Search Engine Technology and the WWW (Belew R. K.), and in the secondary textbook, Information Retrieval (Van Rijsbergen C. J.).
This second book is available on the web and has a specific section on Automatic Text Analysis: http://www.dcs.gla.ac.uk/Keith/Preface.html