IR&NLP Coursework P1
Text Analysis Within The Fields Of Information Retrieval and
               Natural Language Processing
                        By Ben Addley
                           2003695
                   Academic Year 2004 - 2005
Ben Addley                                                          IR&NLP Coursework P1
                                           Abstract
As users and producers of information, we have created a spiral of ever increasing quantities of
stored data. With the amount of directly accessible data available to us today, we need to find
new ways in which to manage and review this overabundance of textual information.

Text analysis, and in particular the Automatic Text Analysis covered by this coursework, is the
study of how a computer can be used to identify the context and importance of text without the
intervention of a human. This paper introduces the technical and linguistic terminology
necessary for an understanding of the role Text Analysis plays in the two core application areas
of Information Retrieval (IR) and Natural Language Processing (NLP).

Text analysis is not something new; understanding what fundamentally links words, sentences,
passages and even whole books has been of interest since the Middle Ages. The first
advances in purely computer based TA came during the 1950s; however, research into
computational linguistics and the “holy grail” of an understanding machine began in the 1940s,
right at the birth of modern computing.

The basic building block of text analysis is textual understanding, and in particular the language
it is built upon. Text analysis is based around two differing approaches. In the rule based
approach, text is parsed through a series of modular analytical components; each analytical
area concentrates on a particular component of natural language (morphemes, words, syntax,
semantics etc.) and calls upon a number of pre-defined rules. The other is statistical based
learning (sometimes referred to as machine or example based learning), which requires the
computer to take in huge amounts of text and, by comparing common compositions of texts,
gradually “learn” (through statistical analysis) how language is formed. As well as the two
approaches to the underlying technology (rule and statistical based systems), a “third way”
exists in hybrid systems combining the two to achieve more effective systems.

This is still an emerging and exciting area of Computer Science, with research being undertaken
to improve quality in all areas. There is a real sense that further advancements are possible in
automating a number of the processes currently undertaken by humans. There is also a
business requirement driving research and development. Whether we approve of it or not, we
now live in a truly global village, with a need for fast, efficient and above all accurate tools to
assist in our international trading environment.




Keywords: Text Analysis, Textual Understanding, Information Retrieval, Natural
Language Processing, Example Based Machine Learning, Rule Based Machine Learning.




                                          Contents

ACADEMIC YEAR 2004 - 2005
ABSTRACT
CONTENTS
DISCUSSION NOTES
QUESTIONS FOR DISCUSSION
1. INTRODUCTION
2. HISTORY OF TEXT ANALYSIS
   2.1 THE INNOVATORS
3. THE BUILDING BLOCKS OF TEXTUAL UNDERSTANDING
   3.1 RULE BASED APPROACH
   3.2 STATISTICAL BASED APPROACH
   3.3 AMBIGUITY
4. TEXT ANALYSIS: TECHNOLOGY OVERVIEW
   4.1 PRE-PROCESSING
   4.2 PRODUCT TYPES
      4.2.1 LANGUAGE
      4.2.2 CONTENT
5. TEXT ANALYSIS WITHIN THE FIELD OF INFORMATION RETRIEVAL
   5.1 DISTILLING THE MEANING OF A DOCUMENT
   5.2 TEXT BASED NAVIGATION
   5.3 TOPIC STRUCTURE
   5.4 CLUSTERING
   5.5 AUTOMATIC TEXTUAL SUMMARISATION
      5.5.1 EXTRACTION
      5.5.2 ABSTRACTION
6. TEXT ANALYSIS WITHIN THE FIELD OF NATURAL LANGUAGE PROCESSING
   6.1 MACHINE TRANSLATION
   6.2 SPEECH TO TEXT SYSTEMS
   6.3 TEXT TO SPEECH SYSTEMS
   6.4 CHATTERBOTS (UNDERSTANDER-SYSTEMS)
   6.5 ANTI PLAGIARISM TOOLS
7. WHAT THE FUTURE HOLDS (IN MY OPINION)
8. CONCLUSIONS
9. REFERENCES
   9.1 WEB REFERENCES
      9.1.1 GENERAL WEB RESOURCES
      9.1.2 GENERAL BOOK RESOURCES

FIGURE 1: COMPUTERISED UNDERSTANDING OF LANGUAGE, WINOGRAD T. STANFORD AI SERIES
FIGURE 2: EXAMPLE OF TOPIC STRUCTURE (REPORT WIDE)

                                       Discussion Notes
Article from The New York Times December 25, 2003: ‘Get Me Rewrite!’ ‘Hold On, I’ll Pass
You the Computer.’ By Anne Eisenberg.



In the famous sketch from the TV show “Monty Python’s Flying Circus,” the actor John Cleese
had many ways of saying a parrot was dead, among them, “This parrot is no more,” “He’s
expired and gone to meet his maker,” and “His metabolic processes are now history.”
Computers can’t do nearly that well at paraphrasing. English sentences with the same meaning
take so many different forms that it has been difficult to get computers to recognize
paraphrases, much less produce them. Now, using several methods, including statistical
techniques borrowed from gene analysis, two researchers have created a program that can
automatically generate paraphrases of English sentences.

The program gathers text from online news services on specific subjects, learns the
characteristic patterns of sentences in these groupings and then uses those patterns to create
new sentences that give equivalent information in different words. The researchers, Regina
Barzilay, an assistant professor in the department of electrical engineering and computer
science at the Massachusetts Institute of Technology, and Lillian Lee, an associate professor of
computer science at Cornell University, said that while the program would not yield paraphrases
as zany as those in the Monty Python sketch, it is fairly adept at rewording the flat cadences of
news service prose. Give it a sentence like “The surprise bombing injured 20 people, 5 of them
seriously,” Dr. Barzilay said, and it can match it to equivalent patterns in its databank and then
produce a handful of paraphrases. For instance, it might come up with “Twenty people were
wounded in the explosion, among them five in serious condition.”

Programs that can detect or crank out multiple paraphrases for English sentences could one
day have wide use. They might help create summaries of reports or check a document for
repetition or plagiarism. Questions typed into a computer with such a program might in the
future be automatically paraphrased to make it easier for a search engine to find data.

Such programs might even be an aid to writers who want to adapt their prose to the background
of their readers. Dr. Lee said the researchers had thought about using it “as a kind of ‘style dial’”
to rewrite documents automatically for different groups, adapting articles on technical subjects
for a children’s encyclopaedia, for example. She cautioned, however, that the work was
preliminary and much more research was needed before it might be available for practical use.

Fernando Pereira, chairman of the computer and information science department at the
University of Pennsylvania, said that the paraphrasing work had given him pause. “It’s a little bit
humbling if you have the idea that we are creative when we write,” he said, only to discover that
one’s special turns of phrase have already been tried by hundreds of other writers and can be
found online.

“The real insight of this work,” he said, “is that if there is a way of saying something, someone
has already said it.”



                                   Questions for Discussion


   1.        The program outlined in the article above gathers text from online news services,
             learns the characteristic patterns of sentences in these groupings and then uses
             those patterns to create new sentences that give equivalent information in different
             words. Do you think a program like this can only operate within a limited domain,
             such as news reports, or can it be applied to wider subjects?

   2.        What are your thoughts on Dr Fernando Pereira’s comment: “The real insight of this
             work is that if there is a way of saying something, someone has already said it.”?

   3.        What are the potential problems with applications that adapt prose or paraphrase
             sentences? Do you think that the context or subtle meaning might be lost if parsed
             through software as described in the article?

   4.        Do you think that tools such as Barzilay and Lee’s are applications to aid humans in
             the work they do or should they be used instead of humans? What are the problems
             with the latter option and are there any ways in which other areas of NLP could be
             applied to assist?




1.     Introduction


What is Text Analysis? A very basic question which you would expect to have a very basic
answer. Unfortunately this particular area, concerned with the textual understanding of a
document, is more challenging: a two-line definition simply doesn’t suffice.

In this paper I will introduce the technical and linguistic terminology necessary for an
understanding of the role Text Analysis plays in the two core application areas of Information
Retrieval (IR) and Natural Language Processing (NLP). I will not cover the complicated and
highly technical processes which go into making Text Analysis actually work. Instead, what is
presented below constitutes a mere introduction and should allow the reader to research areas
of interest in more depth.

In its rawest sense, text analysis is key to any IR or NLP process. We as users must know some
information about what it is we are trying to retrieve; this comes perhaps from the title of the
work, the author’s name, section or chapter headers and of course the content. As human
beings we are able to ingest and process this data in a number of sophisticated ways, the most
important of which is our cognitive ability to understand the context and importance of certain
words and phrases. Text analysis, and in the case of this coursework Automatic Text Analysis,
is the study of how a computer can be used to identify the context and importance of text
without the intervention of a human.

Wouldn’t it be nice not to have to rely on others’ IT literacy when searching for documents via
search engines or the web? To be able to simply enter a phrase, question or query and have a
computer return not just documents that contain those keywords, but somehow understand
what it was you were trying to look for and return only those documents you were actually
seeking.

These techniques, as well as being fundamental to IR, can aid and extend the effectiveness of
other areas such as research within NLP: they can aid interpretation, textual navigation,
translation and speech based tools, and facilitate word/phrase-spotting programs.

As our world becomes a smaller place with increased and more effective communication
networks, we have to utilise these techniques to add value to our interactions. Asking a natural
question and receiving a natural, sensible and contextual response from a machine is an
important goal in our global community. Text Analysis plays a fundamental role in achieving that
goal but it is a complicated one!

This coursework will attempt to answer some of the basic questions involved in how a computer
achieves this feat of Artificial Intelligence (AI). It will also look at where Text Analysis technology
currently resides within the field of IR & NLP and what the future might hold if we continue to
research and develop in this exciting and challenging area.

A question you may wish to bear in mind whilst reading this document is; Does Text Analysis
constitute actual understanding of the textual input or simply an electronic approximation of
understanding?


2.      History of Text Analysis
Text analysis is not something new; understanding what fundamentally links words, sentences,
passages and even whole books has been of interest since the Middle Ages. The first threads of
text analysis can be followed back to medieval biblical scholarship. Intellectuals would try to find
parallels between the New and Old Testaments, where passages might be linked according to
places, periods and people. This resulted in the first concordances1, a tool still used in computer
based text analysis today.
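To make the idea of a concordance concrete, a minimal sketch in Python is shown below. It builds a simple KWIC-style (key-word-in-context) index: every occurrence of every word is recorded together with a line of surrounding context. This is an illustrative sketch only, not any particular historical or commercial tool.

```python
from collections import defaultdict

def build_concordance(text, width=3):
    """Build a simple KWIC-style concordance: for each word, record a
    window of surrounding context at every occurrence."""
    words = text.lower().split()
    concordance = defaultdict(list)
    for i, word in enumerate(words):
        left = " ".join(words[max(0, i - width):i])
        right = " ".join(words[i + 1:i + 1 + width])
        concordance[word].append(f"{left} [{word}] {right}")
    return concordance

entries = build_concordance("in the beginning was the word and the word was with god")
for line in entries["word"]:
    print(line)
```

Looking up a word then returns each place it occurs along with its context, which is exactly the linking of passages by shared terms that the medieval scholars performed by hand.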

The first advances in purely computer based TA came during the 1950s; however, research
into computational linguistics and the “holy grail” of an understanding machine began in the
1940s, right at the birth of modern computing.

2.1     The Innovators

“In 1949, Warren Weaver proposed that computers might be useful for ‘the solution of world-
wide translation problems’ and the resulting effort, called machine translation, attempted to
simulate with a computer the presumed functions of a human translator” [Avron Barr 1980]. IBM
first demonstrated a basic word for word translation machine in 1954, but these early attempts
at machine translation failed due to the simplistic idea that word equivalency techniques and
sentence re-ordering would suffice for a translation machine.
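The weakness of word equivalency can be shown with a toy sketch in the spirit of those early systems (the mini-dictionary here is invented for illustration and is not IBM's actual lexicon):

```python
# A toy word-for-word "translation machine". The English-to-French
# mini-dictionary is invented for illustration only.
LEXICON = {"the": "le", "cat": "chat", "black": "noir", "is": "est"}

def translate_word_for_word(sentence):
    # Substitute each word independently; unknown words pass through.
    return " ".join(LEXICON.get(w, w) for w in sentence.lower().split())

print(translate_word_for_word("the black cat"))  # -> "le noir chat"
# Correct French word order is "le chat noir": simple word equivalency
# cannot re-order words, which is exactly why these early systems failed.
```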

AI research took on new ideas, the most important of these being textual understanding. The
work of Chomsky in the field of linguistic theory, coupled with advancements in programming
languages in the 1960s, heralded a surge in AI/NLP work. By the 1970s the approach was to
model human language as knowledge based systems and to understand how language works
in order to build up rules that apply to these systems.

Terry Winograd in 1972 “groups natural language programs according to how they represent
and use knowledge of their subject matter” [Avron Barr 1980]. He proposed four historical
groupings based on this approach. The first programs of this period that attempted to analyse
and understand textual input were built on limited domains (BASEBALL, ELIZA etc.). Then
followed systems that used semantic memories and indexing to retrieve and understand words
or phrases. A third approach during the mid to late sixties was limited logic systems, and finally
a fourth group, knowledge based systems, used first order logic and semantic nets, as in
William Woods’s LUNAR program and Winograd’s SHRDLU system.

More recently we have seen a shift away from the ideas of Winograd and the early pioneers of
textual understanding within NLP. With increased processor speeds and the advent of super-
computing, new methods and theories have been developed in example based machine
learning. Today the arguments rage about the respective merits of the two main approaches,
with new developments in hybrid systems combining the two.




1
 In its most basic form it’s an index that includes a line of context against each entry or occurrence of a
word.
3.     The Building Blocks of Textual Understanding
The basic building block of text analysis is textual understanding and in particular the language
it is built upon. When we discuss the fundamentals of this subject we must also understand the
guiding principles of language makeup.

We are unique among animal species in that we don’t just communicate through signals (as
other creatures are capable of) but use sophisticated language properties to do so.
Artificial Intelligence (AI) is the branch of Computer Science that undertakes investigation and
development of models and tools to replicate this “behaviour” in machines. AI has been defined
as “the science of making machines do things that would require intelligence if done by men”
[Minsky 1968].

Winograd proposed a series of pre-defined stages that must be adhered to for computerised
understanding of language to occur. These follow closely the traditional linguistic approach to
word formation processes.

3.1    Rule Based Approach

Figure 1 (below) demonstrates this logical rule based approach to textual understanding by
parsing written language through a series of modular analytical components. The output of one
module acts as an input to the next, and so on until the process is complete. Each analytical
area concentrates on a particular component of natural language (morphemes, words, syntax,
semantics etc.) and calls upon a number of pre-defined rules (shown in the ellipses) to judge
whether that component of text fits a particular rule. Once identified, it is passed on to the next
module for further analysis; if later (usually at the semantic or pragmatic stage) it proves to be
incompatible or incorrect, it is passed back to a previous layer.

The disadvantage of such an approach is that it is language dependent and needs to be re-
modelled for each additional language you want to perform textual analysis on. The other main
problem is that it requires large amounts of initial human intervention in the programming and
rule development stage, which ties in with the first problem. Such problems have an impact on
applications, as we’ll see later in this paper.
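The modular pipeline described above can be sketched in a few lines of Python. The stage names follow the text; the individual rules here (a suffix rule and a part-of-speech rule) are invented placeholders for illustration, not Winograd's actual rules:

```python
# A minimal sketch of a rule based pipeline: the output of each
# module feeds the next, as in Figure 1.

def morphological_stage(tokens):
    # Rule (illustrative): strip a known suffix to expose the stem morpheme.
    out = []
    for t in tokens:
        if t.endswith("ing"):
            out.append({"word": t, "stem": t[:-3], "suffix": "ing"})
        else:
            out.append({"word": t, "stem": t, "suffix": None})
    return out

def syntactic_stage(analysed):
    # Rule (illustrative): a crude part-of-speech guess from the morphology.
    for item in analysed:
        item["pos"] = "VERB" if item["suffix"] == "ing" else "NOUN"
    return analysed

def analyse(sentence):
    return syntactic_stage(morphological_stage(sentence.lower().split()))

result = analyse("machines learning language")
```

A real system would continue with semantic and pragmatic modules, and would pass items back to earlier layers when a later stage rejects them, as described above.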

3.2    Statistical Based Approach

Another name for this approach is machine learning or example based learning. It requires the
computer to parse huge amounts of text and, by comparing common compositions of texts
(usually domain specific), gradually “learn” through statistical analysis how syntax and certain
bigrams and trigrams of words and phrases within a language are formed. The advantage of
this method is that it is language independent and requires little human intervention; in theory,
if you have a large enough corpus in any language, you can teach the system that language in
a relatively short period of time. The disadvantage comes from the scale of corpus required to
contain enough compositions and varied bigrams and trigrams to develop a sufficient
understanding of the language.
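The heart of such a system is simply counting which word combinations occur in the corpus. A minimal sketch of bigram learning (the three-sentence corpus is invented for illustration):

```python
from collections import Counter

def learn_bigrams(corpus):
    """Count adjacent word pairs across a corpus of sentences; the
    counts form the 'learned' statistical model of word composition."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts

corpus = ["the cat sat", "the cat ran", "a dog sat"]
model = learn_bigrams(corpus)
# ("the", "cat") has been seen twice, so the model judges it a more
# typical composition than ("a", "dog"), seen only once.
```

With a large enough corpus, the same counts (extended to trigrams and beyond) let the system judge which phrasings are typical of the language without any hand-written rules.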








             Figure 1: Computerised Understanding of Language, Winograd T. Stanford AI Series


3.3     Ambiguity

Ambiguity increases the range of possible interpretations of natural language, and a computer
has to find a way to deal with this [Inman D. 1997]. This is another key issue for Text Analysis,
as computers have to make choices about how words and phrases are interpreted. This is an
easier problem to overcome with an example based learning approach.
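One reason the example based approach copes better is that it can simply prefer the interpretation seen most often in its training examples. A sketch (the sense labels and counts for the ambiguous word "bank" are invented for illustration):

```python
from collections import Counter

# Hypothetical sense counts a statistical system might have gathered
# for the ambiguous word "bank" from a domain-specific corpus.
observed_senses = Counter({"bank/finance": 57, "bank/river": 9})

def disambiguate(word_senses):
    # Pick the interpretation seen most often in the training examples.
    sense, _ = word_senses.most_common(1)[0]
    return sense

print(disambiguate(observed_senses))  # -> "bank/finance"
```

A rule based system, by contrast, needs an explicit hand-written rule for every such choice.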



4.      Text Analysis: Technology Overview
Within Text Analysis we have seen the two approaches to the underlying technology (rule and
statistical based systems). There is of course a third way: a hybrid method combining the best
elements of the rule based and statistical based methodologies.

4.1     Pre-processing

When undertaking an analysis of target text it is helpful to carry out automatic pre-processing to
strip away words with no semantic meaning. This can aid example based learning, as unusual
or one-off bigrams/trigrams are negated. Another approach is to carry out stemming, which
enables prefixes, suffixes and endings (also known as morphemes) to be identified, leaving just
the stem (or core meaning) of the word. For example, “learn” is the stem of “learning”; by
concentrating on the stem, both terms are identified as having the same core, thereby
improving the analysis of the text.
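Both pre-processing steps can be sketched together. The stopword list and suffix rules below are tiny illustrative samples, not a complete inventory:

```python
# Pre-processing sketch: strip semantically empty words, then reduce
# the remainder to crude stems.
STOPWORDS = {"the", "a", "an", "of", "and", "is"}
SUFFIXES = ("ing", "ed", "s")

def stem(word):
    # Strip the first matching suffix, keeping a minimum stem length.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    return [stem(w) for w in text.lower().split() if w not in STOPWORDS]

print(preprocess("the learning of learned words"))
# Both "learning" and "learned" reduce to the same stem, "learn".
```

Real stemmers (such as the Porter algorithm) use far more careful rules, but the principle of collapsing surface forms onto a shared core is the same.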

4.2     Product Types

There is a rich variety of software supporting the general task of Text Analysis within the
different disciplines of Human Computer Interaction (HCI). Due to this variety it is helpful to set
out a brief general classification of the main areas. The classification below, taken from Harold
Klein’s discussions after the Acapulco ICA conference in 2000, breaks text analysis software
down into two broad areas: firstly language and its makeup, and secondly content, which deals
with the “what” being communicated:

4.2.1   Language

Dealing with the use of language, of which there are two further sub-categories:

        •    Linguistic: applications like parsing, lemmatising words2

        •    Data bank: information retrieval in texts, indexers, concordances, word lists,
             KWIC/KWOC (key-word-in-context, key-word-out of-context)
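The lemmatising mentioned in the linguistic bullet above (grouping related words under a single headword, as the footnote describes) can be sketched as a simple lookup table. The groupings here are illustrative examples a researcher might define:

```python
# A minimal lemmatiser: user-defined groups of related words are
# mapped back to a single headword.
LEMMA_GROUPS = {
    "go": ["go", "goes", "went", "going"],
    "be": ["be", "is", "are", "was", "were"],
}

# Invert the groups into a word -> headword lookup table.
LOOKUP = {w: head for head, words in LEMMA_GROUPS.items() for w in words}

def lemmatise(tokens):
    return [LOOKUP.get(t, t) for t in tokens]

print(lemmatise(["she", "went", "going", "home"]))
# -> ["she", "go", "go", "home"]
```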

4.2.2   Content

Dealing with the content of human communication, mainly texts.

        •    Qualitative: looking for regularities and differences in text, exploring the whole
             text (QDA - qualitative data analysis). A few programs also allow the processing
             of audio and video information.

2
 Lemmatising means grouping related words together under a single headword. A Lemmatiser (tool) allows you to
define groups of related words and then apply your groupings to words displayed in the Wordlist.

       •     Event data: analysis of events in textual data

       •     Quantitative: analyse the text selectively to test hypotheses and draw statistical
             inferences. Output is a data matrix that represents the numerical results of the
             coding.

                o Category systems: provided by the software developer (instrumental) or by
                    the researcher (representational). This is selective: only the given search
                    patterns are searched for in the text and coded. Software packages with
                    built-in dictionaries are often language restricted; some have limits on the
                    text unit size and are restricted to processing responses to open ended
                    questions rather than analysing mass media texts. The categories can be
                    thematic or semantic, which can have implications for the definition of text
                    units and external variables.

                o No category system: using co-occurrences of words/strings and/or
                    concepts, these are displayed as graphs or dendrograms.

                o For coding responses to open ended questions only: these programs
                    cannot analyse huge amounts of text; they suit rather homogeneous texts
                    only and are often limited in the size of a text unit.

                                                                                     [Klein 2002]
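The quantitative, category-based coding in Klein's classification can be made concrete with a small sketch: each text is coded against researcher-defined search patterns, and the output is a numerical data matrix. The categories and patterns here are invented examples:

```python
# Researcher-defined (representational) categories, each a list of
# search patterns to count in the text. Illustrative only.
CATEGORIES = {
    "economy": ["trade", "market", "price"],
    "conflict": ["war", "attack", "bombing"],
}

def code_texts(texts):
    """Code each text against the categories; the result is a data
    matrix of counts, one row per text, one column per category."""
    matrix = []
    for text in texts:
        words = text.lower().split()
        row = [sum(words.count(p) for p in patterns)
               for patterns in CATEGORIES.values()]
        matrix.append(row)
    return matrix

matrix = code_texts(["the market price of trade", "war and attack"])
# -> [[3, 0], [0, 2]]
```

The matrix can then be fed to standard statistical tools to test hypotheses, which is exactly the "quantitative" use Klein describes.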

Obviously the above is just one interpretation of the many approaches taken in the field of Text
Analysis.



5.     Text Analysis Within the Field of Information Retrieval
So we now know a little about what text analysis is and the linguistic theory behind the concept.
The key question now is: how do we transform this powerful concept into products that can
actually help us in our day-to-day lives? Text analysis is already used in many commercial
systems, and the first part of this section will investigate where the technology currently lies,
what it is used for and who is using it.

5.1    Distilling the Meaning of a Document

Making informed and ultimately correct decisions in our busy working lives often requires
analysing large and time consuming volumes of textual information. Students, researchers and
professionals (such as analysts, lawyers and editors) are faced with various TA tasks, all
requiring the extraction of the core meaning from a document.

In today’s “information rich” environment, huge piles of information build up in traditional
repositories held in libraries, businesses, individual PCs and, of course, the ever ubiquitous
World Wide Web. The amount of information being produced and stored is growing at an
extraordinary rate, with some predictions stating that at current growth rates we will have more
information than atoms to store it on! Whether you believe the doomsday-like prophecies or
not, it is a fact that human beings are increasingly unable to meet the challenges of this growth.
“Mankind is searching for intelligent electronic assistants to help with text analysis projects”
[HALLoGRAM Publishing, 2000].

In particular we require help to derive the semantic value of a document in a concise form. Once
achieved we can apply the knowledge to a number of other applications, as we’ll discuss
throughout the rest of this section.
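A very crude first step towards distilling a document's core meaning is to remove semantically empty words and keep the most frequent remaining terms. This sketch is illustrative only (real systems use far richer semantic analysis, as later sections discuss):

```python
from collections import Counter

STOPWORDS = {"the", "of", "a", "to", "in", "and", "is", "from"}

def top_terms(document, n=3):
    """A crude distillation of a document's core meaning: its most
    frequent terms once semantically empty words are removed."""
    words = [w for w in document.lower().split() if w not in STOPWORDS]
    return [term for term, _ in Counter(words).most_common(n)]

doc = "text analysis is the analysis of text to extract meaning from text"
print(top_terms(doc))  # "text" and "analysis" dominate
```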

5.2    Text Based Navigation

To understand this application of Text Analysis we need to understand semantic networks.
“Semantic networks are knowledge representation schemes involving nodes and links between
nodes” [Duke University]: essentially a conceptual web of linked nodes that point to other
nodes containing listed objects. “Concepts stored in the semantic network are hyper linked to
those sentences where they have been encountered, and the sentences are in turn hyper linked
to the places in the original text from where they have been retrieved” [Sergei Ananyan,
Alexander Kharlamov 2004].

By using Text Analysis principles to build semantic networks automatically, we can efficiently
navigate through stored texts and usefully linked documents, which is a core application within
the wider field of Information Retrieval. This can be applied to multiple documents
simultaneously, creating a very powerful IR tool. Direct applications can be found in website
design and navigation, and I have myself investigated a version of this tool for my BSc final
project, using Microsoft Indexing Services (part of the MS IIS suite of administrative tools) to
index and analyse documents stored within a catalogue. It analyses the textual content of
documents, sorting it into categories which can then be searched using a simple query language.
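The hyper-linking structure quoted above can be sketched as a small data structure: concepts link to the sentences that mention them, and each link records the sentence's position in the original text. The two-sentence corpus is invented for illustration:

```python
from collections import defaultdict

def build_semantic_network(sentences, concepts):
    """Link each concept to every sentence that mentions it, keeping
    the sentence's position in the original text for navigation."""
    network = defaultdict(list)
    for position, sentence in enumerate(sentences):
        words = set(sentence.lower().split())
        for concept in concepts:
            if concept in words:
                network[concept].append((position, sentence))
    return network

sentences = ["The parser reads text", "Text feeds the indexer"]
net = build_semantic_network(sentences, ["text", "parser"])
# Navigating from the concept "text" reaches both source sentences.
```

Following a concept's links and then the recorded positions gives exactly the two-level navigation (concept to sentence to place in the text) that the quotation describes.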

5.3    Topic Structure

What do we mean when we talk about topic structure? Well, we can use text analysis to identify
the most relevant and significant concepts from the semantic network of the text and transform it
into a tree-like structure of topics sorted by importance. The limbs of the tree structure represent
relations between headings and content in the text. Some of these limbs are strong and form the
basis of the structure; others are weaker, often irrelevant or indirect, and need to be replaced
with more direct ones. To take this paper as an example: I have an introduction with no nested
topics, and then a series of topics, some with sub-topics and areas of interest. This is built up
over the document as a whole to construct a visual structure revealing a hierarchy of themes
within the text, which can then be used as a powerful information retrieval method.

Below is a tree like listing of this coursework as viewed through a topic structure. Main headings
hug the left hand side (red line) with secondary (blue) and tertiary (green) topics indented to the
right. Content text is omitted. This is a rather simplistic model but demonstrates the point.
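The folding of indented headings into an explicit tree can be sketched as follows; the heading names and depth numbers are hypothetical examples, not the actual structure of this report:

```python
def build_topic_tree(headings):
    """Fold a flat list of (depth, title) pairs into a nested tree of
    topics, the way main, secondary and tertiary headings indent."""
    root = {"title": "ROOT", "children": []}
    stack = [(0, root)]  # (depth, node) path from the root
    for depth, title in headings:
        node = {"title": title, "children": []}
        # Climb back to the nearest ancestor shallower than this heading.
        while stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((depth, node))
    return root

tree = build_topic_tree([
    (1, "Introduction"),
    (1, "Text Analysis"),
    (2, "Navigation"),
    (2, "Topic Structure"),
    (1, "Conclusions"),
])
```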








                 Figure 2: Example of Topic Structure (Report Wide)


5.4     Clustering

Clustering is built upon the previous technology of topic structure but goes one stage further.
Topic structures sever the links representing weak relations in the text and substitute certain
indirect relations with direct ones. With clustering, on the other hand, those links that fall below
a pre-defined level of strength are eliminated altogether. This allows a break-up of texts that
have been collected together to form individual groups that more clearly represent a common
subject or theme. Documents can thus be grouped into particular subject areas, facilitating
searching, indexing and analysis on that theme as well as on the original topic structure.
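A minimal sketch of this pruning-then-grouping process, under the assumption that link strengths between nodes are already known (the node names and weights below are illustrative):

```python
def cluster(nodes, weighted_links, threshold):
    """Eliminate links whose weight falls below the threshold, then
    group the remaining connected nodes into clusters."""
    adjacency = {n: set() for n in nodes}
    for a, b, weight in weighted_links:
        if weight >= threshold:  # weak links are dropped altogether
            adjacency[a].add(b)
            adjacency[b].add(a)
    clusters, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        group, frontier = set(), [start]
        while frontier:  # walk one connected component
            node = frontier.pop()
            if node in group:
                continue
            group.add(node)
            frontier.extend(adjacency[node])
        seen |= group
        clusters.append(sorted(group))
    return clusters

groups = cluster(["MT", "TTS", "IR", "search"],
                 [("MT", "TTS", 0.2), ("IR", "search", 0.9)],
                 threshold=0.5)
```

Here the weak MT–TTS link is severed, so those nodes fall into separate clusters, while the strong IR–search link survives and keeps its nodes grouped.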

5.5     Automatic Textual Summarisation

“The goal of automatic summarisation is to take an information source, extract content from it,
and present the most important content to the user in a condensed form and in a manner
sensitive to the user’s or application’s need”. [Inderjeet Mani 2001]

Textual summarisation techniques have their historical roots in the 1960s; however, with the
proliferation of the Internet and growth in document production, new interest in the technology
has developed. Broadly speaking, techniques can be divided into two categories:

5.5.1   Extraction

Extraction techniques, in general, simply copy the information deemed most important into a
summary. There are numerous methods for carrying out extraction summarisation; one
important and widely used methodology utilises algorithms to score individual sentences in the
target text. It is both robust and accurate, drawing on the number of important semantic
concepts in a sentence. The larger the number and the stronger these concepts are, coupled with
the relationships they have with each other, the higher the semantic weight (or score) of the
sentence. The summarisation tool then sorts and collects only those sentences which fall above
a pre-set score or weight, thus resulting in a truncated piece of text that summarises the
original…hopefully accurately!

“The size of the summary is controlled through changing the sentence selection threshold. An
advanced algorithm used for developing an accurate semantic network ensures the high quality
and relevance of the created summary”. [Sergei Ananyan, Alexander Kharlamov 2004] The
same concept can be applied to paragraphs and other units of text within documents.
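A much-simplified version of this scoring approach, using raw word frequency as a crude stand-in for semantic concept strength (a sketch of my own, not the advanced algorithm quoted above):

```python
import re
from collections import Counter

def extractive_summary(text, threshold):
    """Score each sentence by the average frequency of its words across
    the whole text, then keep only sentences scoring above the threshold."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    summary = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Semantic weight: average corpus frequency of the sentence's words.
        score = sum(freq[t] for t in tokens) / max(len(tokens), 1)
        if score > threshold:
            summary.append(sentence)
    return " ".join(summary)

summary = extractive_summary(
    "Text analysis scores sentences. Text analysis ranks text by weight. "
    "Cats sleep.",
    threshold=1.5,
)
```

Raising or lowering the threshold controls the size of the summary, exactly as described in the quotation above.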

5.5.2   Abstraction

Abstraction in its purest sense is the process of distancing ideas from objects. In the context of
summarisation it involves paraphrasing sections of the source document. Usually, abstraction
can condense a text more thoroughly than the extraction process discussed above, but the
programs that do this are generally harder to write and implement. An example of abstraction
in summarisation would be to “understand” the concept of a sentence or phrase.

So how do we use automatic textual summarisation and what is it good for? There are
numerous practical applications and professions using the technology in everyday tasks.
Probably the most common is in the news and broadcasting industry, where summarisation is
used for newspaper articles and scientific and technological journals. Another

important and widely used application is in search engine technology and IR. Automatic textual
summarisation is a “cross-over” application, used in both IR and NLP.



6.       Text Analysis Within the Field of Natural Language Processing
In this section I’ll discuss some of the realised applications and products for Text Analysis
technology and some current and future advancements. TA is just one part of the wider AI
problem within NLP that we face. Core to the issue is how we can develop systems that are
able to interpret the way we, as humans, communicate. It is the basis of many other
technologies as well as information retrieval; machine translation, for example, is based on the
textual analysis of data, as are text-to-speech engines and language spotters, all discussed
further in this section.

6.1      Machine Translation

“Machine translation (MT) is the application of computers to the task of translating texts from
one natural language to another. One of the very earliest pursuits in computer science, MT has
proved to be an elusive goal, but today a number of systems are available which produce output
which, if not perfect, is of sufficient quality to be useful in a number of specific domains” [EAMT
June 2004]. MT is probably the best known of the text analysis applications in existence today
but still remains one of the most intangible. Although there have been many advances in the
field in recent years, products are still far from perfect and often cannot deal with truly natural
language such as colloquialisms.

The primary role of MT software is twofold:

   1. To provide gisting (or rough translations) for non-native speakers of a particular language,
      enabling them to gain an understanding of a document and whether it is of relevance
      to them.

   2. As part of professional transcription and translation workbenches. This application allows
      users with knowledge of a language to filter and triage large amounts of text. This can
      free up the professional’s time, give them a head start on the target text and enable them
      to work more efficiently. There are other associated tools within this area, such as
      translation memories and glossaries, that also rely on text analysis techniques; however,
      these fall outside the main scope of the project.

6.2      Speech to Text Systems

I’m sure we’ve now all been exposed to the technological wonder that is the speech-to-text
system. Every time we call up our bank or our gas supplier we are faced with the prospect of
an automated voice asking us to say which service we want. After repeatedly asking for our
balance we are invariably put through to the section dealing with selling services of one
description or another! It’s true that the technology doesn’t seem to be accurate, but big
business has identified it as a major efficiency saving, and that is the driver behind it.

There are of course other, more refined applications for the technology: language recognition
and speech recognition, biometric tools and speaker verification tools. There are some links to
companies providing these applications in the general web resources section.
companies providing these applications in the general web resources section.

6.3       Text to Speech Systems

One of the first aspects of natural language to be modelled was the actual articulation of speech
sounds. “Early models of ‘talking machines’ were essentially devices which mechanically
simulated the operation of the human vocal tract. More modern attempts to create speech
electronically, are generally referred to as speech synthesis”. [Yule G. The Study of Language;
Pg 115]

The concept is to take a text and, using advanced TA techniques, tokenise it into the phonemes
that make up the individual words. You then electronically reproduce the acoustic properties of
those phonemes as sounds that can be played. It is a little trickier than the over-simplified
explanation I have given, but it demonstrates the idea. There are multiple uses, such as:

          •    Call centre technology

          •    Mobile texting to landline phones - where the text message is translated into speech
               and transferred automatically to the designated landline connection. A working
               product developed by Loquendo (a subsidiary of Italia Telecom)
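The tokenise-into-phonemes step might be sketched as below; the phoneme codes are ARPAbet-style and the tiny lexicon is purely illustrative (real TTS engines use large pronunciation dictionaries plus letter-to-sound rules):

```python
# A toy pronunciation lexicon; the entries are illustrative only.
LEXICON = {
    "text": ["T", "EH", "K", "S", "T"],
    "to": ["T", "UW"],
    "speech": ["S", "P", "IY", "CH"],
}

def to_phonemes(sentence):
    """Tokenise a sentence into the phoneme sequence a synthesiser
    would render as audio; unknown words are flagged for fallback rules."""
    phonemes = []
    for word in sentence.lower().split():
        phonemes.extend(LEXICON.get(word, ["<UNK:%s>" % word]))
    return phonemes
```

The resulting phoneme sequence would then be passed to the synthesis stage, which reproduces each phoneme’s acoustic properties as sound.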

6.4       Chatterbots (Understander-Systems)

Things have come a long way from the early days of ELIZA in the 1960s and Michael Mauldin
of Carnegie Mellon University, who coined the term “Chatterbot” in 1994. Basic chatterbots
such as ELIZA, ALICE and Brian3 use a process of pattern recognition to analyse text and create
an illusion of understanding. Questions posed by the user, triggered by occurrences of keywords
or phrases, activate a particular type of pre-determined response, usually a question
incorporating that phrase or keyword in the answer.
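That keyword-triggered behaviour can be sketched in a few lines; the patterns and replies below are my own illustrative examples, not ELIZA’s actual script:

```python
import re

# Keyword patterns paired with templated responses, in the style of
# early pattern-matching chatterbots; the rules here are illustrative.
RULES = [
    (re.compile(r"i feel (.+)", re.I), "Why do you feel {0}?"),
    (re.compile(r"i am (.+)", re.I), "How long have you been {0}?"),
]

def respond(utterance):
    """Echo a captured phrase back inside a pre-determined question,
    creating an illusion of understanding without any real analysis."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(match.group(1).rstrip(".!?"))
    return "Please tell me more."
```

Notice that no meaning is extracted at any point: the “understanding” is entirely an artefact of pattern matching and phrase substitution.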

6.5     Anti-Plagiarism Tools

Text Analysis and textual understanding is at the heart of most anti-plagiarism tools. To take an
example from LSBU’s own efforts in this area, Thomas Lancaster (my old Java lecturer!) has
developed a number of tools, one of which is “Text Analysis Tool (TAT), a system which
presents a rolling representation of the stylistic properties of a submission to find areas that are
likely to represent extra-corpal plagiarism” [Lancaster 2002].

This is based around syntactic similarities in two target texts. Some tools will use this core
technique and provide visual representations of the results (VAST), while others will simply
present the two offending articles in a way that allows the user to make a decision more easily
on the material in front of them (SSS, TRANK). This is an obvious labour-saving tool for
teachers and lecturers…and a source of fear and loathing for students the world over!
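One simple way to measure such syntactic similarity (a generic sketch, not the specific method used by TAT, VAST or the other tools named above) is overlap of word n-grams between the two target texts:

```python
def ngram_overlap(text_a, text_b, n=3):
    """Jaccard similarity over word n-grams: a crude stand-in for the
    syntactic comparison that anti-plagiarism tools perform."""
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    a, b = ngrams(text_a), ngrams(text_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A score near 1.0 flags two passages as near-identical, while unrelated passages score near 0.0; a human would still review flagged pairs before accusing anyone.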




3 Visit the AAAI website for an extensive listing of Chatterbots (see reference section)


7.       What the Future Holds (in my opinion)
This is still an emerging and exciting area of Computer Science, with research being undertaken
to improve quality in all the areas mentioned in the sections above. There is a real sense that a
number of advancements are still possible in automating many of the processes currently
undertaken by humans. There is also a business requirement driving research and development.
Regardless of whether we approve of it or not, we now live in a truly global village, with a need
for fast, efficient and above all accurate tools to assist in our international electronic trading
environment.

Below is a summary of some of the major technological advancements I predict will occur in the
near to middle future. Some are logical progressions of the current technology available from
commercial companies. Others are slightly more experimental, in that they may require a radical
approach to the way we look at the problems before developing viable solutions.

   •   We should start to see more enterprise solutions with Text Analysis at the core. There is
       a lot of development in areas of speech-to-text and translation, with the goal being
       combined technologies, transparent to the user, with seamless input of foreign speech
       to English textual output.

     •   Convergence of NLP and IR technologies with more efficient methods to search for
         information in other languages, understanding the inferences of our query and returning
         results in an intelligent way.

   •   Although English has gradually been accepted as the lingua franca in most industries,
       there is a real need for translation tools, especially those dedicated to textual translation
       and the associated fields of retrieving that information. The EU alone dealt with
       1,416,817 pages of text translation relying on text analysis in 2003, a rise of 9.4% from
       2002 [EU DGT Annual Activity Report 2003], and other large bodies (both governmental
       and NGOs) are increasing budgets in the area to cope. This will be a major growth area
       and as such technological developments will follow.

   •   Email dialogue systems that take automatic input from mail servers, carry out a Text
       Analysis process, interrogate associated applications such as diary programs, address
       books etc., and output a response in the form of replies to the original emails. An example
       would be meeting planners trying to arrange for everyone in a special interest group
       (SIG) to attend a quarterly meeting. One email is sent out by the chair suggesting dates;
       once it is received, each recipient’s computer carries out the process described above and
       returns an email to the chair with their availability. This may continue through a number
       of iterations before everyone is available, but the key thing is that the process has been
       automatic and independent of human intervention.

There are of course many more examples; however, these seem logical and doable considering
the current level of knowledge and application of technology.




8.     Conclusions
As users and producers of information, we have created a spiral of ever increasing quantities of
stored data. With the amount of directly accessible data available to us today, we have had to
find new ways in which to manage and review this overabundance of textual information.

In this paper we have started to explore how a computer can be used to manage this problem
through identifying the context and importance of textual input. We’ve looked at the applications
of Text Analysis as part of NLP and within the field of Information Retrieval and explored
possible convergence in these areas. The key areas and technologies presented above are a
mere introduction and do not constitute an in depth view. By showing how they broadly connect
with Text Analysis and providing further resources (see reference section) I hope that the reader
will further research areas that directly interest them.

It is obvious to me from the study carried out for this project that the areas of text analysis and
textual understanding within AI are far from complete. Research and development continues in
universities, government and industry, trying to achieve better, more natural responses to the
inputs we give machines. There are many approaches to this task, and we’ve briefly looked at
rule-based analysis of text and also example-based machine learning, as well as hybrids of the
two.

At the beginning of this coursework report I posed a question: does Text Analysis constitute
actual understanding of the textual input or simply an electronic approximation of
understanding? The answer, in my opinion, is that advancements in the technology have come a
long way in the past twenty years, but we have not developed, and are nowhere near
developing, machines that understand in the way we understand. The question we should really
ask is: can we ever truly develop understanding in computers? I’ll conclude with the words of a
philosopher rather than a scientist.

       “It is a very remarkable fact that there are none so depraved and stupid, without
       even excepting idiots, that they cannot arrange different words together,
       forming of them a statement by which they make known their thoughts; while on
       the other hand, there is no animal, however perfect and fortunately
       circumstanced it may be, which can do the same…”

                                                                 René Descartes (1637)

       …What would he have made of man’s attempts to program machines to do the
       same!




9.        References


Eisenberg A. 2003

“Get Me Rewrite!” “Hold On, I’ll Pass You to the Computer.”

Article from The New York Times December 25, 2003



Mani I. 2001.

Automatic Summarisation.

John Benjamins Publishing Company, Amsterdam/Philadelphia.



Minsky M. L. 1968.

Semantic Information Processing

MIT Press



Yule G. 1985

The Study Of Language

Cambridge University Press



9.1       Web References

      o   Avron Barr

          http://www.aaai.org/Library/Magazine/Vol01/01-01/vol01-01.html

          Last Updated: Spring 1981

          Downloaded: 19, November 2004

          A paper entitled “Natural Language Understanding” by Barr of Stanford University.
          Although published many years ago it still has a wealth of useful information, especially
          good is the history of NLP (page 2)




   o   Duke University

       http://www.duke.edu/~mccann/mwb/15semnet.htm

       Last Updated: Unknown

       Downloaded: 16, November 2004

   o   EAMT

       http://www.eamt.org/mt.html

       Last Updated: 3, June 2004

       Downloaded: 20, November 2004

       Home page of the European Association of Machine Translation, this site is a good
       guide and technical resource in the area of MT.

   o   European Union Annual Activity Report 2003

       http://europa.eu.int/geninfo/query/engine/search/query.pl(AAR 2003 - Report ONLY.doc)

       Last Updated: 1, April 2004

       Downloaded: 21, November 2004

       Report into EU translation activities – very dry but full of useful statistics.

   o   HALLoGRAM Publishing

       http://www.hallogram.com/textanalyst/

       Last Updated: Unknown

       Downloaded: 7, November 2004

       A commercial site promoting a product called TextAnalyst. Has many information and
       definition pages. Very good for an introduction to the subject.

   o   Dave Inman

       http://www.scism.sbu.ac.uk/inmandw/tutorials/nlp/ambiguity/ambiguity.html

       Last Updated: 25, February 1997

       Downloaded: 21, November 2004

       All pages within the NLP tutorial site are directly relevant to text analysis and the wider
       fields of IR and NLP. A very good place to start any research in the area.



   o    Harold Klein

        http://www.textanalysis.info/terms.htm

        Last Updated: 19, May 2002

        Downloaded: 30, October 2004

        A really good general site housing a huge range of information on text analysis. An
        excellent place to start any research into the subject.

   o    Thomas Lancaster

        http://www.radford.edu/~sigcse/DC01/participants/lancaster.html

        Last Updated: 2002

        Downloaded: 20, November 2004

        A former PhD student at LSBU, Thomas developed a number of tools in the area of anti-plagiarism.
        For how this links in with TA in more detail, either follow the above link or type
        his name and the term “anti plagiarism” into Google.

   o    Sergei Ananyan, Alexander Kharlamov

        http://www.megaputer.com/tech/wp/tm.php3#nav

        Last Updated: 2004

        Downloaded: 7, November 2004

        Automated Analysis of Natural Language Texts white paper offering all round useful
        information.

9.1.1   General Web Resources

These links act as an additional resource on the subject of Text Analysis within Information
Retrieval. Most were used as general background reading for this coursework, some were
quoted above and some were not used at all.

   http://www.textanalysis.com/ - VisualText™ - part of Text Analysis International, Inc.

   •    Incorporated company providing products in Info Extraction and NLP. Very useful for the
        FAQ section: http://www.textanalysis.com/FAQs/faqs.html and the two main product
        sections that give an overview of current technology capabilities:
        http://www.textanalysis.com/Products/products.html

   http://www.intext.de/eindex.html - Social Science Consulting (English language)

   •    Very good site dedicated to all things TA. Has a very good history section and
        applications for this technology.


   http://www.semantic-knowledge.com/tropes.htm - Semantic Knowledge Product Site

   •    Another product on the Market, this time from Semantic Knowledge. Tropes offers TA
        plus IR technologies as a joined up product. Not much in the way of background reading.
        There is some useful info on how the engine works though: http://www.semantic-
        knowledge.com/fonction.htm

   http://www.megaputer.com/tech/wp/tm.php3 -

   •    White paper from Megaputer™ (company offering the TextAnalyst product). Good
        background and introduction to the subject. Very useful section on the history of the
        subject: http://www.megaputer.com/tech/wp/tm.php3#history, and on new opportunities
        looked at from the business perspective.

   http://www.scism.lsbu.ac.uk/inmandw/tutorials/nlp/index.html

   •    Paper by Dave Inman on the complexities and possibilities of NLP via computers. Not
        specifically aimed at my coursework area, but it raises very useful questions and helps
        add context to TA within the wider field of IR. The two key links off this page are
        “can computers understand language?” and “does the structure of language help NLP?”

   http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/0.html

   •    What appears to be a very useful repository on all things NLP from Carnegie Mellon
        University. It is fairly out-of-date material but good for general background.

   http://www.loquendo.it/en/index.htm

   •    English homepage of Loquendo, an Italian company providing application tools involving
        speech to text and vice versa.

9.1.2   General Book Resources

        As well as the above links, there is also a good grounding to be found in the core
        textbook, FOA: A Cognitive Perspective on Search Engine Technology and the WWW
        (Belew R. K.), and in the secondary textbook, Information Retrieval
        (Van Rijsbergen C. J.).

        This second book is available on the www and has a specific section on Automatic Text
        Analysis: http://www.dcs.gla.ac.uk/Keith/Preface.html





Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Notes.doc.doc

  • 1. IR&NLP Coursework P1 Text Analysis Within The Fields Of Information Retrieval and Natural Language Processing By Ben Addley 2003695 Academic Year 2004 - 2005
  • 2. Ben Addley IR&NLP Coursework P1
Abstract
As users and producers of information, we have created a spiral of ever-increasing quantities of stored data. With the amount of directly accessible data available to us today, we need to find new ways to manage and review this overabundance of textual information.
Text analysis, and in the case of this coursework Automatic Text Analysis, is the study of how a computer can be used to identify the context and importance of text without human intervention. This paper introduces the technical and linguistic terminology necessary for an understanding of the role Text Analysis plays in the two core application areas of Information Retrieval (IR) and Natural Language Processing (NLP).
Text analysis is not new: understanding what fundamentally links words, sentences, passages and even whole books has been of interest since the Middle Ages. The first advances in purely computer-based TA came during the 1950s; however, research into computational linguistics and the "holy grail" of an understanding machine began in the 1940s, right at the birth of modern computing.
The basic building block of text analysis is textual understanding, and in particular the language it is built upon. Text analysis is based around two differing approaches. The first is rule-based: text is parsed through a series of modular analytical components, each concentrating on a particular component of natural language (morphemes, words, syntax, semantics, etc.) and calling upon a number of pre-defined rules. The other is statistics-based learning (sometimes referred to as machine or example-based learning): the computer takes in huge amounts of text and, by comparing common compositions of texts, gradually "learns" (through statistical analysis) how language is formed.
As well as the two approaches to the underlying technology (rule- and statistics-based systems), a "third way" exists: hybrid systems combining the two to achieve more effective results. This is still an emerging and exciting area of Computer Science, with research being undertaken to improve quality in all areas. There is a real sense that a number of advances are still possible in automating processes currently undertaken by humans. There is also a business requirement driving research and development: regardless of whether we approve of it or not, we now live in a truly global village, with a need for fast, efficient and above all accurate tools to assist our international trading environment.
Keywords: Text Analysis, Textual Understanding, Information Retrieval, Natural Language Processing, Example Based Machine Learning, Rule Based Machine Learning.
  • 3. Contents
Abstract
Discussion Notes
Questions for Discussion
1. Introduction
2. History of Text Analysis
   2.1 The Innovators
3. The Building Blocks of Textual Understanding
   3.1 Rule Based Approach
   3.2 Statistical Based Approach
   3.3 Ambiguity
4. Text Analysis: Technology Overview
   4.1 Pre-processing
   4.2 Product Types
       4.2.1 Language
       4.2.2 Content
5. Text Analysis Within the Field of Information Retrieval
   5.1 Distilling the Meaning of a Document
   5.2 Text Based Navigation
   5.2 Topic Structure
   5.3 Clustering
   5.4 Automatic Textual Summarisation
       5.4.1 Extraction
       5.4.2 Abstraction
6. Text Analysis Within the Field of Natural Language Processing
   6.1 Machine Translation
   6.2 Speech to Text Systems
   6.3 Text to Speech Systems
   6.4 Chatterbots (Understander Systems)
   6.5 Anti Plagiarism Tools
7. What the Future Holds (In My Opinion)
8. Conclusions
9. References
   9.1 Web References
       9.1.1 General Web Resources
       9.1.2 General Book Resources
Figure 1: Computerised Understanding of Language, Winograd T., Stanford AI Series
Figure 2: Example of Topic Structure (Report Wide)
  • 5. Discussion Notes
Article from The New York Times, December 25, 2003: 'Get Me Rewrite!' 'Hold On, I'll Pass You the Computer.' By Anne Eisenberg.
In the famous sketch from the TV show "Monty Python's Flying Circus," the actor John Cleese had many ways of saying a parrot was dead, among them, "This parrot is no more," "He's expired and gone to meet his maker," and "His metabolic processes are now history." Computers can't do nearly that well at paraphrasing. English sentences with the same meaning take so many different forms that it has been difficult to get computers to recognize paraphrases, much less produce them.
Now, using several methods, including statistical techniques borrowed from gene analysis, two researchers have created a program that can automatically generate paraphrases of English sentences. The program gathers text from online news services on specific subjects, learns the characteristic patterns of sentences in these groupings and then uses those patterns to create new sentences that give equivalent information in different words.
The researchers, Regina Barzilay, an assistant professor in the department of electrical engineering and computer science at the Massachusetts Institute of Technology, and Lillian Lee, an associate professor of computer science at Cornell University, said that while the program would not yield paraphrases as zany as those in the Monty Python sketch, it is fairly adept at rewording the flat cadences of news service prose. Give it a sentence like "The surprise bombing injured 20 people, 5 of them seriously," Dr. Barzilay said, and it can match it to equivalent patterns in its databank and then produce a handful of paraphrases. For instance, it might come up with "Twenty people were wounded in the explosion, among them five in serious condition."
Programs that can detect or crank out multiple paraphrases for English sentences could one day have wide use.
They might help create summaries of reports or check a document for repetition or plagiarism. Questions typed into a computer with such a program might in the future be automatically paraphrased to make it easier for a search engine to find data. Such programs might even be an aid to writers who want to adapt their prose to the background of their readers.
Dr. Lee said the researchers had thought about using it "as a kind of 'style dial'" to rewrite documents automatically for different groups, adapting articles on technical subjects for a children's encyclopaedia, for example. She cautioned, however, that the work was preliminary and much more research was needed before it might be available for practical use.
Fernando Pereira, chairman of the computer and information science department at the University of Pennsylvania, said that the paraphrasing work had given him pause. "It's a little bit humbling if you have the idea that we are creative when we write," he said, only to discover that one's special turns of phrase have already been tried by hundreds of other writers and can be found online. "The real insight of this work," he said, "is that if there is a way of saying something, someone has already said it."
  • 6. Questions for Discussion
1. The program outlined in the article above gathers text from online news services, learns the characteristic patterns of sentences in these groupings and then uses those patterns to create new sentences that give equivalent information in different words. Do you think a program like this can only operate within a limited domain like news reports, or can it be applied to wider subjects?
2. What are your thoughts on Dr Fernando Pereira's comment: "The real insight of this work is that if there is a way of saying something, someone has already said it"?
3. What are the potential problems with applications that adapt prose or paraphrase sentences? Do you think that context or subtle meaning might be lost if text is parsed through software as described in the article?
4. Do you think that tools such as Barzilay and Lee's are applications to aid humans in the work they do, or should they be used instead of humans? What are the problems with the latter option, and are there any ways in which other areas of NLP could be applied to assist?
  • 7. 1. Introduction
What is Text Analysis? A very basic question which you would expect to have a very basic answer. Unfortunately this particular area, concerned with the textual understanding of a document, is more challenging, and a two-line definition simply does not suffice. In this paper I will introduce the technical and linguistic terminology necessary for an understanding of the role Text Analysis plays in the two core application areas of Information Retrieval (IR) and Natural Language Processing (NLP). I will not cover the complicated and highly technical processes which go into making Text Analysis actually work; what is presented below constitutes a mere introduction and should allow the reader to research areas of interest in more depth.
In its rawest sense, text analysis is key to any IR or NLP process. As users we must know some information about what it is we are trying to retrieve; this comes from perhaps the title of the work, the author's name, section or chapter headers and of course the content. As human beings we are able to ingest and process this data in a number of sophisticated ways, the most important of which is our cognitive ability to understand the context and importance of certain words and phrases. Text analysis, and in the case of this coursework Automatic Text Analysis, is the study of how a computer can be used to identify the context and importance of text without human intervention.
Wouldn't it be nice not to have to rely on others' IT literacy when searching for documents via search engines or the web? To be able to simply enter a phrase, question or query and have a computer return not just documents that contain those keywords, but somehow understand what it was you were trying to look for and return only those documents you were actually seeking?
These techniques, as well as being fundamental to IR, can aid and extend the effectiveness of other areas of research within NLP: they can aid interpretation, textual navigation, translation and speech-based tools, and facilitate word/phrase-spotting programs. As our world becomes a smaller place, with increased and more effective communication networks, we have to utilise these techniques to add value to our interactions. Asking a natural question and receiving a natural, sensible and contextual response from a machine is an important goal in our global community. Text Analysis plays a fundamental role in achieving that goal, but it is a complicated one!
This coursework will attempt to answer some of the basic questions involved in how a computer achieves this feat of Artificial Intelligence (AI). It will also look at where Text Analysis technology currently resides within the fields of IR and NLP, and what the future might hold if we continue to research and develop in this exciting and challenging area. A question you may wish to bear in mind whilst reading this document is: does Text Analysis constitute actual understanding of the textual input, or simply an electronic approximation of understanding?
  • 8. 2. History of Text Analysis
Text analysis is not new: understanding what fundamentally links words, sentences, passages and even whole books has been of interest since the Middle Ages. The first threads of text analysis can be followed back to medieval biblical scholarship. Intellectuals would try to find parallels between the New and Old Testaments, linking passages according to places, periods and people. This resulted in the first concordances1, a tool still used in computer-based text analysis today. The first advances in purely computer-based TA came during the 1950s; however, research into computational linguistics and the "holy grail" of an understanding machine began in the 1940s, right at the birth of modern computing.
2.1 The Innovators
"In 1949, Warren Weaver proposed that computers might be useful for 'the solution of world-wide translation problems' and the resulting effort, called machine translation, attempted to simulate with a computer the presumed functions of a human translator" [Avron Barr 1980]. IBM first demonstrated a basic word-for-word translation machine in 1954, but these early attempts at machine translation failed because of the simplistic idea that word-equivalency techniques and sentence re-ordering would suffice for a translation machine.
AI research took on new ideas, the most important of these being textual understanding. The work of Chomsky in the field of linguistic theory, coupled with advances in programming languages in the 1960s, heralded a surge in AI/NLP work. By the 1970s the approach was to model human language as knowledge-based systems and, through understanding how language works, to build up rules that apply to those systems. Terry Winograd, writing in 1972, "groups natural language programs according to how they represent and use knowledge of their subject matter" [Avron Barr 1980].
He proposed four historical groupings based on this approach. The first programs that attempted to analyse and understand textual input were built on limited domains (BASEBALL, ELIZA, etc.). These were followed by systems that used semantic memories and indexing to retrieve and understand words or phrases. A third approach, during the mid-to-late sixties, comprised limited logic systems; finally a fourth group, knowledge-based systems, used first-order logic and semantic nets, such as William Woods's LUNAR program and Winograd's SHRDLU system.
More recently we have seen a shift away from the ideas of Winograd and the early pioneers of textual understanding within NLP. With increased processor speeds and the advent of supercomputing, new methods and theories have been developed in example-based machine learning. Today the arguments rage about the respective merits of the two main approaches, with new developments in hybrid systems combining the two.
1 In its most basic form, a concordance is an index that includes a line of context against each entry or occurrence of a word.
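The concordance described in footnote 1 can be sketched in a few lines. The `kwic` function below is a hypothetical illustration of a key-word-in-context index, not a reconstruction of any historical system:

```python
def kwic(text, keyword, width=20):
    """Key-word-in-context index: collect every occurrence of `keyword`
    together with up to `width` characters of context on each side."""
    entries = []
    lower, key = text.lower(), keyword.lower()
    start = lower.find(key)
    while start != -1:
        left = max(0, start - width)
        right = min(len(text), start + len(key) + width)
        entries.append(text[left:right])
        start = lower.find(key, start + 1)
    return entries

# Two occurrences of "analysis", each shown in its surrounding context.
entries_found = kwic("Text analysis is not new; analysis of text is old.", "analysis")
```

Each entry pairs the keyword with a line of context, exactly as the medieval concordances did for biblical passages.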
  • 9. 3. The Building Blocks of Textual Understanding
The basic building block of text analysis is textual understanding, and in particular the language it is built upon. When we discuss the fundamentals of this subject we must also understand the guiding principles of language make-up. We are unique among all other species in that we do not just communicate through signals (as other creatures are capable of) but use sophisticated language properties to do so. Artificial Intelligence (AI) is the branch of Computer Science that investigates and develops models and tools to replicate this "behaviour" in machines. AI has been defined as "the science of making machines do things that would require intelligence if done by men" [Minsky 1968]. Winograd proposed a series of pre-defined stages that must be followed for computerised understanding of language to occur. These follow closely the traditional linguistic approach to word-formation processes.
3.1 Rule Based Approach
Figure 1 (below) demonstrates this logical, rule-based approach to textual understanding: written language is parsed through a series of modular analytical components, the output of one module acting as the input to the next until the process is complete. Each analytical area concentrates on a particular component of natural language (morphemes, words, syntax, semantics, etc.) and calls upon a number of pre-defined rules (shown in the ellipses) to judge whether that component of text fits a particular rule. Once identified, it is passed on to the next module for further analysis; if it later proves incompatible or incorrect (usually at the semantic or pragmatic stage), it is passed back to a previous layer. The disadvantage of such an approach is that it is language-dependent and will need to be re-modelled for each additional language you want to perform textual analysis on.
The other main problem is that it requires large amounts of initial human intervention in the programming and rule-development stage, which ties in with the first problem. Such problems have an impact on applications, as we shall see later in this paper.
3.2 Statistical Based Approach
Another name for this approach is machine learning or example-based learning. It requires the computer to parse huge amounts of text and, by comparing common compositions of texts (usually domain-specific), gradually "learn" through statistical analysis how syntax and certain bigrams and trigrams of words and phrases within a language are formed. The advantage of this method is that it is language-independent and requires little human intervention: in theory, given a large enough corpus in any language, you can teach the system that language in a relatively short period of time. The disadvantage comes from the scale of corpus required to contain enough compositions and varied bigrams and trigrams to develop a sufficient understanding of the language.
  • 10. Figure 1: Computerised Understanding of Language, Winograd T., Stanford AI Series
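As a rough sketch of the modular pipeline in Figure 1, each module's output feeds the next; the stage names, rules and vocabularies here are invented for illustration and are not Winograd's actual components:

```python
def morphological(tokens):
    """Morphological stage: apply a (tiny, illustrative) rule set that
    strips a known suffix to expose each word's stem."""
    suffixes = ("ing", "ed", "s")
    out = []
    for t in tokens:
        for s in suffixes:
            if t.endswith(s) and len(t) > len(s) + 2:
                t = t[:-len(s)]
                break
        out.append(t)
    return out

def syntactic(tokens):
    """Syntactic stage: a rule requiring at least one known verb stem;
    failure would be 'passed back to a previous layer' in a real system."""
    verbs = {"learn", "parse", "analys"}  # illustrative rule vocabulary
    if not any(t in verbs for t in tokens):
        raise ValueError("no verb found: reject and pass back")
    return tokens

def analyse(sentence):
    """Chain the modules, as in Figure 1: output of one feeds the next."""
    tokens = sentence.lower().split()
    return syntactic(morphological(tokens))

result = analyse("computers parse texts")
```

Adding a new language means rewriting every rule set, which is exactly the language dependence criticised in section 3.1.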
  • 11. 3.3 Ambiguity
"Ambiguity increases the range of possible interpretations of natural language, and a computer has to find a way to deal with this." [Inman D. 1997] This is another key issue for Text Analysis, as computers have to make choices about how words and phrases are interpreted. It is an easier problem to overcome with an example-based learning approach.
4. Text Analysis: Technology Overview
Within Text Analysis we have seen the two approaches to the underlying technology (rule- and statistics-based systems). There is of course a third way: a hybrid method combining the best elements of the rule-based and statistics-based methodologies.
4.1 Pre-processing
When analysing target text it is helpful to carry out automatic pre-processing to strip away the words with no semantic meaning. This can aid example-based learning, as unusual or one-off bigrams/trigrams are negated. Another approach is stemming, which identifies prefixes, suffixes and endings (morphemes), leaving just the stem (or core meaning) of the word. For example, learn is the stem of learning; by concentrating on the stem, both terms are identified as having the same core, thereby improving the analysis of the text.
4.2 Product Types
There is a rich variety of software that supports the general task of Text Analysis within the different disciplines of Human Computer Interaction (HCI). Given this variety, it is helpful to set out a brief general classification of the main areas. The classification below, taken from Harold Klein's discussions after the Acapulco ICA conference in 2000, breaks text analysis software down into two broad areas.
Firstly, language and its make-up; secondly, content, which deals with the "what" being communicated:
4.2.1 Language
Dealing with the use of language, of which there are two further sub-categories:
• Linguistic: applications such as parsing and lemmatising words2
• Data bank: information retrieval in texts, indexers, concordances, word lists, KWIC/KWOC (key-word-in-context, key-word-out-of-context)
4.2.2 Content
Dealing with the content of human communication, mainly texts.
• Qualitative: looking for regularities and differences in text, exploring the whole text
2 Lemmatising means grouping related words together under a single headword. A lemmatiser (tool) allows you to define groups of related words and then apply your groupings to words displayed in the wordlist.
  • 12. Ben Addley IR&NLP Coursework P1 (QDA - qualitative data analysis). A few programs allow the processing of audio and video information also. • Event data: analysis of events in textual data • Quantitative: analyse the text selectively to test hypotheses and draw statistical inferences. Output is a data matrix that represents the numerical results of the coding. o Category systems: provided by the software developer (instrumental) or by the researcher (representational), this is selective, only search patterns are searched in the text and coded. Software packages with built-in dictionaries are often language restricted, some have limits on the text unit size and are restricted to process responses to open ended questions but not to analyse mass media texts. The categories can be thematic or semantic; this can have implications on the definition of text units and external variables. o No category system: using co-occurrences of words/strings and/or concepts, these are displayed as graphs or dendrograms. o For coding responses to open ended questions only: these programs cannot analyse huge amount of texts, they fit for rather homogeneous texts only and are often limited in the size of a text unit. [Klein 2002] Obviously the above is just one interpretation of the many approaches taken in the field of Text Analysis. 5. Text Analysis Within the Field of Information Retrieval So we now know a little about what text analysis is and the linguistic theory behind the concept. The key question now, is how do we transform this powerful concept into products that can actually help us in our day-to-day lives? Text analysis is already used in many commercial systems and the first part of this section will investigate where the technology currently lies, what it is used for and who is using it. 
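Returning to the pre-processing described in section 4.1, here is a minimal sketch of stop-word removal and crude suffix stemming. The stop list and suffix rules are invented for illustration; a real system would use a proper algorithm such as the Porter stemmer:

```python
STOP_WORDS = {"the", "a", "an", "of", "is", "by", "and"}  # illustrative list

def stem(word):
    """Crude suffix stripping to expose the stem, e.g. learning -> learn.
    Real systems use a proper stemmer (e.g. Porter's algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    """Strip semantically empty words, then reduce the rest to stems."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [stem(t) for t in tokens]

terms = preprocess("The learning of texts is aided by stemming")
```

Note how "learning" collapses to "learn", so both forms share one core term, which is exactly the benefit claimed in section 4.1.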
5.1 Distilling the Meaning of a Document
Making informed and ultimately correct decisions in our busy working lives often requires analysing large, time-consuming volumes of textual information. Students, researchers and professionals (such as analysts, lawyers and editors) face various TA tasks, all requiring the extraction of the core meaning of a document.
In today's "information rich" environment, huge piles of information build up in traditional repositories held in libraries, businesses, individual PCs and of course the ever-ubiquitous World Wide Web. The amount of information being produced and stored is growing at an extraordinary rate, with some predictions stating that, at current growth rates, we will have more information than atoms to store it on! Whether you believe the doomsday-like prophecies or not, it is a fact that human beings are increasingly unable to meet the challenges of this growth. "Mankind is searching for intelligent electronic assistants to help with text analysis projects" [HALLoGRAM Publishing, 2000]. In particular we require help to derive the semantic value of a document in a concise form. Once achieved, we can apply that knowledge to a number of other applications, as we shall discuss throughout the rest of this section.
5.2 Text Based Navigation
To understand this application of Text Analysis we need to understand semantic networks. "Semantic networks are knowledge representation schemes involving nodes and links between nodes" [Duke University]: essentially a conceptual web of linked nodes that point to all other nodes containing listed objects. "Concepts stored in the semantic network are hyperlinked to those sentences where they have been encountered, and the sentences are in turn hyperlinked to the places in the original text from where they have been retrieved" [Sergei Ananyan, Alexander Kharlamov 2004].
By using Text Analysis principles to build semantic networks automatically we can efficiently navigate through stored texts and usefully linked documents, a core application within the wider field of Information Retrieval. This can be applied simultaneously to multiple documents, creating a very powerful IR tool. Direct applications can be found in website design and navigation, and I have myself investigated a version of this tool for my BSc final project, using Microsoft Indexing Services (part of the MS IIS suite of administrative tools) to index and analyse documents stored within a catalogue. It analyses the textual content of documents, sorting it into categories, which can then be searched using a simple query language.
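The hyperlinked concept-to-sentence structure described above can be sketched as a pair of dictionaries. This is an assumption-laden toy (splitting sentences on full stops, treating any word longer than three letters as a "concept"), not how Microsoft Indexing Services actually works:

```python
def build_semantic_network(documents):
    """Map each 'concept' to the sentences where it occurs, and each
    sentence back to its source document, mirroring the hyperlink
    structure described by Ananyan and Kharlamov."""
    concept_to_sentences = {}
    sentence_to_doc = {}
    for doc_id, text in documents.items():
        for sentence in text.split("."):
            sentence = sentence.strip()
            if not sentence:
                continue
            sentence_to_doc[sentence] = doc_id
            for word in sentence.lower().split():
                if len(word) > 3:  # crude concept filter
                    concept_to_sentences.setdefault(word, []).append(sentence)
    return concept_to_sentences, sentence_to_doc

docs = {"d1": "Text analysis aids retrieval. Computers parse text.",
        "d2": "Semantic networks link concepts."}
concepts, origins = build_semantic_network(docs)
```

Navigation is then two dictionary lookups: from a concept to every sentence mentioning it, and from a sentence back to its original document.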
5.3 Topic Structure

What do we mean by topic structure? We can use text analysis to identify the most relevant and significant concepts in the semantic network of a text and transform them into a tree-like structure of topics sorted by importance. The limbs of the tree represent relations between headings and content in the text. Some of these limbs are strong and form the basis of the structure; others are weaker, often irrelevant or indirect, and need to be replaced with more direct ones.

To take this paper as an example: I have an introduction with no nested topics, followed by a series of topics, some with sub-topics and areas of interest. Built up over the document as a whole, this constructs a visual structure revealing a hierarchy of themes within the text, which can then be used as a powerful information retrieval method. Below is a tree-like listing of this coursework as viewed through a topic structure. Main headings hug the left-hand side (red line), with secondary (blue) and tertiary (green) topics indented to the right; content text is omitted. This is a rather simplistic model but it demonstrates the point.
Figure 2: Example of Topic Structure (Report Wide)
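The figure above can also be represented programmatically as a nested tree and printed with indentation, main headings on the left and sub-topics stepped to the right. A minimal sketch, with topic names invented to echo this report's structure:

```python
# A topic is a (name, subtopics) pair; subtopics are themselves topics.
# topic_lines() renders the tree with two spaces of indent per level.
def topic_lines(topic, depth=0):
    name, subtopics = topic
    lines = ["  " * depth + name]
    for sub in subtopics:
        lines.extend(topic_lines(sub, depth + 1))
    return lines

report = ("IR&NLP Coursework", [
    ("Introduction", []),
    ("Text Analysis in IR", [
        ("Text Based Navigation", []),
        ("Topic Structure", [("Example", [])]),
    ]),
    ("Conclusions", []),
])
print("\n".join(topic_lines(report)))
```

In a real topic-structure tool the tree would be derived automatically from the semantic network, with weak limbs pruned or replaced, rather than written out by hand as here.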
5.4 Clustering

Clustering builds upon the previous technology of topic structure but goes one stage further. A topic structure severs the links representing weak relations in the text and substitutes certain indirect relations with direct ones. With clustering, on the other hand, links that fall below a pre-defined strength threshold are eliminated altogether. This breaks collected texts up into individual groups that more clearly represent a common subject or theme, allowing documents to be grouped into particular subject areas and facilitating searching, indexing and analysis on that theme as well as on the original topic structure.

5.5 Automatic Textual Summarisation

"The goal of automatic summarisation is to take an information source, extract content from it, and present the most important content to the user in a condensed form and in a manner sensitive to the user's or application's need." [Inderjeet Mani 2001]

Textual summarisation techniques have their historical roots in the 1960s; however, with the proliferation of the Internet and the growth in document production, new interest in the technology has developed. Broadly speaking, techniques can be divided into two categories.

5.5.1 Extraction

Extraction techniques, in general, simply copy the information deemed most important into a summary. There are numerous methods for carrying out extraction summarisation; one important and widely used methodology, both robust and accurate, uses algorithms to score individual sentences in the target text based on the important semantic concepts they contain. The larger the number of concepts, the stronger those concepts are, and the closer the relationships between them, the higher the semantic weight (or score) of the sentence.
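The sentence-scoring idea can be sketched as follows. The concept weights here are invented for illustration; a real system would derive them from a semantic network of the document rather than from a hand-written table:

```python
# Extraction-style summarisation sketch: score each sentence by the
# combined weight of the concepts it contains, then keep only the
# sentences that score above a selection threshold, in original order.
CONCEPT_WEIGHTS = {"retrieval": 0.9, "analysis": 0.8, "text": 0.7}

def score(sentence):
    words = sentence.lower().replace(".", "").replace(",", "").split()
    # Unknown words get a small default weight
    return sum(CONCEPT_WEIGHTS.get(w, 0.1) for w in words)

def summarise(sentences, threshold):
    return [s for s in sentences if score(s) >= threshold]

doc = [
    "Text analysis supports information retrieval.",
    "It was a sunny day.",
    "Retrieval systems score text by concept weight.",
]
print(summarise(doc, threshold=1.5))
```

Running this keeps the two concept-rich sentences and drops the irrelevant one; raising or lowering `threshold` shortens or lengthens the summary, which is exactly how summary size is controlled in the tools described here.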
The summarisation tool then sorts the sentences and collects only those which fall above a pre-set score or weight, resulting in a truncated piece of text that summarises the original...hopefully accurately! "The size of the summary is controlled through changing the sentence selection threshold. An advanced algorithm used for developing an accurate semantic network ensures the high quality and relevance of the created summary." [Sergei Ananyan, Alexander Kharlamov 2004] The same concept can be applied to paragraphs and other units of text within documents.

5.5.2 Abstraction

Abstraction, in its purest sense, is the process of distancing objects from ideas. In the context of summarisation it involves paraphrasing sections of the source document. Abstraction can usually condense a text more thoroughly than the extraction process discussed above, but the programs that do this are generally harder to write and implement. An example of abstraction in summarisation would be to "understand" the concept of a sentence or phrase.

So how do we use automatic textual summarisation, and what is it good for? There are numerous practical applications and professions using the technology in everyday tasks, probably the most common of which is the news and broadcasting industry, where summarisation is applied to newspaper articles and scientific and technological journals. Another
important and widely used application is in search engine technology and IR. Automatic textual summarisation is a "cross over" application, used in both IR and NLP.

6. Text Analysis Within the Field of Natural Language Processing

In this section I'll discuss some of the realised applications and products for Text Analysis technology, along with some current and future advancements. TA is just one part of the wider AI problem we face within NLP. Core to the issue is how we can develop systems able to interpret the way we, as humans, communicate. Text analysis is the basis of many technologies besides information retrieval: machine translation, for example, is based on the textual analysis of data, as are text-to-speech engines and language spotters, all discussed further in this section.

6.1 Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains." [EAMT June 2004]

MT is probably the best known of the text analysis applications in existence today, but it remains one of the most intangible. Although there have been many advances in the field in recent years, products are still far from perfect and often cannot deal with truly natural language such as colloquialisms. The primary role of MT software is two-fold:

1. Providing gisting (rough translation) to non-native speakers of a particular language, enabling them to gain an understanding of a document and whether it is of relevance to them.
2. Forming part of professional transcription and translation workbenches.
The second application allows users with knowledge of a language to filter and triage large amounts of text. This frees up the professional's time, gives them a head start on the target text and enables them to work more efficiently. There are other associated tools in this area, such as translation memories and glossaries, that also rely on text analysis techniques; however, these fall outside the main scope of this project.

6.2 Speech to Text Systems

I'm sure we've all now been exposed to the technological wonder that is speech-to-text. Every time we call our bank or gas supplier we face the prospect of an automated voice asking us to say which service we want. After repeatedly asking for our balance, we are invariably put through to the section selling services of one description or another! It's true that the technology doesn't seem very accurate, but big business has identified it as a major efficiency saving, and that is the driver behind it. There are, of course, other more refined applications for the technology: language recognition
and speech recognition, biometric tools and speaker verification tools. There are links to companies providing these applications in the general web resources section.

6.3 Text to Speech Systems

One of the first aspects of natural language to be modelled was the actual articulation of speech sounds. "Early models of 'talking machines' were essentially devices which mechanically simulated the operation of the human vocal tract. More modern attempts to create speech electronically are generally referred to as speech synthesis." [Yule G. The Study of Language; pg 115]

The concept is to take a text and, using advanced TA techniques, tokenise it into the phonemes that make up its individual words, then electronically reproduce the acoustic properties of those phonemes as sounds that can be played. It is a little trickier than this over-simplified explanation suggests, but it demonstrates the idea. There are multiple uses, such as:

• Call centre technology
• Mobile texting to landline phones, where the text message is translated into speech and transferred automatically to the designated landline connection - a working product developed by Loquendo (a subsidiary of Telecom Italia)

6.4 Chatterbots (Understander-Systems)

Things have come a long way since the early days of Eliza in the 1960s and Michael Mauldin of Carnegie Mellon University, who coined the term "chatterbot" in 1994. Basic chatterbots such as Eliza, Alice and Brian³ use a process of pattern recognition to analyse text and create an illusion of understanding. Questions posed by the user, when triggered by occurrences of keywords or phrases, activate a particular type of pre-determined response, usually a question incorporating that phrase or keyword in the answer.

6.5 Anti Plagiarism Tools

Text Analysis and textual understanding are at the heart of most anti-plagiarism tools.
To take an example from LSBU's own efforts in this area, Thomas Lancaster (my old Java lecturer!) has developed a number of tools, one of which is the "Text Analysis Tool (TAT), a system which presents a rolling representation of the stylistic properties of a submission to find areas that are likely to represent extra-corpal plagiarism" [Lancaster 2002]. This is based around syntactic similarities between two target texts. Some tools use this core technique and provide visual representations of the results (VAST), while others simply present the two offending articles in a way that allows the user to decide more easily about the material in front of them (SSS, TRANK). This is an obvious labour-saving tool for teachers and lecturers...and a source of fear and loathing for students the world over!

³ Visit the AAAI website for an extensive listing of chatterbots (see reference section)
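As a purely illustrative sketch of syntactic similarity, and emphatically not the actual algorithm behind TAT, VAST or the other tools named above, two texts can be compared by the word trigrams they share:

```python
# Crude syntactic-similarity measure: the fraction of word trigrams
# the shorter text shares with the other. High overlap between a
# submission and a source text would flag a passage for human review.
def trigrams(text):
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))

original = "the quick brown fox jumps over the lazy dog"
suspect = "the quick brown fox jumps over a sleeping cat"
print(round(similarity(original, suspect), 2))
```

Even this naive measure scores the shared opening phrase highly while ignoring the divergent endings; real tools add stylistic features and present the matched regions visually rather than as a single number.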
7. What the Future Holds (in my opinion)

This is still an emerging and exciting area of Computer Science, with research being undertaken to improve quality in all the areas mentioned in the sections above. There is a real sense that a number of advancements are still possible in automating processes currently undertaken by humans. There is also a business requirement driving research and development: regardless of whether we approve of it or not, we now live in a truly global village, with a need for fast, efficient and above all accurate tools to assist in our international electronic trading environment. Below is a summary of some of the major technological advancements I predict in the near to middle future. Some are logical progressions of the technology currently available from commercial companies; others are more experimental, in that they may require a radical approach to the way we look at the problems before viable solutions can be developed.

• We should start to see more enterprise solutions with Text Analysis at the core. There is a lot of development in speech-to-text and translation, the goal being combined technologies, transparent to the user, with seamless input of foreign speech to English textual output.
• Convergence of NLP and IR technologies, with more efficient methods to search for information in other languages, understanding the inferences of our queries and returning results in an intelligent way.
• Although English has gradually been accepted as the lingua franca of most industries, there is a real need for translation tools, especially those dedicated to textual translation and the associated fields of retrieving that information.
The EU alone dealt with 1,416,817 pages of text translation relying on text analysis in 2003, a rise of 9.4% on 2002 [EU DGT Annual Activity Report 2003], and other large bodies (both governmental and NGOs) are increasing budgets in the area to cope. This will be a major growth area, and technological developments will follow.
• Email dialogue systems that take automatic input from mail servers, carry out a Text Analysis process, interrogate associated applications such as diary programs and address books, and output responses in the form of replies to the original emails. An example would be a meeting planner trying to arrange for everyone in a special interest group (SIG) to attend a quarterly meeting. One email is sent out by the chair suggesting dates; once it is received, each recipient's computer carries out the process described above and returns an email to the chair with their availability. This may continue through a number of iterations before a date suits everyone, but the key point is that the process has been automatic and independent of human intervention.

There are, of course, many more examples; however, these seem logical and achievable given the current level of knowledge and application of technology.
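The availability-gathering loop in that last scenario can be sketched as follows. The diaries, names and date formats are all invented; a real system would extract proposals from email text and consult actual calendar applications:

```python
# Sketch of the automatic meeting planner: the chair proposes dates,
# each member's agent answers with the dates it is free (its diary
# holds the dates it is busy), and the chair keeps the intersection.
def reply_with_availability(proposed, diary):
    return {d for d in proposed if d not in diary}

def plan_meeting(proposed, diaries):
    available = set(proposed)
    for diary in diaries.values():
        available &= reply_with_availability(proposed, diary)
    return sorted(available)

proposed = ["2005-03-01", "2005-03-08", "2005-03-15"]
diaries = {
    "alice": {"2005-03-01"},   # busy on the 1st
    "bob": {"2005-03-08"},     # busy on the 8th
    "carol": set(),            # free throughout
}
print(plan_meeting(proposed, diaries))
```

If the intersection comes back empty, the chair's agent would propose a fresh set of dates and iterate, which is the multi-round negotiation described above.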
8. Conclusions

As users and producers of information, we have created a spiral of ever increasing quantities of stored data. With the amount of directly accessible data available to us today, we have had to find new ways to manage and review this overabundance of textual information. In this paper we have started to explore how a computer can be used to manage this problem by identifying the context and importance of textual input. We've looked at the applications of Text Analysis in NLP and within the field of Information Retrieval, and explored possible convergence between these areas. The key areas and technologies presented above are a mere introduction and do not constitute an in-depth view; by showing how they broadly connect with Text Analysis and providing further resources (see reference section), I hope the reader will research further the areas that directly interest them.

It is obvious to me, from the study carried out for this project, that the fields of text analysis and textual understanding within AI are far from complete. Research and development continues in universities, government and industry, trying to achieve better, more natural responses to the inputs we give machines. There are many approaches to this task; we've briefly looked at rule-based analysis of text and example-based machine learning, as well as hybrids of the two.

At the beginning of this report I posed a question: does Text Analysis constitute actual understanding of the textual input, or simply an electronic approximation of understanding? The answer, in my opinion, is that while the technology has come a long way in the past twenty years, we have not developed, and are nowhere near developing, machines that understand in the way we understand. The question we should really ask is: can we ever truly develop understanding in computers?
I'll conclude with the words of a philosopher rather than a scientist.

"It is a very remarkable fact that there are none so depraved and stupid, without even excepting idiots, that they cannot arrange different words together, forming of them a statement by which they make known their thoughts; while on the other hand, there is no animal, however perfect and fortunately circumstanced it may be, which can do the same..." René Descartes (1637)

...What would he have made of man's attempts to program machines to do the same!
9. References

Eisenberg A. 2003. "Get Me Rewrite!" "Hold On, I'll Pass You to the Computer." The New York Times, December 25, 2003.

Mani I. 2001. Automatic Summarisation. John Benjamins Publishing Company, Amsterdam/Philadelphia.

Minsky M. L. 1968. Semantic Information Processing. MIT Press.

Yule G. 1985. The Study of Language. Cambridge University Press.

9.1 Web References

o Avron Barr
http://www.aaai.org/Library/Magazine/Vol01/01-01/vol01-01.html
Last updated: Spring 1981. Downloaded: 19 November 2004.
A paper entitled "Natural Language Understanding" by Barr of Stanford University. Although published many years ago, it still has a wealth of useful information; especially good is the history of NLP (page 2).
o Duke University
http://www.duke.edu/~mccann/mwb/15semnet.htm
Last updated: unknown. Downloaded: 16 November 2004.

o EAMT
http://www.eamt.org/mt.html
Last updated: 3 June 2004. Downloaded: 20 November 2004.
Home page of the European Association for Machine Translation; a good guide and technical resource in the area of MT.

o European Union Annual Activity Report 2003
http://europa.eu.int/geninfo/query/engine/search/query.pl(AAR 2003 - Report ONLY.doc)
Last updated: 1 April 2004. Downloaded: 21 November 2004.
Report into EU translation activities - very dry but full of useful statistics.

o HALLoGRAM Publishing
http://www.hallogram.com/textanalyst/
Last updated: unknown. Downloaded: 7 November 2004.
A commercial site promoting a product called TextAnalyst, with numerous information and definition pages. Very good as an introduction to the subject.

o Dave Inman
http://www.scism.sbu.ac.uk/inmandw/tutorials/nlp/ambiguity/ambiguity.html
Last updated: 25 February 1997. Downloaded: 21 November 2004.
All pages within this NLP tutorial site are directly relevant to text analysis and the wider fields of IR and NLP. A very good place to start any research in the area.
o Harold Klein
http://www.textanalysis.info/terms.htm
Last updated: 19 May 2002. Downloaded: 30 October 2004.
A really good general site housing a huge range of information on text analysis. An excellent place to start any research into the subject.

o Thomas Lancaster
http://www.radford.edu/~sigcse/DC01/participants/lancaster.html
Last updated: 2002. Downloaded: 20 November 2004.
A former PhD student at LSBU, Thomas developed a number of tools in the area of anti-plagiarism. For more detail on how this links in with TA, either follow the above link or search for his name together with the term "anti plagiarism" on Google.

o Sergei Ananyan, Alexander Kharlamov
http://www.megaputer.com/tech/wp/tm.php3#nav
Last updated: 2004. Downloaded: 7 November 2004.
"Automated Analysis of Natural Language Texts", a white paper offering all-round useful information.

9.1.1 General Web Resources

These links act as an additional resource on the subject of Text Analysis within Information Retrieval. Most were used as general background reading for this coursework; some were quoted above and some were not used at all.

http://www.textanalysis.com/ - VisualText™, part of Text Analysis International, Inc.
• An incorporated company providing products in information extraction and NLP. Very useful for the FAQ section: http://www.textanalysis.com/FAQs/faqs.html and the two main product sections, which give an overview of current technology capabilities: http://www.textanalysis.com/Products/products.html

http://www.intext.de/eindex.html - Social Science Consulting (English language)
• A very good site dedicated to all things TA, with a very good history section and applications of the technology.
http://www.semantic-knowledge.com/tropes.htm - Semantic Knowledge product site
• Another product on the market, this time from Semantic Knowledge. Tropes offers TA plus IR technologies as a joined-up product. Not much in the way of background reading, but there is some useful information on how the engine works: http://www.semantic-knowledge.com/fonction.htm

http://www.megaputer.com/tech/wp/tm.php3
• White paper from Megaputer™ (the company offering the TextAnalyst product). Good background and introduction to the subject, with a very useful section on its history: http://www.megaputer.com/tech/wp/tm.php3#history and new opportunities looked at from the business perspective.

http://www.scism.lsbu.ac.uk/inmandw/tutorials/nlp/index.html
• Paper by Dave Inman on the complexities and possibilities of NLP via computers. Not specifically aimed at my coursework area, but it raises very useful questions and adds context to TA within the wider field of IR. The two key links off this page are "can computers understand language?" and "does the structure of language help NLP?"

http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/0.html
• What appears to be a very useful repository on all things NLP from Carnegie Mellon University. Fairly out of date, but good for general background.

http://www.loquendo.it/en/index.htm
• English homepage of Loquendo, an Italian company providing application tools for speech-to-text and vice versa.

9.1.2 General Book Resources

As well as the links above, there is a good grounding to be found in the core textbook, FOA: A Cognitive Perspective on Search Engine Technology and the WWW (Belew R. K.), and in the secondary textbook, Information Retrieval (Van Rijsbergen C. J.). The second book is available on the web and has a specific section on Automatic Text Analysis: http://www.dcs.gla.ac.uk/Keith/Preface.html