Automatic Text Summarization:
A Solid Base
Martijn B. Wieling,
Rijksuniversiteit Groningen
November 25th, 2004
ATS: A Solid Base
Outline
• Why should we bother at all? (a.k.a. Introduction)
• A frequency based ATS [Luhn, 1958]
• An ATS based on multiple features [Edmundson, 1969]
• Automatically combining the features (1) [Kupiec et al, 1995]
• Automatically combining the features (2) [Teufel & Moens, 1997]
• Why should we still bother? (a.k.a. Conclusion)
Why should we bother at all?
• Time saving
• Large scale application possible, e.g.
– ‘Google-xtract’
– Extract translation
• Abstracts will be consistent and objective
And in the beginning there was …
• Hans Peter Luhn (“father of Information Retrieval”):
The Automatic Creation of Literature Abstracts - 1958
Image: Courtesy IBM
Luhn’s method: basic idea
• Target documents: technical literature
• The method is based on the following assumptions:
– Frequency of word occurrence in an article is a useful measurement of word
significance
– Relative position of these significant words within a sentence is a useful
measurement of sentence significance
• Based on limited capabilities of machines (IBM 704) → no semantic
information
IBM 704 - Courtesy IBM
Why word frequency?
• Important words are repeated throughout the text
– examples are given in favor of a certain principle
– arguments are given for a certain principle
– Technical literature → one word: one notion
• Simple and straightforward algorithm → cheap to implement
(processing time is costly)
– Note that different forms of the same word are counted as the same word
When significant?
• Words with too low a frequency are not significant
• Words with too high a frequency are also not significant (e.g. “the”, “and”)
• Removing low-frequency words is easy
– Set a minimum frequency threshold
• Removing common (high-frequency) words:
– Setting a maximum frequency threshold (statistically obtained)
– Comparing to a common-word list
Figure 1 from [Luhn, 1958]
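The filtering step can be sketched in a few lines of Python. This is only an illustration, not Luhn's implementation: the function name `significant_words`, the default threshold `min_freq=2` and the tiny common-word list are assumptions made for the example, and no merging of different word forms is attempted.

```python
from collections import Counter

# Hypothetical, deliberately tiny common-word list (an assumption for this sketch).
COMMON_WORDS = frozenset({"the", "and", "of", "a", "is"})

def significant_words(words, min_freq=2, common_words=COMMON_WORDS):
    """Return words that are frequent enough and not on the common-word list."""
    counts = Counter(w.lower() for w in words)
    return {w for w, c in counts.items()
            if c >= min_freq and w not in common_words}

print(significant_words("the cell and the membrane cell wall".split()))
```

Here "the" is frequent but filtered by the common-word list, while "membrane" and "wall" fall below the minimum frequency.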
Using relative position
• Where the greatest number of high-frequency words are found closest
together → the probability is very high that representative information is
given
• Based on the characteristic that an explanation of a certain idea is
represented by words close together (e.g. sentences – paragraphs
- chapters)
The significance factor
• The “significance factor” of a sentence reflects the number of
occurrences of significant words within a sentence and the linear
distance between them due to non-significant words in between
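• Only consider the portion of the sentence bracketed by significant words
with a maximum of 5 non-significant words in between,
e.g. “(*) - - - [ * - * * - - * - - * ] - - (*)”
• Significance factor formula: (Σ[*])² / |[.]|, i.e. the squared number of
significant words in the bracket divided by the total number of words in the bracket
(5² / 10 = 2.5 in the above example)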
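A minimal sketch of the significance factor, assuming a sentence is already given as a list of flags (True for a significant word). The function name, the flag representation and the example pattern are assumptions for illustration; `max_gap=5` follows the limit stated on this slide. The sentence score is taken to be that of its best-scoring bracket.

```python
def significance_factor(flags, max_gap=5):
    """Best (significant words)^2 / (bracket length) over all brackets.

    flags: list of bools, True where the word is significant.
    A bracket starts and ends with a significant word and never contains
    more than max_gap consecutive non-significant words.
    """
    best = 0.0
    start = last = None   # first / most recent significant word in the bracket
    count = 0             # significant words in the current bracket
    for i, is_significant in enumerate(flags):
        if not is_significant:
            continue
        if last is not None and i - last - 1 > max_gap:
            # Gap too large: close the current bracket and start a new one.
            best = max(best, count ** 2 / (last - start + 1))
            start, count = None, 0
        if start is None:
            start = i
        count += 1
        last = i
    if count:
        best = max(best, count ** 2 / (last - start + 1))
    return best

# Hypothetical sentence: '*' is a significant word, '-' is not.
pattern = "*------*-**--*--*------*"
print(significance_factor([ch == "*" for ch in pattern]))  # 2.5
```

The middle bracket holds 5 significant words in a span of 10 words, giving 25 / 10 = 2.5; the isolated words at either end are more than 5 positions away and form brackets of their own.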
Generating the abstract
• For every sentence the significance factor is calculated
• The sentences with a significance factor higher than a certain cut-off
value are returned (alternatively the N highest-valued sentences can
be returned)
• For large texts, it can also be applied to subdivisions of the text
• No evaluation of the results present in the journal paper!
A new method by Edmundson
• H.P. Edmundson:
New methods in Automatic Extracting - 1969
IBM 7090 - Courtesy IBM
Four methods for weighting
• Weighting methods:
– Cue Method
– Key Method
– Title Method
– Location Method
• The weight of a sentence is a linear combination of the weights
obtained with the above four methods
• The highest-weighted sentences are included in the abstract
• Target documents: technical literature
Cue Method
• Based on the hypothesis that the probable relevance of a sentence
is affected by the presence of pragmatic words (e.g. “Significant”,
“Greatest”, “Impossible”, “Hardly”)
• Three types of Cue words:
– Bonus words: positively affecting the relevance of a sentence (e.g. “Significant”,
“Greatest”)
– Stigma words: negatively affecting the relevance of a sentence (e.g.
“Impossible”, “Hardly”)
– Null words: irrelevant
Obtaining Cue words
• The lists were obtained by statistical analyses of 100 documents:
– Dispersion (λ): number of documents in which the word occurred
– Selection ratio (η): ratio of number of occurrences in extractor-selected
sentences to number of occurrences in all sentences
• Bonus words: η > t_{high,η}
• Stigma words: η < t_{low,η}
• Null words: λ > t_λ and t_{low,η} < η < t_{high,η}
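The three rules can be written out as a small decision function. This is a sketch only: the function name and the threshold values in the usage lines are invented for illustration, since the slides do not give the thresholds that were actually used.

```python
def classify_cue_word(dispersion, selection_ratio, t_high, t_low, t_lambda):
    """Classify a word as 'bonus', 'stigma' or 'null'; None if no rule applies.

    dispersion      -- number of documents in which the word occurred (lambda)
    selection_ratio -- share of occurrences in extractor-selected sentences (eta)
    """
    if selection_ratio > t_high:
        return "bonus"
    if selection_ratio < t_low:
        return "stigma"
    if dispersion > t_lambda and t_low < selection_ratio < t_high:
        return "null"
    return None

# Hypothetical thresholds, chosen only for this example.
thresholds = dict(t_high=0.6, t_low=0.2, t_lambda=30)
print(classify_cue_word(50, 0.8, **thresholds))  # bonus
print(classify_cue_word(50, 0.4, **thresholds))  # null
```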
Resulting Cue lists
• Bonus list (783): comparatives, superlatives, adverbs of conclusion,
value terms, etc.
• Stigma list (73): anaphoric expressions, belittling expressions, etc.
• Null list (139): ordinals, cardinals, the verb “to be”, prepositions,
pronouns, etc.
Cue weight of sentence
• Tag all Bonus words with weight b > 0, all Stigma words with weight
s < 0, all Null words with weight n = 0
• Cue weight of sentence: Σ (Cue weight of each word in sentence)
Key Method
• Principle based on [Luhn], counting the frequency of words.
• Algorithm differs:
– Create key glossary of all non-Cue words in the document which have a
frequency larger than a certain threshold
– Weight of each key word in the key glossary is set to the frequency it occurs in
the document
– Assign key weight to each word which can be found in the key glossary
– If word is not in key glossary, key weight: 0
– No relative position is used ([Luhn])
• Key weight of sentence: Σ (Key weight of each word in sentence)
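The Key Method steps above can be sketched as follows. The function names and the example data are assumptions for illustration; the frequency threshold is left as a parameter because the slides do not state its value.

```python
from collections import Counter

def key_glossary(document_words, cue_words, threshold):
    """Map each non-Cue word occurring more than `threshold` times to its frequency."""
    counts = Counter(document_words)
    return {w: c for w, c in counts.items()
            if c > threshold and w not in cue_words}

def sentence_key_weight(sentence_words, glossary):
    """Sum of key weights; words outside the glossary contribute 0."""
    return sum(glossary.get(w, 0) for w in sentence_words)

# Hypothetical document and Cue list.
doc = ["model", "data", "model", "very", "data", "model", "very"]
glossary = key_glossary(doc, cue_words={"very"}, threshold=1)
print(glossary)                                                         # {'model': 3, 'data': 2}
print(sentence_key_weight(["model", "uses", "data", "very"], glossary))  # 5
```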
Title Method
• Based on the hypothesis that an author conceives title as
circumscribing the subject matter of the document (similarly for
headings vs. paragraphs)
• Create title glossary consisting of all non-Null words in the title,
subtitle and headings of the document
• Words are given a positive title weight if they appear in this glossary
• Title words are given a larger weight than heading words
• Title weight of sentence: Σ (Title weight of each word in sentence)
Location Method
• Based on the hypothesis that:
– Sentences occurring under certain headings are positively relevant
– Topic sentences tend to occur very early or very late in a document and its
paragraphs
• Global idea:
– Give each sentence below its heading the same weight as the heading itself
(note that this is independent of the Title Method): Heading weight
– Give each sentence a certain weight based on its position: Ordinal weight
– Location weight of sentence:
Ordinal weight of sentence + Heading weight of sentence
Location Method: Heading weight
• Compare each word in a heading with the pre-stored Heading
dictionary
• If the word occurs in this dictionary, assign it a weight equal to the
weight it has in the dictionary
• Heading weight of a heading: Σ (heading weight of each word in
heading)
• Heading weight of a sentence = Heading weight of its heading
Creating the Heading dictionary
• The Heading dictionary was created by listing all words in the
headings of 120 documents and calculating the selection ratio for
each word:
– Selection ratio (η): ratio of number of occurrences in extractor-selected
sentences to number of occurrences in all headings
• Deletions from this list were made on the basis of low frequency and
unrelatedness to the desired information types (subject, purpose,
conclusion, etc.)
• Weights were given to the words in the Heading dictionary
proportional to the selection ratio
• The resulting Heading dictionary contained 90 words
Location Method: Ordinal weight
• Sentences of the first paragraph are tagged with weight O1
• Sentences of the last paragraph are tagged with weight O2
• The first sentence of a paragraph is tagged with weight O3
• The last sentence of a paragraph is tagged with weight O4
• Ordinal weight of sentence: O1 + O2 + O3 + O4
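One way to read the sum above is that each of the four weights only counts when its condition holds for the sentence. The sketch below follows that reading, which is an assumption; the function name and the weight values in the example are also invented for illustration.

```python
def ordinal_weight(in_first_paragraph, in_last_paragraph,
                   first_in_paragraph, last_in_paragraph,
                   o1, o2, o3, o4):
    """Add up the weights O1..O4 whose positional condition is true."""
    return (o1 * in_first_paragraph + o2 * in_last_paragraph
            + o3 * first_in_paragraph + o4 * last_in_paragraph)

# Opening sentence of the first paragraph, with hypothetical weights.
print(ordinal_weight(True, False, True, False, o1=3, o2=2, o3=1, o4=1))  # 4
```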
Generating the abstract
• Calculate the weight of a sentence: aC + bK + cT + dL, with a,b,c,d
constant positive integers, C: Cue Weight, K: Key weight, T: Title
weight, L: Location weight
• The values of a, b, c and d were obtained by manually comparing
the generated automatic abstracts with the desired (human made)
abstract
• Return the highest N sentences under their proper headings as the
abstract (including title)
– N is calculated by taking a percentage of the size of the original documents, in
this journal paper 25% is used
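The scoring and selection steps can be sketched as below. It is an illustration under assumptions: the function names are invented, the coefficients default to 1 because the slides do not list the tuned values, and the 25% is taken over the number of sentences. Selected sentences are returned in document order.

```python
def sentence_weight(cue, key, title, location, a=1, b=1, c=1, d=1):
    """Linear combination aC + bK + cT + dL of the four method weights."""
    return a * cue + b * key + c * title + d * location

def extract(sentences, weights, fraction=0.25):
    """Return the highest-weighted `fraction` of sentences in document order."""
    n = max(1, round(len(sentences) * fraction))
    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]

# Hypothetical document of eight sentences with precomputed weights.
sentences = list("ABCDEFGH")
weights = [1, 0, 5, 2, 0, 3, 0, 0]
print(extract(sentences, weights))  # ['C', 'F']
```

Setting `b=0` in `sentence_weight` drops the Key Method from the combination.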
Which combination is best?
• All combinations of C, K, T and L were tried to see which result had (on
average) the most overlap with the handmade extract
• As can be seen in the figure below (only the interesting results are shown),
the Key method was omitted and only C, T and L are used to create the
best abstract
• Surprising result! (Luhn used only keywords to create the abstract)
Figure 4 from [Edmundson, 1969]
Evaluation
• Evaluation was done on unseen data (40 technical documents),
comparison with handmade abstracts
– Result: 44% of the sentences co-selected, 66% similarity between abstracts
(human judge)
– Random ‘abstract’: 25% of the sentences co-selected, 34% similarity between
abstracts
• Another evaluation criterion: ‘extract-worthiness’
– Result: 84% of the selected sentences are extract-worthy
– Therefore: for one document many possible abstracts (differing in length and
content)
Comments
• [Goldstein et al., 1999]:
Not good to base length of abstract on length of document
– Summary length is independent of document length
– The longer the document, the smaller the compression ratio ( |abstract| / |doc.| )
– Better to use constant summary length
• [Rath et al., 1961]
Human selection of sentences in abstracts is very variable
– 6 abstracts of 20 sentences: only 32% overlap between 5 subjects (6: 8%)
– Abstracting the same document twice by the same person with 8 weeks in between:
only 55% overlap (average for 6 subjects)
• Perhaps the Key Method algorithm used here is not that good (Luhn’s
algorithm could be better)
Time and cost of this system
• Speed of extracting: 7800 words/minute
• Cost: $0.015 / word
– Including keypunching costs: $0.01 / word
– Used corpus of 29,500 words → $442.50 total cost
– Adjusted for inflation (CPI, 2003): $2798.00 total cost
A jump in time
• 1969: First man on the moon
• 1972: Watergate scandal
• 1980: John Lennon killed
• 1981: First identification of AIDS & Birth of me
• 1986: Space Shuttle Challenger explodes after launch
• 1989: Fall of Berlin Wall
• 1990: Start Gulf War & Introduction WWW
• 1991: Soviet Union breaks up
• 1992: Formal end of Cold War
• 1993: Creation of European Union (“Verdrag van Maastricht”)
• 1994: Nelson Mandela president of South Africa
1995: Trained summarization
• Julian Kupiec, Jan Pedersen and Francine Chen:
A Trainable Document Summarizer - 1995
Trained weighting
• Edmundson used subjective weighting of the features (Cue, Key,
Title, Location) to create an abstract
• In this journal paper generating the abstract is approached as a
statistical classification problem
– Given a training set of documents with handmade abstracts:
– Develop a classification function that estimates the probability a given sentence
is included in the abstract
• This requires a training corpus of documents with abstracts
• Target documents: technical literature
Features
• Five features were used:
– Sentence Length Cut-off Feature
– Fixed Phrase Feature
– Paragraph Feature
– Thematic Word Feature
– Uppercase Word Feature
• The above features were chosen by experimentation
Sentence Length Cut-off Feature
• Based on the principle that short sentences are often not included in
abstracts
• Given a threshold (e.g. 5 words):
– SLC-value is true for sentences longer than the threshold
– SLC-value is false otherwise
• Note that this feature is not similar to any of the features
Edmundson used
Fixed-Phrase Feature
• Based on the hypothesis that:
– sentences containing any of a list of fixed phrases (mostly 2 words long) are
likely to be in the abstract (e.g. “in conclusion”, “this result” – total: 26 elements)
– Sentences following a heading containing a certain keyword are more likely to be
in the abstract (e.g., “conclusions”, “results”, “summary”)
• FP-value is true for sentences in the above situations, false
otherwise
• Note that this feature is a combination of Edmundson’s Location
Method and Cue Method, though in reduced form
Paragraph Feature
• Each sentence in the first ten and last five paragraphs is tagged
based on its location
– Paragraph-initial
– Paragraph-final (|P| > 1 sentence)
– Paragraph-medial (|P| > 2 sentences)
• Note that this feature is a reduced form of Edmundson’s Location
Method
Thematic Word Feature
• The most frequent words in a document are defined as thematic
words
• A small number of thematic words is selected and each sentence is
scored as a function of frequency of these thematic words
• TW-value is true if it is one of the highest scoring sentences
• TW-value is false otherwise
• Note that this feature is an adapted version of Edmundson’s Key
Method
Uppercase Word Feature
• Based on the hypothesis that proper names are often important, as is
the explanatory text for acronyms (e.g. “… the ISO (International Standards
Organization) …”)
• Count the frequency of each proper name
– Constraint: the uppercase thematic word is not sentence-initial and begins with a capital letter
– The word must occur several times and may not be an abbreviated measurement unit
• Score each sentence based on the number of frequent proper names in each
sentence
– A sentence in which a frequent proper name appears for the first time scores twice as high as
sentences with later occurrences
• UW-value is true if it is one of the highest scoring sentences, false otherwise
• Note that this feature is somewhat similar to Edmundson’s Key Method
Classification
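• For each sentence s the probability P is calculated that it will be
included in the summary S given the k features F_1 … F_k (Bayes’ rule):
$$P(s \in S \mid F_1, \dots, F_k) = \frac{P(F_1, \dots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, \dots, F_k)}$$
• Assuming statistical independence of the features:
$$P(s \in S \mid F_1, \dots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\; P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$$
• $P(s \in S)$ is constant, and $P(F_j \mid s \in S)$ and $P(F_j)$ can be estimated
directly from the training set by counting occurrences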
• This function assigns for each s a score which can be used to select
sentences for inclusion in the abstract
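A minimal sketch of such a scoring function, assuming each sentence is represented by a tuple of discrete feature values and that the probabilities are plain relative frequencies without smoothing. The function names, the data layout and the toy training set are assumptions for illustration, not the setup of the paper; the training set must contain at least one summary sentence.

```python
from collections import Counter

def train(feature_rows, labels):
    """Count feature values overall and among summary sentences.

    feature_rows: list of tuples, one tuple of feature values per sentence
    labels:       list of bools, True if the sentence is in the summary
    """
    k = len(feature_rows[0])
    return {
        "n": len(labels),
        "n_in": sum(labels),
        "overall": [Counter(row[j] for row in feature_rows) for j in range(k)],
        "in_summary": [Counter(row[j] for row, y in zip(feature_rows, labels) if y)
                       for j in range(k)],
    }

def score(model, row):
    """P(s in S) * prod_j P(F_j | s in S) / P(F_j), using relative frequencies."""
    p = model["n_in"] / model["n"]
    for j, value in enumerate(row):
        p_f = model["overall"][j][value] / model["n"]
        if p_f == 0:
            return 0.0  # feature value never seen in training
        p *= (model["in_summary"][j][value] / model["n_in"]) / p_f
    return p

# Toy training set with two boolean features (hypothetical).
rows = [(True, True), (True, False), (False, True), (False, False)]
labels = [True, True, False, False]
model = train(rows, labels)
print(score(model, (True, True)), score(model, (False, True)))
```

Because the features are treated as independent, the score is not guaranteed to be a calibrated probability; it is used only to rank sentences.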
The training material
• 188 documents with professionally created abstracts from the
scientific/technical domain, the average length of the abstracts is 3
sentences (3.5% of the total size of the document)
• Sentences from the abstract were matched to the original document:
– 79% direct sentence matches
– 3% direct joins (2 sentences combined)
– 18% no direct match or join possible
• Therefore the maximum performance of the automatic system is
82%
Evaluation (1)
• Too little material → cross-validation used to evaluate
• Two evaluation measures
– Fraction of manually selected sentences which were reproduced correctly:
average result: 35%
– Fraction of the matchable selected sentences which were reproduced correctly:
average result: 42%
• Performance of features (2nd measure):
Feature           Individual % sentences correct   Cumulative % sentences correct
Paragraph         33                               33
Fixed Phrases     29                               42
Length Cut-off    24                               44
Thematic Word     20                               42
Uppercase Word    20                               42
Evaluation (2)
• Best combination is: Paragraph + Fixed Phrase + Length Cut-off
(44% performance)
• Addition of frequency keyword features results in a slight decrease
of performance (44% → 42%)
– Note that Edmundson in this case also reports a decrease in performance
• In final implementation frequency keyword features are retained in
favor of robustness
• Baseline used in this experiment: Selecting N sentences from the
beginning (Length Cut-off, thus positively biased)
• Full feature set has an improvement of 74% over baseline (24% →
42%)
Evaluation (3)
• If the size of the generated abstract is increased to 25%, the
performance improves to 84%
• Edmundson ‘only’ had a performance of 44%
Comments
• The features used in this paper were chosen by experimentation
– No results/discussions of these experiments are given in the paper, so the reasons
for the choices remain unclear…
• The comparison to Edmundson is not very fair
– Handmade reference abstracts of Edmundson had a size of 25% (here 3.5%)
• Also the comments which were given about [Edmundson] apply here:
– Not good to base length of abstract on length of document
– Human selection of sentences in abstracts is very variable
– Perhaps the Key Method algorithm used here is too simple (Luhn’s algorithm could
be better)
Revisited: [Kupiec et al., 1995]
• Simone Teufel and Marc Moens:
Sentence extraction as a classification task - 1997
Main research questions
• Could Kupiec et al.’s methodology (training a model with a corpus) be
used for another evaluation criterion?
• What was the difference in extracting performance of both
evaluation criteria for different types of documents?
• Note that a different set of features is used here than Kupiec et al. used
Another evaluation method
• Kupiec et al. used the ‘match sentences’ evaluation criterion
• Here the training and test set abstracts are created by the authors
themselves (as opposed to Kupiec et al.)
• Hence fewer alignable sentences are available in the document
– 32% on average vs. 79% in Kupiec et al.
• This does not mean there are fewer ‘extract-worthy’ sentences in the
document → another evaluation method is chosen
• Evaluation: ask human to identify abstract-worthy non-matchable
sentences in the original document
Features
• The features used here are different from Kupiec et al.
– Cue Phrase Method (1670 cue phrases)
– Location Method
– Sentence Length Method
– Thematic Word Method
– Title Method
Cue Phrase Method
• Similar to Edmundson, with some differences:
– A 5-point scale (-1 … +3) is used instead of 3 categories (Bonus, Null, Stigma)
– Cue phrases are used instead of Cue words
– If a phrase was entered into the list, syntactically and semantically similar
phrases were also manually included in the list
– A sentence gets the score of its maximum-scored Cue phrase; if no Cue phrases
are present it gets a score of 0
• The list was manually created by inspecting extracted sentences
– Also based on relative frequency in abstract and relative frequency in document
• Sentences occurring directly after headings like ‘Introduction’ or
‘Conclusion’ are given a prior score of +2 (in Edmundson this is part
of the Location Method)
Location Method
• As in Edmundson, with the exception of the sentences directly after
headings previously mentioned
• Sensitive to certain headings (e.g. “Introduction”); if such headings
cannot be found: only the sentences of the first 7 and last 3
paragraphs are tagged (initial, medial, final)
Sentence Length Method
• As in Kupiec et al.
• The threshold is set to 15 tokens (including punctuation)
Thematic Word Method
• As in Kupiec et al., with a few differences:
• Selecting (non-Cue) words which occur frequently in this document,
but rarely in the overall collection of documents
• For each (non-Cue) word the term-frequency*inverse-document-frequency
value is calculated:
• score(w) = f_loc * log(100 * N / f_glob)
– with N: total number of documents, f_loc: frequency of word w in document,
f_glob: number of documents containing word w
• Top 10 scoring words are defined as thematic words
• Top 40 sentences based on the frequency of thematic words
(normalized by sentence length) are given a TW-value of 1, all others 0
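The word-scoring step can be sketched as follows. The function name, the dictionary of document frequencies and the example counts are assumptions for illustration, and the natural logarithm is used because the slides do not state the base of the log.

```python
import math
from collections import Counter

def thematic_words(doc_words, doc_freq, n_docs, cue_words=frozenset(), top_n=10):
    """Rank non-Cue words by f_loc * log(100 * N / f_glob) and keep the top_n.

    doc_freq maps every word to the number of documents containing it (>= 1).
    """
    counts = Counter(w for w in doc_words if w not in cue_words)
    scores = {w: f_loc * math.log(100 * n_docs / doc_freq[w])
              for w, f_loc in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical document in a collection of 100 documents.
doc_words = ["parser"] * 4 + ["the"] * 5 + ["chart"] * 3
doc_freq = {"parser": 2, "the": 100, "chart": 1}
print(thematic_words(doc_words, doc_freq, n_docs=100, top_n=2))  # ['parser', 'chart']
```

In this example "the" is the most frequent word in the document but occurs in every document of the collection, so it ranks below the two rarer words.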
Title Method
• As in Edmundson, with the difference that:
• The Title score of the sentence is the mean frequency of Title word
occurrences in the sentence (in Edmundson each Title word was
given the same score and the scores were summed)
• Headings are not taken into account here (by experimentation)
• The 18 top-scoring sentences receive a Title-value of 1, the others 0
The experiment
• Training set: a corpus of 124 documents from different areas of
computational linguistics with summaries written by the authors
• A human judge marked additional abstract-worthy sentences in each
document
• 32% alignable sentences in the abstracts
• Two evaluation methods (‘alignable’ and ‘abstract-worthy’) which
were also combined
Summary of results
• Baseline: 28% (obtained in a similar fashion as Kupiec et al.)
• Bad performance of 31.6% for alignability can be explained because there
are fewer alignable sentences to train on
• Short abstracts were generated (2 – 5% of the size of the original document)
• If the abstract size were increased to 25%, performance would increase to:
– ‘Alignability’: 96% (Kupiec et al.: 84%)
– ‘Abstract-worthy’: 98%
– Combined: 97.3%
• Therefore compression makes the difference, not the evaluation criterion
                                  ‘Alignability’   ‘Abstract-worthy’   Combined
Best single feature: Cue Method   23.2%            46.7%               55.2%
All features                      31.6%            57.2%               68.4%
Conclusions of this experiment
• The method proposed by Kupiec et al. of classificatory sentence
selection is not restricted to texts which have high-quality handmade
abstracts
• A higher alignability of the handmade abstract is therefore not
necessary for the purpose of sentence extraction – compression
rate is the factor which influences the result
• However, if more flexible abstracts should be generated, the addition
of other training and evaluation criteria is useful
• Increased training did not improve results, improvement can be
obtained in the extraction methods themselves
Comments
• The features used in this paper were different from Kupiec et al.
– No motivation was given why for instance the Uppercase Word feature was
omitted, and why adapted versions of Edmundson were chosen instead of the
versions Kupiec et al. used
• Also comments which were given about [Edmundson] apply here:
– Not good to base length of abstract on length of document
– Human selection of ‘abstract-worthy’ sentences in abstracts is very variable
Why should we still bother …
• In the discussed methods no attention is given to:
– Cohesion of the abstract: filtering anaphors out of an abstract (e.g. ‘it’, ‘that’)
– Filtering out repetition in the abstract
– The semantics of the document
• Cohesion: an attempt is made by using Lexical Chains
• Repetition: an attempt is made by using Maximum Marginal Relevance
• Semantics: this can still not be done for the general case, but an attempt is
made by using Rhetorical Tree Structures
• Interested in these problems?
• Wicher will explain extraction methods which will address repetition and
semantics problems in his presentation
• Terrence will explain Lexical Chains in his presentation
References
• The Automatic Creation of Literature Abstracts, H.P. Luhn, 1958
• New Methods in Automatic Extracting, H.P. Edmundson, 1969
• A Trainable Document Summarizer, J. Kupiec et al., 1995
• Sentence Extraction as a Classification Task, S. Teufel and M. Moens, 1997
• The Formation of Abstracts by the Selection of Sentences, G.J. Rath et al., 1961
• Constructing Literature Abstracts by Computer: Techniques and Prospects, C.D.
Paice, 1990
• Summarizing Text Documents: Sentence Selection and Evaluation Metrics, Goldstein
et al., 1999
Any questions?
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 

Information retrieval is the process of accessing data resources, usually documents or other unstructured data, for the purpose of sharing knowledge.

  • 1. Automatic Text Summarization: A Solid Base Martijn B. Wieling, Rijksuniversiteit Groningen November, 25th 2004
  • 2. ATS: A Solid Base Outline • Why should we bother at all? (a.k.a. Introduction) • A frequency-based ATS [Luhn, 1958] • An ATS based on multiple features [Edmundson, 1969] • Automatically combining the features (1) [Kupiec et al., 1995] • Automatically combining the features (2) [Teufel & Moens, 1997] • Why should we still bother? (a.k.a. Conclusion)
  • 3. Why should we bother at all? • Time saving • Large scale application possible, e.g. – ‘Google-xtract’ – Extract translation • Abstracts will be consistent and objective
  • 4. And in the beginning there was … • Hans Peter Luhn (“father of Information Retrieval”): The Automatic Creation of Literature Abstracts - 1958 Image: Courtesy IBM
  • 5. Luhn’s method: basic idea • Target documents: technical literature • The method is based on the following assumptions: – Frequency of word occurrence in an article is a useful measurement of word significance – Relative position of these significant words within a sentence is also a useful measurement of sentence significance • Based on limited capabilities of machines (IBM 704) → no semantic information IBM 704 - Courtesy IBM
  • 6. Why word frequency? • Important words are repeated throughout the text – examples are given in favor of a certain principle – arguments are given for a certain principle – Technical literature → one word: one notion • Simple and straightforward algorithm → cheap to implement (processing time is costly) – Note that different forms of the same word are counted as the same word
  • 7. When significant? • Words with too low a frequency are not significant • Words with too high a frequency are also not significant (e.g. “the”, “and”) • Removing low-frequency words is easy – set a minimum frequency threshold • Removing common (high-frequency) words: – Setting a maximum frequency threshold (statistically obtained) – Comparing to a common-word list Figure 1 from [Luhn, 1958]
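The two thresholds and the common-word list described above can be sketched in a few lines of Python. This is only an illustration: the stop list, the threshold values and the function name `significant_words` are assumptions, not taken from [Luhn, 1958], and no merging of different word forms is attempted.

```python
from collections import Counter

# Hypothetical common-word list; a real one would be much longer.
COMMON_WORDS = {"the", "and", "of", "a", "in", "is", "to"}

def significant_words(words, min_freq=2, max_freq=None):
    """Return the words whose frequency lies between the two thresholds
    and that do not appear in the common-word list."""
    counts = Counter(w.lower() for w in words)
    return {
        w for w, c in counts.items()
        if c >= min_freq
        and (max_freq is None or c <= max_freq)
        and w not in COMMON_WORDS
    }

text = "the summary of the text keeps the summary short and the text clear"
print(significant_words(text.split()))
```

Either filter for common words can be used alone: pass `max_freq` to apply the frequency ceiling, or rely on the list only.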
  • 8. Using relative position • Where the greatest number of high-frequency words are found closest together → the probability is very high that representative information is given • Based on the characteristic that an explanation of a certain idea is represented by words close together (e.g. sentences – paragraphs - chapters)
  • 9. The significance factor • The “significance factor” of a sentence reflects the number of occurrences of significant words within a sentence and the linear distance between them due to non-significant words in between • Only consider the portion of the sentence bracketed by significant words with a maximum of 5 non-significant words in between, e.g. “(*) - - - [ * - * * - - * - - * ] - - (*)” • Significance factor formula: (Σ[*])² / |[.]|, i.e. the squared number of significant words in the bracket divided by the total number of words in the bracket (5² / 10 = 2.5 in the above example)
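As a minimal sketch of the significance factor, assuming a sentence has already been reduced to a list of booleans (True for a significant word), the computation could look as follows. The function name, the cluster-splitting rule (a new bracket starts when more than `max_gap` non-significant words intervene) and the choice to return the best bracket are assumptions made for illustration.

```python
def significance_factor(flags, max_gap=5):
    """flags: list of booleans, True where the word is significant.
    Returns the highest (significant words)^2 / (bracket length)
    over all brackets in the sentence."""
    best = 0.0
    start = last = None  # first and last significant index of the current bracket
    count = 0            # significant words in the current bracket
    for i, is_sig in enumerate(flags):
        if not is_sig:
            continue
        if last is not None and i - last - 1 > max_gap:
            # Too many non-significant words in between: close the bracket.
            best = max(best, count ** 2 / (last - start + 1))
            start, count = i, 0
        if start is None:
            start = i
        last = i
        count += 1
    if count:
        best = max(best, count ** 2 / (last - start + 1))
    return best

bracket = [c == "*" for c in "*-**--*--*"]
print(significance_factor(bracket))  # 2.5
```

The bracket holds 5 significant words over a span of 10 words, which gives 25 / 10 = 2.5.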
  • 10. Generating the abstract • For every sentence the significance factor is calculated • The sentences with a significance factor higher than a certain cut-off value are returned (alternatively the N highest-valued sentences can be returned) • For large texts, it can also be applied to subdivisions of the text • No evaluation of the results present in the journal paper!
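Both selection strategies (cut-off value or N highest-valued sentences) can be sketched as below. The function name `select_sentences` and its parameters are hypothetical, and returning the chosen sentences in document order is an assumption about how the extract is presented.

```python
def select_sentences(sentences, scores, n=None, cutoff=0.0):
    """Pick sentences either by rank (n given) or by a cut-off value,
    and return them in their original document order."""
    if n is not None:
        ranked = sorted(range(len(sentences)),
                        key=lambda i: scores[i], reverse=True)[:n]
        keep = set(ranked)
    else:
        keep = {i for i, s in enumerate(scores) if s > cutoff}
    return [sentences[i] for i in sorted(keep)]

sentences = ["a", "b", "c", "d"]
scores = [1.0, 2.5, 0.5, 2.0]
print(select_sentences(sentences, scores, n=2))         # ['b', 'd']
print(select_sentences(sentences, scores, cutoff=0.9))  # ['a', 'b', 'd']
```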
  • 11. A new method by Edmundson • H.P. Edmundson: New methods in Automatic Extracting - 1969 IBM 7090 - Courtesy IBM
  • 12. Four methods for weighting • Weighting methods: – Cue Method – Key Method – Title Method – Location Method • The weight of a sentence is a linear combination of the weights obtained with the above four methods • The highest-weighted sentences are included in the abstract • Target documents: technical literature
  • 13. Cue Method • Based on the hypothesis that the probable relevance of a sentence is affected by the presence of pragmatic words (e.g. “Significant”, “Greatest”, “Impossible”, “Hardly”) • Three types of Cue words: – Bonus words: positively affecting the relevance of a sentence (e.g. “Significant”, “Greatest”) – Stigma words: negatively affecting the relevance of a sentence (e.g. “Impossible”, “Hardly”) – Null words: irrelevant
  • 14. Obtaining Cue words • The lists were obtained by statistical analyses of 100 documents: – Dispersion (λ): number of documents in which the word occurred – Selection ratio (η): ratio of number of occurrences in extractor-selected sentences to number of occurrences in all sentences • Bonus words: η > t_high • Stigma words: η < t_low • Null words: λ > t_λ and t_low < η < t_high
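The threshold rules above can be written down directly. This is a sketch only: the function names and the threshold values in the usage lines are hypothetical, and words that match none of the three rules are simply left unclassified here.

```python
def selection_ratio(count_in_selected, count_in_all):
    """Occurrences in extractor-selected sentences / occurrences in all sentences."""
    return count_in_selected / count_in_all

def classify_cue_word(dispersion, ratio, t_disp, t_low, t_high):
    """Return 'bonus', 'stigma', 'null' or None for one candidate word."""
    if ratio > t_high:
        return "bonus"
    if ratio < t_low:
        return "stigma"
    if dispersion > t_disp and t_low < ratio < t_high:
        return "null"
    return None  # matches none of the three lists

# Illustrative thresholds only.
eta = selection_ratio(count_in_selected=9, count_in_all=10)
print(classify_cue_word(dispersion=40, ratio=eta, t_disp=30, t_low=0.2, t_high=0.6))
```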
  • 15. Resulting Cue lists • Bonus list (783): comparatives, superlatives, adverbs of conclusion, value terms, etc. • Stigma list (73): anaphoric expressions, belittling expressions, etc. • Null list (139): ordinals, cardinals, the verb “to be”, prepositions, pronouns, etc.
  • 16. Cue weight of sentence • Tag all Bonus words with weight b > 0, all Stigma words with weight s < 0, all Null words with weight n = 0 • Cue weight of sentence: Σ (Cue weight of each word in sentence)
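A minimal sketch of the Cue weight, assuming tiny illustrative word lists and the weights b = 1 and s = -1 (the slide only requires b > 0 and s < 0). Null words and unlisted words contribute 0.

```python
def cue_weight(sentence_words, bonus, stigma, b=1, s=-1):
    """Sum the Cue weights of the words in one sentence."""
    total = 0
    for word in sentence_words:
        word = word.lower()
        if word in bonus:
            total += b
        elif word in stigma:
            total += s
        # Null words and all other words add nothing.
    return total

bonus = {"significant", "greatest"}
stigma = {"hardly", "impossible"}
sentence = "this significant result is hardly the greatest".split()
print(cue_weight(sentence, bonus, stigma))  # 1
```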
  • 17. Key Method • Principle based on [Luhn], counting the frequency of words. • Algorithm differs: – Create a key glossary of all non-Cue words in the document which have a frequency larger than a certain threshold – The weight of each key word in the key glossary is set to its frequency in the document – Assign a key weight to each word which can be found in the key glossary – If a word is not in the key glossary, its key weight is 0 – No relative position is used (unlike [Luhn]) • Key weight of sentence: Σ (Key weight of each word in sentence)
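The key glossary and the Key weight could be sketched as follows. The function names, the example document and the threshold are assumptions for illustration; the set passed as `cue_words` stands in for the Cue lists.

```python
from collections import Counter

def key_glossary(doc_words, cue_words, threshold):
    """Map each non-Cue word with frequency above the threshold to its frequency."""
    counts = Counter(w.lower() for w in doc_words if w.lower() not in cue_words)
    return {w: c for w, c in counts.items() if c > threshold}

def key_weight(sentence_words, glossary):
    """Sum the glossary weights of the words in a sentence (0 if absent)."""
    return sum(glossary.get(w.lower(), 0) for w in sentence_words)

doc = "summary text summary text summary is the text".split()
glossary = key_glossary(doc, cue_words={"is", "the"}, threshold=2)
print(glossary)                                      # {'summary': 3, 'text': 3}
print(key_weight(["the", "summary", "text"], glossary))  # 6
```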
  • 18. Title Method • Based on the hypothesis that an author conceives the title as circumscribing the subject matter of the document (similarly for headings vs. paragraphs) • Create a title glossary consisting of all non-Null words in the title, subtitle and headings of the document • Words are given a positive title weight if they appear in this glossary • Title words are given a larger weight than heading words • Title weight of sentence: Σ (Title weight of each word in sentence)
  • 19. Location Method • Based on the hypothesis that: – Sentences occurring under certain headings are positively relevant – Topic sentences tend to occur very early or very late in a document and its paragraphs • Global idea: – Give each sentence below its heading the same weight as the heading itself (note that this is independent of the Title Method): the Heading weight – Give each sentence a certain weight based on its position: the Ordinal weight – Location weight of sentence: Ordinal weight of sentence + Heading weight of sentence
  • 20. Location Method: Heading weight • Compare each word in a heading with the pre-stored Heading dictionary • If the word occurs in this dictionary, assign it a weight equal to the weight it has in the dictionary • Heading weight of a heading: Σ (heading weight of each word in heading) • Heading weight of a sentence = Heading weight of its heading
  • 21. Creating the Heading dictionary • The Heading dictionary was created by listing all words in the headings of 120 documents and calculating the selection ratio for each word: – Selection ratio (η): ratio of number of occurrences in extractor-selected sentences to number of occurrences in all headings • Deletions from this list were made on the basis of low frequency and unrelatedness to the desired information types (subject, purpose, conclusion, etc.) • Weights were given to the words in the Heading dictionary proportional to the selection ratio • The resulting Heading dictionary contained 90 words
  • 22. Location Method: Ordinal weight • Sentences of the first paragraph are tagged with weight O1 • Sentences of the last paragraph are tagged with weight O2 • The first sentence of a paragraph is tagged with weight O3 • The last sentence of a paragraph is tagged with weight O4 • Ordinal weight of sentence: O1 + O2 + O3 + O4
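One way to read the Ordinal weight is that each of O1 to O4 is added only when its condition holds for the sentence. The sketch below follows that reading, which is an assumption; the function name, the zero-based indices and the example weight values are hypothetical as well.

```python
def ordinal_weight(par_index, sent_index, n_pars, n_sents, o1, o2, o3, o4):
    """par_index: paragraph position in the document (0-based),
    sent_index: sentence position in its paragraph (0-based),
    n_pars: paragraphs in the document, n_sents: sentences in the paragraph."""
    weight = 0
    if par_index == 0:
        weight += o1  # sentence lies in the first paragraph
    if par_index == n_pars - 1:
        weight += o2  # sentence lies in the last paragraph
    if sent_index == 0:
        weight += o3  # first sentence of its paragraph
    if sent_index == n_sents - 1:
        weight += o4  # last sentence of its paragraph
    return weight

# First sentence of the first paragraph, with illustrative weights.
print(ordinal_weight(0, 0, n_pars=3, n_sents=4, o1=4, o2=3, o3=2, o4=1))  # 6
```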
  • 23. Generating the abstract • Calculate the weight of a sentence: aC + bK + cT + dL, with a, b, c, d constant positive integers, C: Cue weight, K: Key weight, T: Title weight, L: Location weight • The values of a, b, c and d were obtained by manually comparing the generated automatic abstracts with the desired (human-made) abstract • Return the N highest-weighted sentences under their proper headings as the abstract (including title) – N is calculated by taking a percentage of the size of the original document; in this journal paper 25% is used
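The linear combination and the percentage-based extract length can be sketched as below. The coefficient values, the function names and the rounding rule for N (truncate, but keep at least one sentence) are assumptions; headings and title handling are left out.

```python
def combined_weight(cue, key, title, location, coeffs=(1, 1, 1, 1)):
    """a*C + b*K + c*T + d*L for one sentence."""
    a, b, c, d = coeffs
    return a * cue + b * key + c * title + d * location

def extract(sentences, weights, fraction=0.25):
    """Return the top `fraction` of sentences by weight, in document order."""
    n = max(1, int(len(sentences) * fraction))
    top = sorted(range(len(sentences)),
                 key=lambda i: weights[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]

sentences = [f"s{i}" for i in range(8)]
weights = [1, 5, 2, 0, 7, 3, 1, 2]
print(extract(sentences, weights))  # ['s1', 's4']
```

Setting a coefficient to 0, for example `coeffs=(1, 0, 1, 1)`, drops the corresponding method from the combination.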
  • 24. Which combination is best? • All combinations of C, K, T and L were tried to see which result had (on average) the most overlap with the handmade extract • As can be seen in the figure below (only the interesting results are shown), the Key method was omitted and only C, T and L are used to create the best abstract • Surprising result! (Luhn used only keywords to create the abstract) Figure 4 from [Edmundson, 1969]
  • 25. Evaluation • Evaluation was done on unseen data (40 technical documents), comparison with handmade abstracts – Result: 44% of the sentences co-selected, 66% similarity between abstracts (human judge) – Random ‘abstract’: 25% of the sentences co-selected, 34% similarity between abstracts • Another evaluation criterion: ‘extract-worthiness’ – Result: 84% of the sentences selected are extract-worthy – Therefore: for one document many possible abstracts (differing in length and content)
  • 26. Comments • [Goldstein et al., 1999]: Not good to base length of abstract on length of document – Summary length is independent of document length – The longer the document, the smaller the compression ratio ( |abstract| / |doc.| ) – Better to use constant summary length • [Rath et al., 1961]: Human selection of sentences in abstracts is very variable – 6 abstracts of 20 sentences: only 32% overlap between 5 subjects (6: 8%) – Abstracting the same document 2 times by the same person with 8 weeks in between: only 55% overlap (average for 6 subjects) • Perhaps the Key Method algorithm used here is not that good (Luhn’s algorithm could be better)
  • 27. Time and cost of this system • Speed of extracting: 7800 words/minute • Cost: $0.015 / word – Including keypunching costs: $0.01 / word – Used corpus of 29,500 words → $442.50 total cost – Adjusted for inflation (CPI 2003): $2,798.00 total cost
  • 28. A jump in time • 1969: First man on the moon • 1972: Watergate scandal • 1980: John Lennon killed • 1981: First identification of AIDS & Birth of me • 1986: Space Shuttle Challenger explodes after launch • 1989: Fall of Berlin Wall • 1990: Start Gulf War & Introduction WWW • 1991: Soviet Union breaks up • 1992: Formal end of Cold War • 1993: Creation of European Union (“Verdrag van Maastricht”) • 1994: Nelson Mandela president of South Africa
  • 29. 1995: Trained summarization • Julian Kupiec, Jan Pedersen and Francine Chen: A Trainable Document Summarizer - 1995
  • 30. Trained weighting • Edmundson used subjective weighting of the features (Cue, Key, Title, Location) to create an abstract • In this journal paper generating the abstract is approached as a statistical classification problem – Given a training set of documents with handmade abstracts: – Develop a classification function that estimates the probability a given sentence is included in the abstract • This requires a training corpus of documents with abstracts • Target documents: technical literature
  • 31. Features • Five features were used: – Sentence Length Cut-off Feature – Fixed Phrase Feature – Paragraph Feature – Thematic Word Feature – Uppercase Word Feature • The above features were chosen by experimentation
  • 32. Sentence Length Cut-off Feature • Based on the principle that short sentences are often not included in abstracts • Given a threshold (e.g. 5 words): – SLC-value is true for sentences longer than the threshold – SLC-value is false otherwise • Note that this feature is not similar to any of the features Edmundson used
  • 33. Fixed-Phrase Feature • Based on the hypothesis that: – Sentences containing any of a list of fixed phrases (mostly 2 words long) are likely to be in the abstract (e.g. “in conclusion”, “this result” – total: 26 elements) – Sentences following a heading containing a certain keyword are more likely to be in the abstract (e.g., “conclusions”, “results”, “summary”) • FP-value is true for sentences in the above situations, false otherwise • Note that this feature is a combination of Edmundson’s Location Method and Cue Method, though in reduced form
  • 34. Paragraph Feature • Each sentence in the first ten and last five paragraphs is tagged based on its location – Paragraph-initial – Paragraph-final (|P| > 1 sentence) – Paragraph-medial (|P| > 2 sentences) • Note that this feature is a reduced form of Edmundson’s Location Method
  • 35. Thematic Word Feature • The most frequent words in a document are defined as thematic words • A small number of thematic words is selected and each sentence is scored as a function of frequency of these thematic words • TW-value is true if it is one of the highest scoring sentences • TW-value is false otherwise • Note that this feature is an adapted version of Edmundson’s Key Method
  • 36. Uppercase Word Feature • Based on the hypothesis that proper names are often important, as is the explanatory text for acronyms (e.g. “… the ISO (International Standards Organization) …”) • Count the frequency of each proper name – Constraint: the uppercase thematic word is not sentence-initial and begins with a capital letter – The word must occur several times and may not be an abbreviated measurement unit • Score each sentence based on the number of frequent proper names in each sentence – The score of a sentence in which the frequent proper name appears first is twice as high as later occurrences • UW-value is true if it is one of the highest scoring sentences, false otherwise • Note that this feature is somewhat similar to Edmundson’s Key Method
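  • 37. Classification • For each sentence s the probability P is calculated that it will be included in the summary S given the k features (Bayes’ rule): $P(s \in S \mid F_1, \ldots, F_k) = \frac{P(F_1, \ldots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, \ldots, F_k)}$ • Assuming statistical independence of the features: $P(s \in S \mid F_1, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\, P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$ • $P(s \in S)$ is constant, and $P(F_j \mid s \in S)$ and $P(F_j)$ can be estimated directly from the training set by counting occurrences • This function assigns for each s a score which can be used to select sentences for inclusion in the abstract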
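A minimal sketch of this classifier, assuming every sentence is described by a tuple of discrete feature values and that the probabilities are estimated by plain counting. The function names and the toy training data are hypothetical; no smoothing is applied, so a feature value never seen in a summary sentence gives a score of 0.

```python
from collections import Counter

def train(feature_rows, in_summary):
    """feature_rows: one tuple of feature values per training sentence.
    in_summary: parallel list of booleans (sentence is in the abstract)."""
    k = len(feature_rows[0])
    all_counts = [Counter() for _ in range(k)]
    pos_counts = [Counter() for _ in range(k)]
    for row, label in zip(feature_rows, in_summary):
        for j, value in enumerate(row):
            all_counts[j][value] += 1
            if label:
                pos_counts[j][value] += 1
    n, n_pos = len(feature_rows), sum(in_summary)
    return {"prior": n_pos / n, "n": n, "n_pos": n_pos,
            "all": all_counts, "pos": pos_counts}

def score(model, row):
    """P(s in S) * prod_j P(F_j | s in S) / P(F_j), estimated by counting."""
    p = model["prior"]
    for j, value in enumerate(row):
        p_f = model["all"][j][value] / model["n"]
        if p_f == 0:
            return 0.0  # feature value never seen in training
        p *= (model["pos"][j][value] / model["n_pos"]) / p_f
    return p

rows = [(True, True), (True, False), (False, True), (False, False)]
labels = [True, True, False, False]
model = train(rows, labels)
print(score(model, (True, True)), score(model, (False, True)))
```

Ranking the sentences of a new document by this score and keeping the top ones gives the extract.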
  • 38. The training material • 188 documents with professionally created abstracts from the scientific/technical domain, the average length of the abstracts is 3 sentences (3.5% of the total size of the document) • Sentences from the abstract were matched to the original document: – 79% direct sentence matches – 3% direct joins (2 sentences combined) – 18% no direct match or join possible • Therefore the maximum performance of the automatic system is 82%
  • 39. Evaluation (1) • Too little material → cross-validation used to evaluate • Two evaluation measures – Fraction of manually selected sentences which were reproduced correctly: average result: 35% – Fraction of the matchable selected sentences which were reproduced correctly: average result: 42% • Performance of features (2nd measure), as individual / cumulative % of sentences correct: Paragraph 33 / 33; Fixed Phrases 29 / 42; Length Cut-off 24 / 44; Thematic Word 20 / 42; Uppercase Word 20 / 42
  • 40. Evaluation (2) • Best combination is: Paragraph + Fixed Phrase + Length Cut-off (44% performance) • Addition of frequency keyword features results in a slight decrease of performance (44% → 42%) – Note that Edmundson in this case also reports a decrease in performance • In final implementation frequency keyword features are retained in favor of robustness • Baseline used in this experiment: Selecting N sentences from the beginning (Length Cut-off, thus positively biased) • Full feature set has an improvement of 74% over baseline (24% → 42%)
  • 41. Evaluation (3) • If the size of the generated abstract is increased to 25%, the performance improves to 84% • Edmundson ‘only’ had a performance of 44%
  • 42. Comments • The features used in this paper were chosen by experimentation – No results/discussions of these experiments are given in the paper, so the reasons for the choices remain unclear… • The comparison to Edmundson is not very fair – Handmade reference abstracts of Edmundson had a size of 25% (here 3.5%) • Also the comments which were given about [Edmundson] apply here: – Not good to base length of abstract on length of document – Human selection of sentences in abstracts is very variable – Perhaps the Key Method algorithm used here is too simple (Luhn’s algorithm could be better)
  • 43. Revisited: [Kupiec et al., 1995] • Simone Teufel and Marc Moens: Sentence extraction as a classification task - 1997
  • 44. Main research questions • Could Kupiec et al.’s methodology (training a model with a corpus) be used for another evaluation criterion? • What was the difference in extracting performance of both evaluation criteria for different types of documents? • Note that a different set of features is used here than Kupiec et al. used
  • 45. Another evaluation method • Kupiec et al. used the ‘match sentences’ evaluation criterion • Here the training and test set abstracts are created by the authors themselves (as opposed to the professionally created abstracts in Kupiec et al.) • Hence fewer alignable sentences are available in the document – 32% on average vs. 79% in Kupiec et al. • This does not mean there are fewer ‘extract-worthy’ sentences in the document → another evaluation method is chosen • Evaluation: ask a human to identify abstract-worthy non-matchable sentences in the original document
  • 46. Features • The features used here are different from Kupiec et al. – Cue Phrase Method (1670 cue phrases) – Location Method – Sentence Length Method – Thematic Word Method – Title Method
  • 47. Cue Phrase Method • Similar to Edmundson, with some differences: – A 5-point scale (-1 … +3) is used instead of 3 classes (Bonus, Null, Stigma) – Cue phrases are used instead of Cue words – If a phrase was entered into the list, syntactically and semantically similar phrases were also manually included in the list – A sentence gets the score of its maximum-scored Cue phrase; if no Cue phrases are present it gets a score of 0 • The list was manually created by inspecting extracted sentences – Also based on relative frequency in abstract and relative frequency in document • Sentences occurring directly after headings like ‘Introduction’ or ‘Conclusion’ are given a prior score of +2 (in Edmundson this is part of the Location Method)
  • 48. Location Method • As in Edmundson, with the exception of the sentences directly after headings previously mentioned • Sensitive to certain headings (e.g. “Introduction”); if such headings cannot be found: only the sentences of the first 7 and last 3 paragraphs are tagged (initial, medial, final)
  • 49. Sentence Length Method • As in Kupiec et al. • The threshold is set to 15 tokens (including punctuation)
  • 50. Thematic Word Method • As in Kupiec et al., with a few differences: • Selecting (non-Cue) words which occur frequently in this document, but rarely in the overall collection of documents • For each (non-Cue) word the term-frequency * inverse-document-frequency value is calculated: • score(w) = f_loc * log(100 * N / f_glob) – with N: total number of documents, f_loc: frequency of word w in the document, f_glob: number of documents containing word w • Top 10 scoring words are defined as thematic words • Top 40 sentences based on the frequency of thematic words (normalized by sentence length) are given a TW-value of 1, all others 0
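The scoring formula score(w) = f_loc * log(100 * N / f_glob) can be sketched as follows. The function name, the toy document and the document-frequency table are hypothetical; the natural logarithm is assumed because the slide does not state a base (the ranking does not depend on it).

```python
import math
from collections import Counter

def thematic_words(doc_words, doc_freq, n_docs, cue_words=frozenset(), top=10):
    """Rank the non-Cue words of one document by f_loc * log(100 * N / f_glob).
    doc_freq maps a word to the number of documents that contain it."""
    counts = Counter(w.lower() for w in doc_words if w.lower() not in cue_words)
    scores = {
        w: f_loc * math.log(100 * n_docs / doc_freq[w])
        for w, f_loc in counts.items()
        if doc_freq.get(w, 0) > 0
    }
    return sorted(scores, key=scores.get, reverse=True)[:top]

doc = "parser parser parser the the the the grammar".split()
doc_freq = {"parser": 2, "the": 100, "grammar": 50}
print(thematic_words(doc, doc_freq, n_docs=100, top=1))  # ['parser']
```

Although "the" is the most frequent word in the toy document, it occurs in every document of the collection, so "parser" receives the higher score.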
  • 51. Title Method • As in Edmundson, with the difference that: • The Title score of the sentence is the mean frequency of Title word occurrences in the sentence (in Edmundson each Title word was given the same score and the scores were summed) • Headings are not taken into account here (by experimentation) • The 18 top-scoring sentences receive a Title-value of 1, the others 0
  • 52. The experiment • Training set: a corpus of 124 documents from different areas of computational linguistics with summaries written by the authors • A human judge marked additional abstract-worthy sentences in each document • 32% alignable sentences in the abstracts • Two evaluation methods (‘alignable’ and ‘abstract-worthy’) which were also combined
  • 53. ATS: A Solid Base Summary of results • Baseline: 28% (obtained in a similar fashion as Kupiec et al.) • The poor performance of 31.6% for alignability can be explained by the fact that there are fewer alignable sentences to train on • Short abstracts were generated (2–5% of the size of the original document) • If the abstract size were increased to 25%, performance would increase to: – ‘Alignability’: 96% (Kupiec et al.: 84%) – ‘Abstract-worthy’: 98% – Combined: 97.3% • Therefore compression makes the difference, not the evaluation criterion • Results per evaluation criterion (‘Alignability’ / ‘Abstract-worthy’ / Combined): – Best single feature (Cue Method): 23.2% / 46.7% / 55.2% – All features: 31.6% / 57.2% / 68.4% 0110100
  • 54. ATS: A Solid Base Conclusions of this experiment • The method of classificatory sentence selection proposed by Kupiec et al. is not restricted to texts which have high-quality handmade abstracts • A higher alignability of the handmade abstract is therefore not necessary for the purpose of sentence extraction – compression rate is the factor which influences the result • However, if more flexible abstracts are to be generated, adding other training and evaluation criteria is useful • Increased training did not improve results; improvement can be obtained in the extraction methods themselves 0110101
  • 55. ATS: A Solid Base Comments • The features used in this paper were different from those of Kupiec et al. – No motivation was given for omitting, for instance, the Uppercase Word feature, or for choosing adapted versions of Edmundson’s methods instead of the versions Kupiec et al. used • The comments given about [Edmundson] also apply here: – It is not good to base the length of the abstract on the length of the document – Human selection of ‘abstract-worthy’ sentences in abstracts is very variable 0110110
  • 56. ATS: A Solid Base Why should we still bother … • In the discussed methods no attention is given to: – Cohesion of the abstract: filtering anaphors out of an abstract (e.g. ‘it’, ‘that’) – Filtering out repetition in the abstract – The semantics of the document • Cohesion: an attempt is made by using Lexical Chains • Repetition: an attempt is made by using Maximum Marginal Relevance • Semantics: this still cannot be done for the general case, but an attempt is made by using Rhetorical Tree Structures • Interested in these problems? • Wicher will explain extraction methods which address the repetition and semantics problems in his presentation • Terrence will explain Lexical Chains in his presentation 0110111
  • 57. ATS: A Solid Base References • The Automatic Creation of Literature Abstracts, H.P. Luhn, 1958 • New Methods in Automatic Extracting, H.P. Edmundson, 1969 • A Trainable Document Summarizer, J. Kupiec et al., 1995 • Sentence Extraction as a Classification Task, S. Teufel and M. Moens, 1997 • The Formation of Abstracts by the Selection of Sentences, G.J. Rath et al., 1961 • Constructing Literature Abstracts by Computer: Techniques and Prospects, C.D. Paice, 1990 • Summarizing Text Documents: Sentence Selection and Evaluation Metrics, Goldstein et al., 1999 0111000
  • 58. ATS: A Solid Base Any questions? 0111001