1. Text Analytics World, Boston, October 3-4, 2012
Text Analytics on 2 Million Documents: A Case Study
Plus: An Introduction to Keyword Extraction
Alyona Medelyan
2. What are these books about?
“Because he could” by D. Morris, E. McGann
“Still stripping after 25 years” by E. Burns
“Glut” by A. Wright
Only metadata will tell…
3. What this talk will cover:
• Who am I & my relation to the topic
• What types of keyword extraction are out there
• How does keyword extraction work
• How accurate can keywords be
• How to analyze 2 million documents efficiently
4. My Background
@zelandiya
medelyan.com
2005-2009: PhD thesis on keyword extraction, “Human-competitive automatic topic indexing”
Maui: Multi-purpose automatic topic indexing
nzdl.org/kea/ · maui-indexer.googlecode.com
2010: co-organized a keyword extraction competition, SemEval-2, Track 5: “Automatic keyphrase extraction from scientific articles”
2010-2012: leading the R&D of Pingar’s text analytics API
Pingar API features: keyword & named entity extraction, summarization, etc.
5. Findability is ensured with the help of metadata
Document → Metadata
Easy to extract: title, file type & location, creation & modification date, authors, publisher
Difficult to extract: keywords & keyphrases, people & companies mentioned, suppliers & addresses mentioned
6. What can text analytics determine from text?
[Diagram: documents annotated with different kinds of metadata]
• keywords & tags (the focus of this presentation)
• sentiment
• genre
• categories
• taxonomy terms
• entities: names, biochemical entities, patterns, …
7. Types of keyword extraction (or topic indexing)
Controlled indexing (taxonomy terms):
• Subject headings in libraries
• general, with Library of Congress Subject Headings
• domain-specific, in PubMed with MeSH categories
Free indexing (keywords, tags):
• Keyphrases in academic publications
• Tags in folksonomies
• by authors on Technorati
• by users on Del.icio.us
8. Free indexing vs. controlled indexing
Free indexing (e.g. keywords, tags): inconsistent, no control, no semantics, ad hoc
Controlled indexing (e.g. LCSH, ACM, MeSH): restricted, centrally controlled, inflexible, not always available
9. How keyword extraction works
Document → Candidates → Keywords
1. Extract phrases using the sliding window approach, ignoring stopwords:
“NEJM usually has the highest impact factor of the journals of clinical medicine.”
→ NEJM; highest, highest impact, highest impact factor; impact, impact factor; …
Alternative approach:
a) Assign part-of-speech tags
b) Extract valid noun phrases (NPs)
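A minimal sketch of the sliding-window step in Python (illustrative only; the stopword list is a tiny stand-in, and Kea/Maui themselves are written in Java):

```python
import re

STOPWORDS = {"usually", "has", "the", "of"}  # tiny stand-in list

def candidates(text, max_len=3):
    """Slide a window of 1..max_len words over the text, never
    letting a phrase start at, end at, or cross a stopword."""
    tokens = re.findall(r"[A-Za-z][\w-]*", text)
    runs, run = [], []  # split the token stream at stopwords
    for tok in tokens:
        if tok.lower() in STOPWORDS:
            if run:
                runs.append(run)
            run = []
        else:
            run.append(tok)
    if run:
        runs.append(run)
    phrases = []
    for run in runs:
        for i in range(len(run)):
            for n in range(1, max_len + 1):
                if i + n <= len(run):
                    phrases.append(" ".join(run[i:i + n]))
    return phrases

print(candidates("NEJM usually has the highest impact factor "
                 "of the journals of clinical medicine."))
# ['NEJM', 'highest', 'highest impact', 'highest impact factor',
#  'impact', 'impact factor', 'factor', 'journals', 'clinical',
#  'clinical medicine', 'medicine']
```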
10. How keyword extraction works
Document → Candidates → Keywords
2. Normalize phrases (case folding, stemming etc.) and map them to vocabulary terms:
“NEJM usually has the highest impact factor of the journals of clinical medicine.”
Candidate                Normalized            Vocabulary term
NEJM                     nejm                  New England J of Med
highest                  high                  -
highest impact factor    high impact factor    -
impact                   impact                -
impact factor            impact factor         Impact Factor
journals                 journal               Journal
journals of clinical     journal of clinic     -
clinical                 clinic                Clinic
clinical medicine        clinic medic          Medicine
medicine                 medic                 Medicine
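A sketch of the same normalization, using NLTK’s Porter stemmer as a stand-in (the slide’s stemmer is more aggressive, e.g. it reduces “highest” to “high” and “medicine” to “medic”, which Porter does not); the vocabulary mapping here is a hypothetical two-entry dictionary rather than a real thesaurus:

```python
from nltk.stem import PorterStemmer  # pip install nltk

# Hypothetical vocabulary mapping (normalized form -> preferred term);
# in a real system this comes from a thesaurus such as MeSH or Agrovoc.
VOCAB = {
    "nejm": "New England J of Med",
    "impact factor": "Impact Factor",
}

stemmer = PorterStemmer()

def normalize(phrase):
    """Case-fold the phrase and stem each of its words."""
    return " ".join(stemmer.stem(w) for w in phrase.lower().split())

for cand in ["NEJM", "impact factor", "journals", "clinical medicine"]:
    norm = normalize(cand)
    print(cand, "->", norm, "->", VOCAB.get(norm, "-"))
# NEJM -> nejm -> New England J of Med
# impact factor -> impact factor -> Impact Factor
# journals -> journal -> -
# clinical medicine -> clinic medicin -> -
```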
11. How keyword extraction works
Document → Candidates → Properties → Keywords
1. Frequency: number of occurrences (incl. synonyms)
2. Position: beginning/end of a document, title, headers
3. Phrase length: longer means more specific
4. Similarity: semantic relatedness to other candidates
5. Corpus statistics: how prominent in this particular text relative to the corpus
6. Popularity: how often people select this candidate
7. Part of speech pattern: some patterns are more common
…
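A sketch of computing a few of these properties for one candidate (the helper and its inputs are hypothetical, not Maui’s actual feature set):

```python
import math

def features(candidate, text, doc_freq, n_docs):
    """Compute a few of the listed properties for one candidate.
    doc_freq maps phrases to the number of corpus documents that
    contain them; n_docs is the corpus size."""
    cand, low = candidate.lower(), text.lower()
    tf = low.count(cand)                       # 1. frequency (crude substring count)
    first = low.find(cand) / max(len(low), 1)  # 2. position: relative first occurrence
    length = len(cand.split())                 # 3. phrase length
    idf = math.log(n_docs / (1 + doc_freq.get(cand, 0)))
    return {"tf": tf, "first_pos": first,
            "length": length, "tf_idf": tf * idf}  # 5. corpus statistics
```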
12. How keyword extraction works
Document → Candidates → Properties → Scoring → Keywords
Heuristics: a formula that combines the most powerful features
• requires accurate crafting
• performs equally well (or equally less well) across various domains
Supervised machine learning: train a model from manually indexed documents
• requires training data
• performs really well on documents that are similar to the training data, but poorly on dissimilar ones
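Both routes in miniature (the weights below are invented for illustration; they are not the Kea/Maui formula):

```python
def heuristic_score(f):
    """Toy hand-crafted formula: reward phrases that are prominent
    in this corpus, occur early, and are more specific."""
    return f["tf_idf"] * (1.0 - f["first_pos"]) * (1 + 0.3 * (f["length"] - 1))

# Supervised alternative: learn the combination from manually indexed
# documents instead of crafting it (Kea, for instance, trains a
# Naive Bayes model over features like these):
# from sklearn.naive_bayes import GaussianNB
# model = GaussianNB().fit(feature_vectors, is_keyword_labels)
```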
13. How accurate is keyword extraction?
• It’s subjective…
• But: the higher the indexing consistency, the better the search effectiveness (findability)
A – set of keyphrases 1
B – set of keyphrases 2
C – set of keyphrases in common
Consistency (Rolling) = 2C / (A + B)
Consistency (Hooper) = C / (A + B − C)
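Both measures in code, with two small invented keyword sets as input:

```python
def rolling(a, b):
    """Rolling's consistency: 2C / (A + B)."""
    c = len(a & b)
    return 2 * c / (len(a) + len(b))

def hooper(a, b):
    """Hooper's consistency: C / (A + B - C)."""
    c = len(a & b)
    return c / (len(a) + len(b) - c)

a = {"obesity", "nutrition policies", "taxation", "food consumption"}
b = {"obesity", "nutrition policies", "overweight"}
print(rolling(a, b))  # 2*2 / (4+3) ≈ 0.57
print(hooper(a, b))   # 2 / (4+3-2)  = 0.4
```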
14. Professional indexers’ keywords*
[Word cloud of the assigned Agrovoc terms, e.g. nutrition policies, overweight, food consumption, fiscal policies, taxation, urbanization]
* 6 professional FAO indexers assigned terms from the Agrovoc thesaurus to the same document, entitled “The global obesity problem”
15. Comparison of 2 indexers
[The same word cloud, with Agrovoc relations between terms drawn in and the terms chosen by Indexer 1 and Indexer 2 highlighted]
16. Comparison of 6 indexers & Kea
Agrovoc terms: energy public
Agrovoc relation: value nutritional health
disorders regulations
Indexers: weight
reduction nutrient
1 2 3 4 5 6 disease developing
excesses control countries
nutritional
Kea Algorithm: diet requirements
dietary nutrition nutrition developed
guidelines feeding status programs countries
meal habits
patterns nutrition
body weight overweight surveillance
food
nutritional policies price
physiology
formation
price fixing
saturated fat food
overeating intake human nutrition
nutrition policies controlled prices
foods food price
policies
consumption fiscal policies
policies prices
direct
urbanization globalization
taxation
taxes
17. Comparison of CS students* & Maui
* 15 teams of 2 students each assigned keywords to the same document, entitled “A safe, efficient regression test selection technique”
18. Human vs. algorithm consistency
6 professional indexers vs. Kea, on 30 agricultural documents with the Agrovoc thesaurus (consistency, %):
Method          Min   Avg   Max
Professionals   26    39    47
Kea             24    32    38

15 teams of 2 CS students vs. Maui, on 20 CS documents with Wikipedia as the vocabulary (consistency, %):
Method     Min   Avg   Max
Students   21    31    37
Maui       24    32    36

CiteULike taggers vs. Maui, with free indexing (each tagger had ≥ 2 co-taggers; consistency, %):
                          With other taggers   With Maui
330 taggers & 180 docs    19                   24
35 taggers & 140 docs     38                   35
19. Text Analytics on 2 Million Documents: A Case Study
In collaboration with Gene Golovchinsky, fxpal.com/?p=gene
20. The dataset
CiteSeer: 1.7 million scientific publications, 110 GB
For comparison:
• Twitter: 490 million tweets per week, 84 GB
• Wikipedia: 3.6 million articles, 13 GB
• Britannica: 0.65 million articles, 0.3 GB
• ICWSM 2011 (news, blogs, forums, etc.): 2.1 TB (compressed!)
Sources: slideshare.net/raffikrikorian/twitter-by-the-numbers, en.wikipedia.org/wiki/Wikipedia:Size_comparisons
21. The task
1. Extract all phrases that appear in search results
2. Weigh and suggest the best phrases for query refinement
For Gene’s collaborative search system, Querium
22. Step 1: Get time estimates
A. Take a subset, e.g. 100 documents
B. Run it on various machines / settings
C. Extrapolate to the entire dataset, e.g. 1.7M docs
Our example:
• Standard laptop (4 cores, 8 GB RAM): 30 days
• Similar Rackspace VM: 46 days
• With threading: 24 days
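The extrapolation itself is one line of arithmetic; a sketch (the 152-second timing is back-calculated from the slide’s 30-day figure, not a measured number):

```python
def estimate_days(seconds_per_batch, batch_size, total_docs):
    """Extrapolate a timed run on a small batch to the full dataset."""
    total_seconds = seconds_per_batch / batch_size * total_docs
    return total_seconds / (60 * 60 * 24)

# e.g. if the 100-document sample took ~152 s on the laptop:
print(estimate_days(152, 100, 1_700_000))  # ≈ 29.9 days
```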
23. Step 2: Look into your data
Understand the nature of your data: look at samples, compute statistics.
Speed up by removing anomalies & targeting the text analytics.
Our example:
• 30% of docs exceed 50 KB (some ≈600 KB)
• The most important phrases appear in the title, abstract, introduction and conclusions
→ Only process the top 30% and the last 20% of each document
This reduces the time by 57%!
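A sketch of the cropping rule as stated (character-based for brevity; a real implementation would cut on section or paragraph boundaries):

```python
def crop(text, head=0.3, tail=0.2):
    """Keep the beginning and end of a document, where the most
    important phrases (title, abstract, intro, conclusions) tend
    to live. Assumes a non-trivial document length."""
    n = len(text)
    return text[:int(n * head)] + "\n...\n" + text[-int(n * tail):]
```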
24. Validate: Can we crop our documents?
Top N keywords in original doc   How many were found in the cropped doc
10                               91%
50                               80%
100                              75%
All                              64%

Top 20 keywords from the original document*: ontology, knowledge base, knowledge representation, Semantic Web, WordNet, knowledge engineering, predicate logic, artificial intelligence, semantic networks, natural language, first-order logic, ontology engineering, lexicon, conceptual graphs, higher-order logic, natural language processing, design rationale, block diagram

Top 20 keywords from the cropped document: ontology, knowledge base, knowledge engineering, knowledge representation, WordNet, predicate logic, artificial intelligence, ontology engineering, semantic networks, Semantic Web, first-order logic, block diagram, dynamic systems, higher-order logic, conceptual graphs, modeling & simulation, universe of discourse, bond graph, lexicon

* “Toward principles for the design of ontologies used for knowledge sharing”, T. R. Gruber (1993)
25. Step 3: Go cloud
Don’t be afraid to bring out the big guns
• Amazon EC2 Large instance: 1,000 docs × 4 threads = 30 min
• EC2 High-CPU Extra Large instance (8 virtual cores): 1,000 docs × 24 threads = 6 min
Also: increase the number of machines
• 4 machines = 4 times faster, i.e. 50 hours instead of 200 (or one weekend!)
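A sketch of the per-machine threading with concurrent.futures (`extract_keywords` is a placeholder for whatever extractor you call, e.g. the Pingar API; threads help here because the work is dominated by I/O and network calls):

```python
from concurrent.futures import ThreadPoolExecutor

def extract_keywords(doc):
    ...  # placeholder: call your keyword extractor here

def process_all(docs, n_threads=24):
    """Fan the documents out over a thread pool on one machine;
    scaling to several machines is then just sharding the doc list."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(extract_keywords, docs))
```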
26. How long would a human need to extract keywords from 1.7M docs?
Min per doc   Minutes      Hours    Days*    Years**
1             1,700,000    28,333   3,542    14
2             3,400,000    56,666   7,083    28
3             5,100,000    85,000   10,625   42
* assuming 8 hours per working day
** assuming 250 working days per year (no holidays, no sick days)
Photo: flickr.com/photos/mararie/2663711551/
27. Document → Candidates → Properties → Scoring → Keywords
To estimate quality, take a sample and compute inter-indexer consistency between several people.
CiteSeer (1.7 million scientific publications, 110 GB) can be done in a weekend:
1. Get time estimates
2. Look into your data
3. Go cloud
Don’t do it manually!
Keyword extraction: medelyan.com/files/phd2009.pdf
CiteSeer study: pingar.com/technical-blog/
Pingar API: apidemo.pingar.com
Editor’s Notes
• KEA performs better than 8 of the best taggers
• Dev machine: 4-core CPU, 8 GB RAM; Rackspace
• So among the top 10 keywords from the full document, 91% appear in the keywords from the cropped document (basically, 9 out of 10 are the same)