Hw09 Understanding Natural Language
1. News and Blog Analysis
with Lydia
Charles Ward – Stony Brook University
Karthik Balaji, Levon Lloyd – General Sentiment
October 2nd, 2009
2. Outline
Lydia System Overview
News Analysis Examples
Data and Workflow Organization
Data Access Interface
Conclusion
3. Large-Scale News/Blog Analysis
The Lydia news/blog analysis system performs a daily
analysis of more than 1,000 English and foreign-language
online newspapers, plus blogs and other text sources.
We currently track tens of millions of named entities in
the news and blogs, providing spatial, temporal,
relational and sentiment analysis.
Customers track entities of interest using reports
generated in our user interface.
4. Lydia Text Analysis Phases
Lydia performs named entity recognition and analysis over large
text corpora.
Spidering: Lydia spiders and parses thousands of online news
sources. We also handle the feed of social media provided by
Spinn3r.
Named Entity Recognition: Lydia identifies and classifies
occurrences of named entities (people, places, companies,
etc.)
Sentiment Analysis: Lydia assigns sentiment scores to
identified entities using shallow NLP techniques.
Entity Statistics Aggregation: Lydia digests marked-up text
and produces usable entity statistics.
Data Exploration: Aggregated entity statistics are made
available through user interfaces and programming APIs for
detailed exploration of the data.
11. Ethnic Biases in News Coverage
Left: Frequency of coverage of entities with Hispanic names in the
U.S. news, 2004-2008.
Right: Percentage of population self-reporting as Hispanic in the 2000
census. Courtesy of Wikipedia.
12. Ethnic Biases in News Coverage
(a) African
(b) Hispanic
(c) East Asian
(d) Indian
(e) Eastern European
(f) Muslim
13. Juxtaposition Analysis
Top Juxtapositions for Barack Obama
Juxtapositions between Barack Obama and John McCain
14. Outline
Lydia System Overview
News Analysis Examples
Data and Workflow Organization
Data Access Interface
Conclusion
15. Hadoop in Lydia
The legacy Perl NLP pipeline runs in parallel on
Hadoop Streaming, generating articles with marked-up
entities which are stored as compressed XML in
HDFS.
To build or update Lydia entity statistics and indexes
for a single text corpus, over 80 map-reduce jobs are
necessary.
We have developed a custom workflow management
framework in Amazon EC2 to manage our data and
processing.
16. Lydia Workflow Framework
High-level concepts:
A depository is a statistics dataset derived from a
text corpus. It consists of artifacts.
Stored as a directory structure in HDFS
An artifact is a homogeneous dataset of a specific
type.
Examples:
Key-value artifacts, e.g. entity name -> frequency time
series
Lucene index artifacts (entity and article indexes)
Stored as a directory in HDFS containing several map-
reduce job output subdirectories named as date ranges
(we do updates on a daily granularity).
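The depository/artifact layout described above can be pictured with a small data model. This is only an illustrative sketch: the class names, the `/lydia/us_news` root, the `entity_frequency` artifact name, and the `<start>_<end>` directory naming are assumptions made for the example, not details taken from the Lydia code.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class DateRange:
    """Inclusive range of days covered by one map-reduce job output."""
    start: date
    end: date

    def dirname(self) -> str:
        # Hypothetical naming scheme: <start>_<end> in ISO format.
        return f"{self.start.isoformat()}_{self.end.isoformat()}"


@dataclass
class Artifact:
    """A homogeneous dataset of one type, e.g. key-value or Lucene index."""
    name: str
    kind: str
    ranges: List[DateRange] = field(default_factory=list)

    def paths(self, depository_root: str) -> List[str]:
        # One subdirectory per map-reduce job output, named by its date range.
        return [f"{depository_root}/{self.name}/{r.dirname()}" for r in self.ranges]


@dataclass
class Depository:
    """A statistics dataset derived from one text corpus."""
    root: str
    artifacts: Dict[str, Artifact] = field(default_factory=dict)

    def add(self, artifact: Artifact) -> None:
        self.artifacts[artifact.name] = artifact


# Example: one key-value artifact with an initial build and one daily update.
dep = Depository(root="/lydia/us_news")
freq = Artifact(name="entity_frequency", kind="key-value")
freq.ranges.append(DateRange(date(2009, 1, 1), date(2009, 1, 31)))
freq.ranges.append(DateRange(date(2009, 2, 1), date(2009, 2, 1)))
dep.add(freq)
layout = freq.paths(dep.root)
```

Each entry in `layout` corresponds to one job output subdirectory, so a daily update only adds a directory rather than rewriting the artifact.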
19. Job Input Selection
Artifact updates are incrementally propagated through
the dependency graph:
Multiple date ranges (sometimes overlapping) typically
exist for each artifact.
Some small artifacts get fully rebuilt on every update.
20. Depository Build Scheduling
The same tool is used for the initial depository build
and for updating it with new data.
Any set of target artifacts to build can be specified,
similarly to a makefile. Prerequisites of the targets are
automatically identified.
Artifacts are built in the correct order according to
dependencies.
The build process runs as a sequence of Hadoop
map-reduce jobs and occasional serial jobs.
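The makefile-like behavior (name some targets, find their prerequisites, build in dependency order) amounts to a topological sort of the artifact graph. The sketch below shows that idea with a depth-first traversal; the artifact names and the `DEPENDS_ON` table are hypothetical, and the actual scheduler may work differently.

```python
from typing import Dict, List, Set

# Hypothetical dependency graph: artifact -> artifacts it is built from.
DEPENDS_ON: Dict[str, List[str]] = {
    "marked_up_articles": [],
    "entity_frequency": ["marked_up_articles"],
    "entity_sentiment": ["marked_up_articles"],
    "juxtapositions": ["marked_up_articles"],
    "entity_index": ["entity_frequency", "entity_sentiment"],
}


def build_order(targets: List[str], depends_on: Dict[str, List[str]]) -> List[str]:
    """Return the targets plus all their prerequisites, prerequisites first."""
    order: List[str] = []
    done: Set[str] = set()
    in_progress: Set[str] = set()

    def visit(name: str) -> None:
        if name in done:
            return
        if name in in_progress:
            raise ValueError(f"dependency cycle at {name}")
        in_progress.add(name)
        for prereq in depends_on[name]:
            visit(prereq)
        in_progress.remove(name)
        done.add(name)
        order.append(name)  # appended only after all prerequisites

    for target in targets:
        visit(target)
    return order


# Only what "entity_index" needs is scheduled; "juxtapositions" is skipped.
plan = build_order(["entity_index"], DEPENDS_ON)
```

Each name in `plan` would correspond to one map-reduce (or serial) job run in sequence.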
21. Amazon EC2
We run Hadoop on Amazon EC2.
– Quickly scale capacity as requirements change.
10 extra large nodes for weekly data processing.
Amazon S3 is our persistent data store.
All our web services are hosted on dedicated Amazon
nodes.
S3 is not meeting our required level of service
– Moving to EBS
22. Outline
Lydia System Overview
News Analysis Examples
Data and Workflow Organization
Data Access Interface
Conclusion
23. Depository Server
Random access to the Lydia depository, e.g.:
Monthly frequency time series of Barack Obama in all
U.S. sources
Top juxtapositions for Continental Airlines in February
2009
Sentiment time series for Michael Phelps in all U.S.
sources
Uses the mapfiles generated by map-reduce jobs.
Not currently distributed (but we can put
different depositories on different machines).
Provides a caching subsystem to reduce the
number of HDFS accesses.
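The caching idea can be sketched as a small least-recently-used cache placed in front of a slow lookup. This is an assumption about the general approach, not a description of the actual depository server: the class name, the eviction policy, and the stand-in store with made-up values are all hypothetical.

```python
from collections import OrderedDict
from typing import Callable, Dict, Hashable, List


class LRUCache:
    """Keep the most recently used lookups in memory."""

    def __init__(self, fetch: Callable[[Hashable], object], capacity: int) -> None:
        self._fetch = fetch  # slow lookup, e.g. a mapfile read from HDFS
        self._capacity = capacity
        self._entries: "OrderedDict[Hashable, object]" = OrderedDict()

    def get(self, key: Hashable) -> object:
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        value = self._fetch(key)
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return value


# Stand-in for the backing store (values are placeholders, not real data).
FAKE_STORE: Dict[str, List[int]] = {
    "Barack Obama": [3, 5, 8],
    "Michael Phelps": [1, 0, 2],
}
backing_reads: List[str] = []


def read_from_store(key: str) -> List[int]:
    backing_reads.append(key)  # record every access to the slow store
    return FAKE_STORE[key]


cache = LRUCache(read_from_store, capacity=1)
cache.get("Barack Obama")    # miss: reads the store
cache.get("Barack Obama")    # hit: served from memory
cache.get("Michael Phelps")  # miss: evicts the previous entry
cache.get("Barack Obama")    # miss again, because capacity is 1
```

Four requests reach the backing store only three times here; with a realistic capacity, repeated queries for popular entities would rarely touch HDFS.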
24. Artifact Date Range Merging
The depository server combines results from
multiple groups of mapfiles on the fly.
(MR output = date range = mapfile group)
This may result in performance problems and
memory shortage (direct memory buffers).
Solution: limit the number of covering date ranges
to be O(log N) after N daily updates.
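One common way to obtain an O(log N) bound is a binary-counter merge policy: after each daily update, merge the two newest ranges whenever they span the same number of days. The slides do not say how Lydia achieves the bound, so the sketch below is an assumed scheme, with days represented as plain integers for brevity.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (first_day, last_day), days numbered from 0


def add_daily_update(ranges: List[Range], day: int) -> List[Range]:
    """Append a one-day range, then merge equal-length neighbours."""
    ranges = ranges + [(day, day)]
    while len(ranges) >= 2:
        (a_start, a_end), (b_start, b_end) = ranges[-2], ranges[-1]
        if a_end - a_start != b_end - b_start:
            break
        # In a real system this step would be a job merging two mapfile groups.
        ranges = ranges[:-2] + [(a_start, b_end)]
    return ranges


ranges: List[Range] = []
for day in range(100):
    ranges = add_daily_update(ranges, day)
# 100 = 64 + 32 + 4, so three ranges remain: (0, 63), (64, 95), (96, 99)
```

Under this policy the number of ranges after N updates equals the number of 1 bits in N, which is at most log2(N) + 1, so the server never has to combine more than a handful of mapfile groups per query.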
25. Outline
Lydia System Overview
News Analysis Examples
Data and Workflow Organization
Data Access Interface
Conclusion
26. Conclusion
Using Hadoop greatly improved (up to 20x)
Lydia's performance and
scalability.
Lydia w/ Hadoop makes new types of
automated analysis of web-scale content
possible.