2. Platform overview
A single unified platform for all content types (consolidation
reduces development and maintenance costs)
Flexible system that can support any new content type
High automation (cuts configuration costs)
Real-time coverage, or as close to it as possible, for each
content type
Improved data quality through validation rules
Implemented this year
January 1, 2013 Onlineextrems.com
3. Supporting all the content types
Message boards
Blogs and microblogs (MySpace, Blogger, LiveJournal...)
Blog comments
Social networks – Facebook, LinkedIn, Xing
Author profiles
Product reviews
Usenet – mailing lists, groups
Traditional media – CNN, Reuters
4. Consolidating the content systems
Data mining systems
Message boards
Blogs
Social networking sites
Author profiles system
Usenet + Newsgroups system
5. Some of our challenges
Dynamic nature of the web
Supporting many different types of content
Automatically “understanding” millions of sites with different structures
Over 8,000 message boards
Over 95 million blogs
Supporting data in different languages
Data quality
6. Data mining process
What are the important aspects of data mining?
Managing the order in which we crawl pages
Efficiency (e.g. not re-entering posts whose number of comments hasn’t
changed)
Pagination (we need to follow "next page" links to get more comments)
Extracting the relevant data out of everything on the page
Separating the data into posts (or comments)
Transforming specific data into the desired format
Handling dates in differing formats
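
The last point, handling dates in differing formats, can be illustrated with a minimal sketch. It assumes a hypothetical list of formats (KNOWN_FORMATS) and a hypothetical helper name (normalize_date); neither comes from the platform itself, and a real deployment would need a per-source list and time zone handling.

```python
from datetime import datetime

# Hypothetical list of formats seen across sources; illustrative only.
KNOWN_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",      # 2013-01-01T14:30:00
    "%a, %d %b %Y %H:%M:%S",  # Tue, 01 Jan 2013 14:30:00
    "%d/%m/%Y %H:%M",         # 01/02/2013 09:15 (day first)
    "%B %d, %Y",              # January 1, 2013
]


def normalize_date(raw):
    """Try each known format in order and return an ISO-style string.

    Returns None when no format matches, so the caller can flag the
    item for a data-quality check instead of storing a wrong date.
    """
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match; try the next one
        return parsed.strftime("%Y-%m-%dT%H:%M:%S")
    return None
```

Trying formats in a fixed order keeps the behaviour predictable; ambiguous inputs such as 01/02/2013 are resolved by whichever format is listed first, which is why the list would have to be configured per source.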
7. Data mining technologies
Jelly – simple XML workflow engine
HttpClient – fetcher
Rome – feed parser
Velocity – output template engine
JMX + JConsole – managing the system
8. Flows
Built from steps, which are the building blocks
Allow adding support for new content types without
writing code
The implementation is based on Apache Jelly, which allows
executing XML files as scripts
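
The idea of a flow built from steps and described in XML can be sketched as follows. This is not Jelly syntax and not the platform's actual step set: the tag names (set, upper, append), the STEPS registry and run_flow are all hypothetical, chosen only to show how a new flow can be assembled from existing steps without writing new code.

```python
import xml.etree.ElementTree as ET


# Hypothetical steps: each one receives the shared context and the
# attributes of its XML element.
def step_set(context, attrs):
    context[attrs["var"]] = attrs["value"]


def step_upper(context, attrs):
    context[attrs["var"]] = context[attrs["var"]].upper()


def step_append(context, attrs):
    context.setdefault("output", []).append(context[attrs["var"]])


# Registry mapping XML tag names to step functions.
STEPS = {"set": step_set, "upper": step_upper, "append": step_append}


def run_flow(flow_xml):
    """Run each child element of the flow, in document order."""
    context = {}
    for element in ET.fromstring(flow_xml.strip()):
        step = STEPS.get(element.tag)
        if step is None:
            raise ValueError("unknown step: " + element.tag)
        step(context, element.attrib)
    return context


FLOW = """
<flow>
  <set var="title" value="hello world"/>
  <upper var="title"/>
  <append var="title"/>
</flow>
"""
```

Calling run_flow(FLOW) executes the three steps in order; supporting a new content type would then mean writing a new XML file rather than new step code.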
9. XML parser
Parses the data from simple XML files into the
common in-memory “items” structure
Currently supports elements only, not attributes
Used for Twitter
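
A minimal sketch of this kind of parser is shown below. It assumes the "items" structure is a list of dictionaries, one per item, which the slides do not specify; parse_items and the sample document are hypothetical. As on the slide, only child elements are read and attributes are ignored.

```python
import xml.etree.ElementTree as ET


def parse_items(xml_text, item_tag):
    """Turn every <item_tag> element into a dict of child tag -> text.

    Attributes are deliberately ignored, mirroring the
    elements-only limitation described above.
    """
    root = ET.fromstring(xml_text)
    items = []
    for node in root.iter(item_tag):
        item = {}
        for child in node:
            item[child.tag] = (child.text or "").strip()
        items.append(item)
    return items


# Hypothetical sample input; not an actual feed format.
SAMPLE = """<statuses>
  <status id="1"><text>first post</text><author>alice</author></status>
  <status id="2"><text>second post</text><author>bob</author></status>
</statuses>"""
```

Here parse_items(SAMPLE, "status") yields two items with the keys text and author; the id attributes do not appear in the result.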
10. HTML parser
Applies XSLT transformations to HTML pages
Extracts the data into the common in-memory “items”
structure
Uses the “TagSoup” library to read HTML as if it were XML
Faster and more robust than the current XML conversion
method
Used for Author Profiles
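
The sketch below illustrates only the general idea of reading malformed HTML and extracting fields into an item. It swaps in Python's standard html.parser for TagSoup and a class-name lookup for the XSLT transformation, so it is a stand-in rather than the method described above; FieldExtractor, extract_profile and the class names are hypothetical.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects the text of elements whose class names a wanted field."""

    def __init__(self, fields):
        super().__init__()
        self.fields = set(fields)
        self.current = None  # field currently being captured, if any
        self.item = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        # Any new start tag ends the previous capture, so unclosed
        # tags in sloppy HTML do not leak text into the wrong field.
        self.current = css_class if css_class in self.fields else None

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        text = data.strip()
        if self.current and text:
            previous = self.item.get(self.current, "")
            self.item[self.current] = (previous + " " + text).strip()


def extract_profile(html, fields):
    """Return a dict of field name -> text found in the HTML."""
    parser = FieldExtractor(fields)
    parser.feed(html)
    parser.close()
    return parser.item


# Deliberately malformed: neither <span> is closed.
SLOPPY = '<div><span class="name">Alice<span class="location">Berlin</div><p>ignored'
```

This simplification loses text nested inside further tags within a field; an XSLT-based approach over a repaired document tree does not have that limitation.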
11. XML Output
Output is written to XML files
Configurable output format using a template file
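
Template-driven XML output can be sketched as below. Python's string.Template stands in for Velocity here, and the template is held in a string rather than loaded from a file; ITEM_TEMPLATE, render_items and the element names are hypothetical and only illustrate how the output format can change without touching the code.

```python
from string import Template
from xml.sax.saxutils import escape

# Hypothetical per-item template; in practice this would be read
# from a configurable template file.
ITEM_TEMPLATE = Template(
    "  <post>\n"
    "    <author>$author</author>\n"
    "    <body>$body</body>\n"
    "  </post>"
)


def render_items(items, template=ITEM_TEMPLATE, root="posts"):
    """Render each item through the template inside a root element."""
    rendered = []
    for item in items:
        # Escape &, < and > so item text cannot break the XML.
        safe = {key: escape(str(value)) for key, value in item.items()}
        rendered.append(template.substitute(safe))
    return "<%s>\n%s\n</%s>" % (root, "\n".join(rendered), root)
```

Changing the output format then means editing the template text only; render_items stays the same.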