Clustering output of Apache Nutch using Apache Spark
1. Clustering the output of Apache Nutch
using Apache Spark
Thamme Gowda N., Dr. Chris Mattmann
May 12, 2016. Vancouver, Canada
2. About
● ThammeGowda Narayanaswamy - TG in short - @thammegowda
○ Contributor to Apache Tika and Apache Nutch
○ Now - a grad student @ University of Southern California
○ Past - Technical Co-Founder @ Datoin - http://datoin.com
● Dr. Chris Mattmann @chrismattmann
○ Adj. Prof. and the director of IRDS group
@ University of Southern California, Los Angeles
○ Director @ Apache Software Foundation
○ Chief Architect, NASA JPL
3. Overview
● Problem Statement
● Clustering - a solution
● Structure and Style Similarity
● Shared Near Neighbor Clustering
● Scaling it up using Spark’s Distributed Matrices and
GraphX
● A demo
4. Audience
● Anyone who crawls the web
● Anyone who extracts data from the web
● Anyone who filters web pages
● Anyone who would like to know about -
○ web page structure and style similarity
○ shared near neighbor clustering
5. Problem Statement
● Scraping data from online marketplaces
● Start with homepage → categories → listing pages → actual stuff (detail pages)
6. Sample set of web pages
credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
7. Sample set of web pages
[Figure: the sample pages, two of them marked USELESS]
8. Sample set of web pages
[Figure: the sample pages; two marked USELESS, three marked REQUIRED FOR CRAWLER BUT NOT IMPORTANT FOR ANALYSIS]
9. Sample set of web pages
[Figure: the sample pages; two marked USELESS, three marked REQUIRED FOR CRAWLER BUT NOT IMPORTANT FOR ANALYSIS, three marked USEFUL FOR ANALYSIS]
10. Question : How do we solve this?
Answer : Cluster the web pages
11. Why Cluster?
● Separate the interesting web pages
○ Drop uninteresting/noisy web pages
○ Categorical treatment of clusters
● Extract Structured data using XPath
○ Automated extraction using alignment
12. Goal
● Group web pages that are similar
● Similar in terms of
○ CSS Styles
○ DOM Structure
● Toolkit for experimentation with various thresholds
○ % of similarity in style and/or structure
○ Nice visualizations
13. How do we cluster?
● Based on similarity between pages
● Semantic similarity
○ meaning of the web pages
● Syntactic similarity
○ Web page structure, css styles
● This session focuses on the syntactic aspect
14. Structural similarity
● Web pages are built with HTML
● HTML Doc → DOM tree
● The DOM is a labeled, ordered tree
● Structural similarity using tree edit distance (TED)
[Figure: example DOM tree; HTML at the root with children HEAD and BODY, TITLE under HEAD, DIV and P under BODY]
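To make the "HTML Doc → DOM tree" step concrete, here is a minimal sketch in plain Python. It uses the standard library's `html.parser` as a stand-in for a real HTML parser, and the names `DomTreeBuilder`, `parse_dom` and `render` are hypothetical, chosen only for this illustration. Text nodes and attributes are ignored; only tag names (the labels) and child order are kept.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the open-element stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class DomTreeBuilder(HTMLParser):
    """Builds a labeled ordered tree of [tag, children] lists from HTML."""

    def __init__(self):
        super().__init__()
        self.root = ["#document", []]   # synthetic root that holds top-level elements
        self.stack = [self.root]        # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)  # children keep document order
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

def parse_dom(html):
    builder = DomTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def render(node, depth=0):
    """Return the tree as indented lines, one tag per line."""
    lines = ["  " * depth + node[0]]
    for child in node[1]:
        lines.extend(render(child, depth + 1))
    return lines

page = "<html><head><title>t</title></head><body><div>a</div><p>b</p></body></html>"
print("\n".join(render(parse_dom(page))))
```

Real-world pages are often malformed, so a lenient parser is the safer choice in practice; this sketch only shows the shape of the resulting tree.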
15. (Minimum) Tree Edit Distance
● Edit distance measure similar to strings, but on
hierarchical data instead of sequences
● Number of editing operations required to transform one
tree into another.
● Three basic editing operations: INSERT, REMOVE and
REPLACE.
● A useful measure to quantify how similar (or dissimilar)
two trees are.
16. Example: Tree Edit Distance*
● Edit operations
● Normalized distance
* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
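A small sketch may help make the edit operations and the normalization tangible. The code below is not the Zhang-Shasha algorithm; it is the plain memoized recursion over ordered forests with unit costs, which gives the same distance on small trees but is much slower on large ones. Trees are assumed to be nested tuples of the form `(label, (child, ...))`. Dividing by the combined node count is an assumption made for this illustration; the slides do not say which normalization was used.

```python
from functools import lru_cache

def size(tree):
    """Number of nodes in a (label, children) tree."""
    return 1 + sum(size(child) for child in tree[1])

@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Edit distance between two ordered forests (tuples of trees), unit costs."""
    if not f1:
        return sum(size(t) for t in f2)   # INSERT every node of f2
    if not f2:
        return sum(size(t) for t in f1)   # REMOVE every node of f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    return min(
        # REMOVE the rightmost root of f1; its children take its place.
        forest_distance(f1[:-1] + kids1, f2) + 1,
        # INSERT the rightmost root of f2.
        forest_distance(f1, f2[:-1] + kids2) + 1,
        # Match the two rightmost roots; REPLACE costs 1 if the labels differ.
        forest_distance(kids1, kids2)
        + forest_distance(f1[:-1], f2[:-1])
        + (label1 != label2),
    )

def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))

def structural_similarity(t1, t2):
    # Assumed normalization: the distance can never exceed size(t1) + size(t2).
    return 1.0 - tree_edit_distance(t1, t2) / (size(t1) + size(t2))

a = ("html", (("head", (("title", ()),)), ("body", (("div", ()), ("p", ())))))
b = ("html", (("head", (("title", ()),)), ("body", (("div", ()), ("span", ()), ("p", ())))))
c = ("html", (("body", (("p", ()),)),))

print(tree_edit_distance(a, b))  # 1: insert SPAN
print(tree_edit_distance(a, c))  # 3: remove HEAD, TITLE and DIV
```

With this normalization, identical trees score 1.0 and the score drops as more edits are needed relative to the size of the two pages.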
17. Style Similarity
● Have you noticed ?
○ Similar web pages have similar css styles
● XPath : "//*[@class]/@class"
● Simple measure -
○ Jaccard Similarity on CSS class names
○ J(A, B) = |A ∩ B| / |A ∪ B|, where A and B are the sets of CSS class names on the two pages
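The style measure above can be sketched in a few lines of plain Python. Instead of running the XPath query, this illustration collects `class` attributes with the standard library's `html.parser`, which should yield the same set of names for well-formed pages. `ClassNameCollector`, `css_classes` and `jaccard_similarity` are hypothetical names, and treating two class-free pages as identical is an assumption.

```python
from html.parser import HTMLParser

class ClassNameCollector(HTMLParser):
    """Collects every CSS class name used in a page."""

    def __init__(self):
        super().__init__()
        self.class_names = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may list several names separated by whitespace.
                self.class_names.update(value.split())

def css_classes(html):
    collector = ClassNameCollector()
    collector.feed(html)
    collector.close()
    return collector.class_names

def jaccard_similarity(a, b):
    if not a and not b:
        return 1.0  # assumption: two pages without any classes count as identical in style
    return len(a & b) / len(a | b)

page1 = '<div class="listing item"><span class="price">1</span></div>'
page2 = '<div class="listing item"><span class="title">x</span></div>'
print(jaccard_similarity(css_classes(page1), css_classes(page2)))  # 2 shared of 4 total -> 0.5
```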
19. Aggregating the Style and Structure
● StructuralSimilarity : based on the normalized tree edit distance
● StyleSimilarity : Jaccard similarity on CSS class names
● Combine on a linear scale
○ Aggregated = k · StructuralSimilarity + (1 - k) · StyleSimilarity
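As a quick illustration of the linear combination, the sketch below weights the two scores with a single parameter `k`. Restricting `k` to the range 0 to 1 is an assumption that keeps the result between the two inputs; the function name `aggregate_similarity` is hypothetical.

```python
def aggregate_similarity(structural, style, k=0.5):
    """Weighted mix of structural and style similarity; k weights the structure."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1")  # assumed valid range
    return k * structural + (1.0 - k) * style

# k = 1 uses structure only, k = 0 uses style only.
print(aggregate_similarity(0.8, 0.5, k=1.0))  # 0.8
print(aggregate_similarity(0.8, 0.5, k=0.0))  # 0.5
```

Values of `k` in between let you experiment with how much each signal should count for a given site.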
21. Implementation
● Read Nutch’s Segments
○ sparkContext.sequenceFile(...)
● Filter web pages
○ Robust content type detection -- Tika
● Structural Similarity
○ HTML to DOM Tree -- NekoHTML
○ Tree Edit Distance -- Zhang-Shasha algorithm
22. Implementation …
● Style Similarity
○ Query CSS class names using XPath
● Similarity Matrix
○ rdd.cartesian(rdd) to get n × n cells
○ Spark’s Distributed (Coordinate) Matrix
● Persist the matrix for later experimentation with
multiple thresholds
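The similarity-matrix step can be pictured with a small, single-machine sketch. It is plain Python rather than Spark: `itertools.product` plays the role of the cartesian product, and the list of `(row, column, value)` cells mimics the entries of a coordinate matrix. All function names are hypothetical, and Jaccard on class-name sets stands in for whichever similarity measure is plugged in.

```python
from itertools import product

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets are treated as identical."""
    return 1.0 if not (a or b) else len(a & b) / len(a | b)

def similarity_entries(items, similarity):
    """Return (row, column, value) cells for all n x n ordered pairs."""
    indexed = list(enumerate(items))
    return [(i, j, similarity(a, b)) for (i, a), (j, b) in product(indexed, indexed)]

def pairs_above(entries, threshold):
    """Reuse stored entries to list distinct pairs at or above a threshold."""
    return [(i, j) for i, j, value in entries if i < j and value >= threshold]

pages = [{"listing", "item", "price"}, {"listing", "item", "title"}, {"footer"}]
entries = similarity_entries(pages, jaccard)   # computed once, the costly part
print(len(entries))               # 9
print(pairs_above(entries, 0.5))  # [(0, 1)]
print(pairs_above(entries, 0.9))  # []
```

Keeping the entries around is what makes threshold experiments cheap: the pairwise similarities are computed once, and each new threshold is only a filter over the stored cells.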
23. Clustering
● Shared Near Neighbor Clustering
○ Jarvis & Patrick, 1973*
● With improvements
○ Graph based Implementation
■ Spark GraphX for the win!
* Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, C-22(11), 1025-1034.
24. What’s good about this algorithm?
● What’s the difficulty with the most popular k-means?
○ Prior knowledge of the number of clusters?
○ Mean/Average of documents in a cluster?
■ Average of DOM Trees?
■ Average of CSS styles?
○ Circular/Spherical/Globular shapes?
● Shared Near Neighbor Clustering
○ Similarity matrix - pluggable similarity measures - generic
○ Thresholds - absolute numbers or percent of match
25. Shared Near Neighbor Algorithm
“If two data points share a threshold number of
neighbors, then they must belong to the same
cluster”
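The rule quoted above can be sketched on a single machine as follows. This is a simplified illustration, not the GraphX implementation from the talk: it assumes a symmetric similarity matrix, defines a point's neighbors as all points at or above a similarity threshold (plus the point itself), links two neighboring points when they share enough neighbors, and reads clusters off as connected components. The function and parameter names are hypothetical.

```python
def shared_near_neighbor_clusters(sim, sim_threshold, share_threshold):
    """Cluster points given a symmetric similarity matrix (list of lists)."""
    n = len(sim)
    # Neighborhood of i: itself plus every point at least sim_threshold similar.
    neighbors = [{j for j in range(n) if sim[i][j] >= sim_threshold} | {i}
                 for i in range(n)]

    parent = list(range(n))  # union-find forest used to merge linked points

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            shared = len(neighbors[i] & neighbors[j])
            # Link only points that are neighbors and share enough neighbors.
            if j in neighbors[i] and shared >= share_threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

sim = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 1.0],
]
print(shared_near_neighbor_clusters(sim, 0.5, 2))  # [[0, 1, 2], [3, 4]]
print(shared_near_neighbor_clusters(sim, 0.5, 3))  # [[0, 1, 2], [3], [4]]
```

Raising the shared-neighbor threshold splits the weaker pair apart while the tightly connected group stays together, which is the kind of threshold experiment the toolkit is meant to support.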
26. Clustering Implementation
● Similarity Matrix to Graph
○ Clusters as nodes, similarity measure as edges
● Check for Similar neighbors
○ Filter on threshold and Merge
■ Immutable! - new graph for next iteration
○ Repeat
29. What’s ahead on the road?
● Integrate into Apache Nutch
● Auto Extraction
○ Unsupervised learning on structure of pages and scrape
the actual data of the web page
● Faster Tree Edit Distance
○ Maybe with approximation techniques
32. Acknowledgements
● Dr. Chris Mattmann
○ My mentor
○ Professor, Director at IRDS @ USC - http://irds.usc.edu
○ Director, Apache Software Foundation
● DARPA Memex project