Clustering the output of Apache Nutch
using Apache Spark
Thamme Gowda N. Dr. Chris Mattmann
May 12, 2016. Vancouver, Canada
1
About
● ThammeGowda Narayanaswamy - TG in short - @thammegowda
○ Contributor to Apache Tika and Apache Nutch
○ Now - a grad student @ University of Southern California
○ Past - Technical Co-Founder @ Datoin - http://datoin.com
● Dr. Chris Mattmann @chrismattmann
○ Adj. Prof. and the director of IRDS group
@ University of Southern California, Los Angeles
○ Director @ Apache Software Foundation
○ Chief Architect, NASA JPL
2
Overview
● Problem Statement
● Clustering - a solution
● Structure and Style Similarity
● Shared Near Neighbor Clustering
● Scaling it up using Spark’s Distributed Matrices and
GraphX
● A demo
3
Audience
● Anyone who crawls the web
● Anyone who extracts data from the web
● Anyone who filters web pages
● Anyone who would like to know about -
○ web page structure and style similarity
○ shared near neighbor clustering
4
Problem Statement
● Scraping data from online marketplaces
● Start with homepage → categories → listing pages → actual stuff (detail page)
5
Sample set of web pages
credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
[Figure: sample web pages, each labeled as one of: USELESS; REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS; USEFUL FOR ANALYSIS]
9
Question: How do we solve this?
Answer: Cluster the web pages
10
Why Cluster?
● Separate the interesting web pages?
○ Drop uninteresting/noisy web pages
○ Categorical treatment of clusters
● Extract Structured data using XPath
○ Automated extraction using alignment
11
Goal
● Group web pages that are similar
● Similar in terms of
○ CSS Styles
○ DOM Structure
● Toolkit for experimentation with various thresholds
○ % of similarity in style and/or structure
○ Nice visualizations
12
How do we cluster?
● Based on similarity between pages
● Semantic similarity
○ meaning of the web pages
● Syntactic similarity
○ Web page structure, css styles
● This session focuses on the syntactic aspect
13
Structural similarity
● Web pages are built with HTML
● HTML Doc → DOM tree
● a labeled, ordered tree
● Structural similarity using tree edit distance (TED)
[Figure: example DOM tree with nodes HTML; HEAD, BODY; TITLE, DIV, P]
14
(Minimum) Tree Edit Distance
● Edit distance measure similar to strings, but on
hierarchical data instead of sequences
● Number of editing operations required to transform one
tree into another.
● Three basic editing operations: INSERT, REMOVE and
REPLACE.
● A useful measure to quantify how similar (or dissimilar)
two trees are.
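The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the Zhang-Shasha algorithm the talk uses: it applies the plain rightmost-root recurrence with memoization, which gives the same distance for unit costs but is slower on large trees. The tuple representation, the helper names and the choice to normalize by the combined size of both trees are all assumptions made for this example.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Tuples keep everything hashable for caching.

def size(forest):
    """Count the nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Minimum number of INSERT/REMOVE/REPLACE operations turning forest f into g."""
    if not f:
        return size(g)  # insert every remaining node of g
    if not g:
        return size(f)  # remove every remaining node of f
    (label_f, kids_f), (label_g, kids_g) = f[-1], g[-1]
    return min(
        # REMOVE the rightmost root of f; its children move up into the forest
        1 + forest_distance(f[:-1] + kids_f, g),
        # INSERT the rightmost root of g
        1 + forest_distance(f, g[:-1] + kids_g),
        # match the two roots (REPLACE if the labels differ), then solve the rest
        int(label_f != label_g)
        + forest_distance(kids_f, kids_g)
        + forest_distance(f[:-1], g[:-1]),
    )

def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))

def normalized_distance(t1, t2):
    # Assumption: divide by the combined node count, so the value stays in [0, 1].
    return tree_edit_distance(t1, t2) / (size((t1,)) + size((t2,)))

def leaf(label):
    return (label, ())

# Two toy DOM trees: page_b renames DIV to SPAN and drops P.
page_a = ("html", (("head", (leaf("title"),)), ("body", (leaf("div"), leaf("p")))))
page_b = ("html", (("head", (leaf("title"),)), ("body", (leaf("span"),))))

if __name__ == "__main__":
    print(tree_edit_distance(page_a, page_b))
    print(normalized_distance(page_a, page_b))
```

For the two toy trees, one REPLACE (DIV to SPAN) and one REMOVE (P) are enough, so the distance is 2.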
15
Example: Tree Edit Distance*
● Edit operations
● Normalized distance
* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
16
Style Similarity
● Have you noticed?
○ Similar web pages have similar CSS styles
● XPath: "//*[@class]/@class"
● Simple measure -
○ Jaccard Similarity on CSS class names
○ J(A, B) = |A ∩ B| / |A ∪ B|, where A and B are the sets of CSS class names on two pages
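A rough sketch of the style measure, using only the Python standard library. The talk queries class names with XPath; here `html.parser.HTMLParser` is swapped in so the example runs without extra dependencies. The class `ClassNameCollector`, the helper names, and the convention that two pages without any class names count as identical are assumptions for illustration.

```python
from html.parser import HTMLParser

class ClassNameCollector(HTMLParser):
    """Collects every CSS class name that appears in a class attribute."""

    def __init__(self):
        super().__init__()
        self.class_names = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.class_names.update(value.split())

def css_classes(html):
    collector = ClassNameCollector()
    collector.feed(html)
    collector.close()
    return collector.class_names

def jaccard_similarity(a, b):
    if not a and not b:
        return 1.0  # assumption: two pages with no classes are treated as identical
    return len(a & b) / len(a | b)

page_a = '<div class="item title"><span class="price">1</span></div>'
page_b = '<div class="item"><p class="price note">x</p></div>'

if __name__ == "__main__":
    print(jaccard_similarity(css_classes(page_a), css_classes(page_b)))
```

The two sample pages share `item` and `price` out of four distinct class names, so the similarity is 0.5.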
17
Web pages consist of:
● HTML ✓
● CSS ✓
● JavaScript ×
18
Aggregating the Style and Structure
● StructuralSimilarity: from the normalized tree edit distance
● StyleSimilarity: Jaccard similarity on CSS class names
● Combine on a linear scale
○ Aggregated = k · Structural + (1 - k) · Style
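The linear combination is small enough to write out directly. In this sketch the function name and the restriction of k to the range 0 to 1 are assumptions. Both inputs are expected to be similarities in [0, 1], so a normalized distance would first be converted with `1 - distance`.

```python
def aggregated_similarity(structural, style, k=0.5):
    """Weighted mix of structural and style similarity; k weights the structure."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1")  # assumed valid range
    return k * structural + (1.0 - k) * style

if __name__ == "__main__":
    # k = 1 uses structure only, k = 0 uses style only.
    print(aggregated_similarity(0.8, 0.5, k=0.6))
```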
19
Implementation
20
Implementation
● Read Nutch’s Segments
○ sparkContext.sequenceFile(...)
● Filter web pages
○ Robust content type detection -- Tika
● Structural Similarity
○ HTML to DOM Tree -- NekoHTML
○ Tree Edit Distance -- Zhang-Shasha algorithm
21
Implementation …
● Style Similarity
○ Query CSS class names using XPath
● Similarity Matrix
○ RDD.cartesian() to get n × n cells
○ Spark’s Distributed (Coordinate) Matrix
● Persist the matrix for later experimentation with
multiple thresholds
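To make the shape of the similarity matrix concrete, here is a local, single-machine stand-in written in plain Python. It is not Spark code: `itertools.product` plays the role of the cartesian product, and a list of `(row, column, value)` tuples mimics the coordinate layout. The function names and the toy similarity are assumptions for illustration.

```python
from itertools import product

def similarity_entries(pages, similarity):
    """Return (row, col, value) entries for all n x n page pairs."""
    return [
        (i, j, similarity(a, b))
        for (i, a), (j, b) in product(enumerate(pages), repeat=2)
    ]

def above_threshold(entries, threshold):
    """Keep only the cells at or above a threshold, without recomputing similarities."""
    return [(i, j, v) for i, j, v in entries if v >= threshold]

def toy_similarity(a, b):
    # Hypothetical measure for the demo: Jaccard similarity on sets of class names.
    return len(a & b) / len(a | b)

pages = [{"item", "price"}, {"item", "price", "note"}, {"nav", "footer"}]
entries = similarity_entries(pages, toy_similarity)

if __name__ == "__main__":
    # The same stored entries can be filtered with different thresholds.
    print(above_threshold(entries, 0.5))
    print(above_threshold(entries, 0.9))
```

Computing the entries once and filtering afterwards is what makes it cheap to try several thresholds.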
22
Clustering
● Shared Near Neighbor Clustering
○ Jarvis & Patrick, 1973
● With improvements
○ Graph based Implementation
■ Spark GraphX for the win!
* Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, C-22(11), 1025-1034.
23
What’s good about this algorithm?
● What’s the difficulty with the most popular k-means?
○ Prior knowledge of clusters?
○ Mean/Average of documents in a cluster?
■ Average of DOM Trees?
■ Average of CSS styles?
○ Circular/Spherical/Globular shapes?
● Shared Near Neighbor Clustering
○ Similarity matrix - pluggable similarity measures - generic
○ Thresholds - number of neighbors, percent of match
24
Shared Near Neighbor Algorithm
“If two data points share a threshold number of
neighbors, then they must belong to the same
cluster”
25
Clustering Implementation
● Similarity Matrix to Graph
○ Clusters as nodes, similarity measure as edges
● Check for Similar neighbors
○ Filter on threshold and Merge
■ Immutable! - new graph for next iteration
○ Repeat
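The steps above can be sketched on a single machine as follows. This is a simplified, single-pass illustration in plain Python, not the GraphX implementation from the talk: neighbors are taken to be the pages whose similarity is at or above `sim_threshold`, two neighboring pages are merged when they share at least `min_shared` neighbors, and a union-find structure stands in for the graph merge. All names and both parameters are assumptions.

```python
from itertools import combinations

def snn_clusters(sim, sim_threshold, min_shared):
    """Group points that are neighbors and share at least min_shared neighbors."""
    n = len(sim)
    # Neighbor set of each point; a point counts as its own neighbor.
    neighbors = [
        {j for j in range(n) if j == i or sim[i][j] >= sim_threshold}
        for i in range(n)
    ]

    parent = list(range(n))  # union-find forest, one root per cluster

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        are_neighbors = j in neighbors[i] and i in neighbors[j]
        if are_neighbors and len(neighbors[i] & neighbors[j]) >= min_shared:
            parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy similarity matrix: pages 0-2 resemble each other, as do pages 3-4.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.9, 0.2, 0.1],
    [0.8, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.2, 0.1, 1.0, 0.9],
    [0.2, 0.1, 0.1, 0.9, 1.0],
]

if __name__ == "__main__":
    print(snn_clusters(sim, sim_threshold=0.8, min_shared=2))
```

Raising `min_shared` makes merging stricter: with a value of 3, pages 3 and 4 share only two neighbors and stay in separate clusters.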
26
Shared Near Neighbor Clustering on
Apache Spark GraphX
27
Challenges
● Tree Edit Distance is very expensive
28
What’s ahead on the road?
● Integrate to Apache Nutch
● Auto Extraction
○ Unsupervised learning on structure of pages and scrape
the actual data of the web page
● Faster Tree Edit Distance
○ Maybe with approximation techniques
29
Demo
30
Summary
● Example Scenario
● Similarity measures
● Clustering as a solution
● Demo
31
Acknowledgements
● Dr. Chris Mattmann
○ My mentor
○ Professor, Director at IRDS @ USC - http://irds.usc.edu
○ Director, Apache Software Foundation
● DARPA Memex project
32
Thank You!
● Source Code
● Tutorial
● Follow up
○ Thamme Gowda - @thammegowda
○ Chris Mattmann - @chrismattmann
33

Clustering output of Apache Nutch using Apache Spark

  • 1. Clustering the output of Apache Nutch using Apache Spark Thamme Gowda N. Dr. Chris Mattmann May 12, 2016. Vancouver, Canada
  • 2. About ● ThammeGowda Narayanaswamy - TG in short - @thammegowda ○ Contributor to Apache Tika and Apache Nutch ○ Now - a grad student @ University of Southern California ○ Past - Technical Co-Founder @ Datoin - http://datoin.com ● Dr. Chris Mattmann @chrismattmann ○ Adj. Prof. and the director of IRDS group @ University of Southern California, Los Angeles ○ Director @ Apache Software Foundation ○ Chief Architect, NASA JPL
  • 3. Overview ● Problem Statement ● Clustering - a solution ● Structure and Style Similarity ● Shared Near Neighbor Clustering ● Scaling it up using Spark’s Distributed Matrices and GraphX ● A demo
  • 4. Audience ● Who crawls the web ● Who extracts data from web ● Who filters webpages ● likes to know - ○ web page structure and style similarity ○ shared near neighbor clustering
  • 5. Problem Statement ● Scraping data from online marketplaces ● Start with homepage → categories → listing pages → the actual content (detail page)
  • 6. Sample set of web pages (credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov)
  • 7. Sample set of web pages [Figure: two pages labeled USELESS]
  • 8. Sample set of web pages [Figure: two pages labeled USELESS; three pages labeled REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS]
  • 9. Sample set of web pages [Figure: two pages labeled USELESS; three pages labeled REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS; three pages labeled USEFUL FOR ANALYSIS]
  • 10. Question: How do we solve this? Answer: Cluster the web pages
  • 11. Why Cluster? ● Separate the interesting web pages? ○ Drop uninteresting/noisy web pages ○ Categorical treatment of clusters ● Extract Structured data using XPath ○ Automated extraction using alignment
  • 12. Goal ● Group web pages that are similar ● Similar in terms of ○ CSS Styles ○ DOM Structure ● Toolkit for experimentation with various thresholds ○ % of similarity in style and/or structure ○ Nice visualizations
  • 13. How do we cluster? ● Based on similarity between pages ● Semantic similarity ○ meaning of the web pages ● Syntactic similarity ○ Web page structure, CSS styles ● This session focuses on the syntactic aspect
  • 14. Structural similarity ● Web pages are built with HTML ● HTML Doc → DOM tree ● a labeled ordered tree ● Structural similarity using tree edit distance (TED) [Figure: example DOM tree with nodes HTML, HEAD, BODY, TITLE, DIV, P]
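
As a rough illustration of the HTML Doc → DOM tree step, the sketch below builds a labeled ordered tree from markup using only Python's standard-library `html.parser`. It is not the parser used in the talk. The names `DomTreeBuilder`, `dom_tree`, and the `#document` root label are hypothetical, and messy real-world pages would likely need a more forgiving parser.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomTreeBuilder(HTMLParser):
    """Collects element names into a nested [label, children] structure."""

    def __init__(self):
        super().__init__()
        self.root = ["#document", []]   # hypothetical synthetic root label
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: add the node, do not descend.
        self.stack[-1][1].append([tag, []])

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, tolerating
        # unclosed tags in between; ignore stray closing tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def _freeze(node):
    label, children = node
    return (label, tuple(_freeze(child) for child in children))


def dom_tree(html):
    """Return the page as an immutable (label, children) tree."""
    builder = DomTreeBuilder()
    builder.feed(html)
    builder.close()
    return _freeze(builder.root)


page = "<html><head><title>t</title></head><body><div></div><p>x</p></body></html>"
print(dom_tree(page))
```

Text and attributes are deliberately dropped here, since only element labels and their order matter for a structural comparison.
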
  • 15. (Minimum) Tree Edit Distance ● Edit distance measure similar to the one for strings, but on hierarchical data instead of sequences ● Number of editing operations required to transform one tree into another ● Three basic editing operations: INSERT, REMOVE and REPLACE ● A useful measure to quantify how similar (or dissimilar) two trees are
  • 16. Example: Tree Edit Distance* ● Edit operations ● Normalized distance * Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
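
To make the edit operations concrete, here is a minimal sketch of an ordered tree edit distance with unit costs for INSERT, REMOVE and REPLACE. It uses a plain memoized recursion over forests rather than the optimized Zhang and Shasha algorithm cited above, so it is only suitable for small trees. The helper names and the normalization (dividing by the combined node count) are assumptions, since the slide's own normalization formula did not survive in the transcript.

```python
from functools import lru_cache


def tree(label, *children):
    """Build an immutable (label, children) node."""
    return (label, tuple(children))


def size(forest):
    """Number of nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f:
        return size(g)            # insert everything in g
    if not g:
        return size(f)            # remove everything in f
    (v_label, v_kids), (w_label, w_kids) = f[-1], g[-1]
    # REMOVE the rightmost root of f; its children take its place.
    remove = forest_distance(f[:-1] + v_kids, g) + 1
    # INSERT the rightmost root of g.
    insert = forest_distance(f, g[:-1] + w_kids) + 1
    # Match the two rightmost roots, REPLACING the label if it differs.
    replace = (forest_distance(v_kids, w_kids)
               + forest_distance(f[:-1], g[:-1])
               + (0 if v_label == w_label else 1))
    return min(remove, insert, replace)


def tree_edit_distance(a, b):
    return forest_distance((a,), (b,))


def normalized_distance(a, b):
    # Assumed normalization: the distance can never exceed |a| + |b|.
    return tree_edit_distance(a, b) / (size((a,)) + size((b,)))


page_a = tree("html", tree("head", tree("title")),
              tree("body", tree("div"), tree("p")))
page_b = tree("html", tree("head", tree("title")),
              tree("body", tree("div"), tree("span")))
print(tree_edit_distance(page_a, page_b))   # 1: replace p with span
```
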
  • 17. Style Similarity ● Have you noticed? ○ Similar web pages have similar CSS styles ● XPath: "//*[@class]/@class" ● Simple measure - ○ Jaccard Similarity on CSS class names
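
The style measure can be sketched as follows: collect the CSS class names a page uses, then compare two pages with Jaccard similarity (size of the intersection divided by size of the union). This sketch uses Python's standard-library `html.parser` in place of the XPath query above. Splitting each `class` attribute into individual names and returning 1.0 for two pages without any classes are assumptions, and the function names are hypothetical.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Gathers every CSS class name that appears in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # class="a b" contributes two names (assumed behavior).
                self.classes.update(value.split())


def css_classes(html):
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard_similarity(a, b):
    """|a ∩ b| / |a ∪ b| for two sets; 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


page_1 = '<div class="item price"><p class="title">x</p></div>'
page_2 = '<div class="item"><span class="title big">y</span></div>'
print(jaccard_similarity(css_classes(page_1), css_classes(page_2)))  # 0.5
```
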
  • 18. Web pages consist of: ● HTML ✓ ● CSS ✓ ● JavaScript ×
  • 19. Aggregating the Style and Structure ● StructuralSimilarity: Normalized Tree Edit Distance ● StyleSimilarity: Jaccard Distance ● Combine on a linear scale ○ Aggregated = k · Structural + (1 - k) · Style
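
The linear combination is a weighted average of the two scores. A small sketch, assuming both inputs are already expressed as similarities in [0, 1] (for example 1 minus a normalized distance) and that `k` is chosen by the user; the function name is hypothetical:

```python
def aggregate_similarity(structural, style, k=0.5):
    """Weighted average: k * structural + (1 - k) * style."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    return k * structural + (1.0 - k) * style


# k = 1 uses structure only, k = 0 uses style only.
print(aggregate_similarity(0.8, 0.4, k=1.0))  # 0.8
print(aggregate_similarity(0.8, 0.4, k=0.0))  # 0.4
```
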
  • 21. Implementation ● Read Nutch’s Segments ○ sparkContext.sequenceFile(...) ● Filter web pages ○ Robust content type detection -- Tika ● Structural Similarity ○ HTML to DOM Tree -- NekoHTML ○ Tree Edit Distance -- Zhang and Shasha’s algorithm
  • 22. Implementation … ● Style Similarity ○ Query CSS class names using XPath ● Similarity Matrix ○ RDD cartesian() to get n × n cells ○ Spark’s Distributed (Coordinate) Matrix ● Persist the matrix for later experimentation with multiple thresholds
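
The similarity-matrix step pairs every page with every other page and stores one cell per pair in coordinate form (row, column, value). The sketch below shows the idea on a single machine with `itertools.product` standing in for the distributed Cartesian product; it is not Spark code. The function names, the toy pages (sets of CSS class names), and the threshold value are illustrative assumptions.

```python
from itertools import product


def jaccard(a, b):
    return 1.0 if not a and not b else len(a & b) / len(a | b)


def similarity_entries(pages, similarity):
    """All n x n cells as (row, col, value) triples, in row-major order."""
    indexed = list(enumerate(pages))
    return [(i, j, similarity(a, b))
            for (i, a), (j, b) in product(indexed, indexed)]


def filter_entries(entries, threshold):
    """Keep off-diagonal cells whose similarity reaches the threshold."""
    return [(i, j, s) for i, j, s in entries if i != j and s >= threshold]


pages = [{"item", "price"}, {"item", "price"}, {"nav"}]
entries = similarity_entries(pages, jaccard)    # computed once
for threshold in (0.5, 0.9):                    # reused for several thresholds
    print(threshold, filter_entries(entries, threshold))
```

Computing the entries once and filtering them afterwards mirrors the point about persisting the matrix, since the pairwise comparison is the expensive part.
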
  • 23. Clustering ● Shared Near Neighbor Clustering ○ Jarvis and Patrick, 1973 ● With improvements ○ Graph based Implementation ■ Spark GraphX for the win! * Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, 100(11), 1025-1034.
  • 24. What’s good about this algorithm? ● What’s the difficulty with the most popular k-means? ○ Prior knowledge of the number of clusters? ○ Mean/Average of documents in a cluster? ■ Average of DOM Trees? ■ Average of CSS styles? ○ Circular/Spherical/Globular shapes? ● Shared Near Neighbor Clustering ○ Similarity matrix - pluggable similarity measures - generic ○ Thresholds - numbers, percent of match
  • 25. Shared Near Neighbor Algorithm “If two data points share a threshold number of neighbors, then they must belong to the same cluster”
  • 26. Clustering Implementation ● Similarity Matrix to Graph ○ Clusters as nodes, similarity measure as edges ● Check for Similar neighbors ○ Filter on threshold and Merge ■ Immutable! - new graph for next iteration ○ Repeat
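
The shared near neighbor rule can be sketched on one machine as follows. This is a simplified illustration, not the GraphX implementation from the talk: neighbors are defined by a similarity threshold, any two points sharing at least `min_shared` neighbors are merged, and a union-find structure replaces the iterative graph rebuild. All names and parameter values are assumptions.

```python
from itertools import combinations


def shared_near_neighbor_clusters(sim, sim_threshold, min_shared):
    """Cluster points given a square similarity matrix (list of lists)."""
    n = len(sim)
    # Neighbors of i: every other point at least sim_threshold similar to i.
    neighbors = [
        {j for j in range(n) if j != i and sim[i][j] >= sim_threshold}
        for i in range(n)
    ]

    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    # Merge every pair that shares enough neighbors.
    for i, j in combinations(range(n), 2):
        if len(neighbors[i] & neighbors[j]) >= min_shared:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_i] = root_j

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy data: pages 0-2 look alike, pages 3-5 look alike, page 6 is an outlier.
groups = [0, 0, 0, 1, 1, 1, 2]
sim = [[1.0 if a == b else (0.9 if groups[a] == groups[b] else 0.1)
        for b in range(7)] for a in range(7)]
print(shared_near_neighbor_clusters(sim, sim_threshold=0.5, min_shared=1))
```

Unlike k-means, nothing here needs the number of clusters in advance or an "average" page; only the similarity matrix and the two thresholds are required.
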
  • 27. Shared Near Neighbor Clustering on Apache Spark GraphX
  • 28. Challenges ● Tree Edit Distance is very expensive
  • 29. What’s ahead on the road? ● Integrate into Apache Nutch ● Auto Extraction ○ Unsupervised learning on the structure of pages to scrape the actual data of the web page ● Faster Tree Edit Distance ○ Maybe with approximation techniques
  • 31. Summary ● Example Scenario ● Similarity measures ● Clustering as a solution ● Demo
  • 32. Acknowledgements ● Dr. Chris Mattmann ○ My mentor ○ Professor, Director at IRDS @ USC - http://irds.usc.edu ○ Director, Apache Software Foundation ● DARPA Memex project
  • 33. Thank You! ● Source Code ● Tutorial ● Follow up ○ Thamme Gowda - @thammegowda ○ Chris Mattmann - @chrismattmann