SlideShare uma empresa Scribd logo
1 de 33
Baixar para ler offline
Clustering the output of Apache Nutch
using Apache Spark
Thamme Gowda N. Dr. Chris Mattmann
May 12, 2016. Vancouver, Canada
1
About
● ThammeGowda Narayanaswamy - TG in short - @thammegowda
○ Contributor to Apache Tika and Apache Nutch
○ Now - a grad student @ University of Southern California
○ Past - Technical Co-Founder @ Datoin - http://datoin.com
● Dr. Chris Mattmann @chrismattmann
○ Adj. Prof. and the director of IRDS group
@ University of Southern California, Los Angeles
○ Director @ Apache Software Foundation
○ Chief Architect, NASA JPL
2
Overview
● Problem Statement
● Clustering - a solution
● Structure and Style Similarity
● Shared Near Neighbor Clustering
● Scaling it up using Spark’s Distributed Matrices and
GraphX
● A demo
3
Audience
● Who crawls the web
● Who extracts data from web
● Who filters webpages
● likes to know -
○ web page structure and style similarity
○ shared near neighbor clustering
4
Problem Statement
● Scraping data from online marketplaces
● Start with homepage → categories
→listing pages → Actual stuff (Detail page)
●
5
Sample set of web pages
credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
6
Sample set of web pages
credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
USELESS
USELESS
7
Sample set of web pages
credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
USELESS
USELESS
REQUIRED FOR
CRAWLER,
BUT
NOT IMPORTANT
FOR ANALYSIS
REQUIRED FOR
CRAWLER,
BUT
NOT IMPORTANT
FOR ANALYSIS
REQUIRED FOR
CRAWLER,
BUT
NOT IMPORTANT
FOR ANALYSIS
8
Sample set of web pages
credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
USELESS
USELESS
REQUIRED FOR
CRAWLER,
BUT
NOT IMPORTANT
FOR ANALYSIS
REQUIRED FOR
CRAWLER,
BUT
NOT IMPORTANT
FOR ANALYSIS
REQUIRED FOR
CRAWLER,
BUT
NOT IMPORTANT
FOR ANALYSIS
USEFUL FOR
ANALYSIS
USEFUL FOR
ANALYSIS
USEFUL FOR
ANALYSIS
9
Question : How do we solve this?
Answer : Cluster the web pages
10
Why Cluster?
● Separate the interesting web pages?
○ Drop uninteresting/noisy web pages
○ Categorical treatment of clusters
● Extract Structured data using XPath
○ Automated extraction using alignment
11
Goal
● Group web pages that are similar
● Similar in terms of
○ CSS Styles
○ DOM Structure
● Toolkit for experimentation with various thresholds
○ % of similarity in style and/or structure
○ Nice visualizations
12
How do we cluster?
● Based on similarity between pages
● Semantic similarity
○ meaning of the web pages
● Syntactic similarity
○ Web page structure, css styles
● This session has focus on syntactic aspect
13
Structural similarity
● Web pages are built with HTML
● HTML Doc → DOM tree
● a labeled ordered tree
● Structural similarity using tree
edit distance(TED)
HTML
HEAD BODY
TITLE DIV P
14
(Minimum) Tree Edit Distance
● Edit distance measure similar to strings, but on
hierarchical data instead of sequences
● Number of editing operations required to transform one
tree into another.
● Three basic editing operations: INSERT, REMOVE and
REPLACE.
● An useful measure to quantify how similar (or dissimilar)
two trees are.
15
Example: Tree Edit Distance*
● Edit operations
● Normalized
distance
* Zhang, K., & Shasha, D.
(1989). Simple fast algorithms
for the editing distance
between trees and related
problems. SIAM journal on
computing,18(6), 1245-1262.
16
Style Similarity
● Have you noticed ?
○ Similar web pages have similar css styles
● XPath : ”//*[@class]/@class”
● Simple measure -
○ Jaccard Similarity on CSS class names
○
17
Web pages consists of :
● HTML ✓
● CSS ✓
● JavaScript ×
18
Aggregating the Style and Structure
● StructuralSimilarity : Normalized Tree Edit Distance
● StyleSimilarity : Jaccard Distance
● Combine on a linear scale
○ Aggregated = k . Structural + (1-k) Style
19
Implementation
20
Implementation
● Read Nutch’s Segements
○ sparkContext.sequneceFile(...)
● Filter web pages
○ Robust content type detection -- Tika
● Structural Similarity
○ HTML to DOM Tree -- NeckoHtml
○ Tree Edit Distance -- Zhang Shasha’s algorithm
21
Implementation …
● Style Similarity
○ Query CSS class names using Xpath
● Similarity Matrix
○ sparkContext.cartesian() to get nxn cells
○ Spark’s Distributed (Coordinate) Matrix
● Persist the matrix for later experimentation with
multiple thresholds
22
Clustering
● Shared Near Neighbor Clustering
○ Jarvis et al , 1973
● With improvements
○ Graph based Implementation
■ Spark GraphX for the win!
* Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared
near neighbors. Computers, IEEE Transactions on, 100(11), 1025-1034.
23
What’s good about this algorithm?
● What’s the difficulty with the most popular k-means?
○ Prior knowledge of clusters?
○ Mean/Average of documents in a cluster?
■ Average of DOM Trees?
■ Average of CSS styles?
○ Circular/Spherical/Globular shapes?
● Shared Near Neighbor Cluster
○ Similarity matrix - pluggable similarity measures - generic
○ Thresholds - numbers , percent of match
24
Shared Near Neighbor Algorithm
“If two data points share a threshold number of
neighbors, then they must belong to the same
cluster”
25
Clustering Implementation
● Similarity Matrix to Graph
○ Clusters as nodes, similarity measure as edges
● Check for Similar neighbors
○
○ Filter on threshold and Merge
■ Immutable! - new graph for next iteration
○ Repeat
26
Shared Near Neighbor Clustering on
Apache Spark GraphX
27
Challenges
● Tree Edit Distance is very expensive
28
What’s ahead on the road?
● Integrate to Apache Nutch
● Auto Extraction
○ Unsupervised learning on structure of pages and scrape
the actual data of the web page
● Faster Tree Edit Distance
○ May be with approximation techniques
29
Demo
30
Summary
● Example Scenario
● Similarity measures
● Clustering as a solution
● Demo
31
Acknowledgements
● Dr. Chris Mattmann
○ My mentor
○ Professor, Director at IRDS @ USC - http://irds.usc.edu
○ Director, Apache Software Foundation
● DARPA Memex project
32
Thank You!
● Source Code
● Tutorial
● Follow up
○ Thamme Gowda - @thammegowda
○ Chris Mattmann - @chrismattmann
33

Mais conteúdo relacionado

Mais procurados

Deriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF DataDeriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF DataGraph-TA
 
Analytics and Access to the UK web archive
Analytics and Access to the UK web archiveAnalytics and Access to the UK web archive
Analytics and Access to the UK web archiveLewis Crawford
 
Linked Open Data and DANS
Linked Open Data and DANSLinked Open Data and DANS
Linked Open Data and DANSvty
 
20170501 Distributed Network of Digital Heritage Information
20170501  Distributed Network of Digital Heritage Information20170501  Distributed Network of Digital Heritage Information
20170501 Distributed Network of Digital Heritage InformationEnno Meijers
 
Open data easy, explicit and fast
Open data easy, explicit and fastOpen data easy, explicit and fast
Open data easy, explicit and fastMetaSolutions AB
 
Scripting User Contributed Interlinking
Scripting User Contributed InterlinkingScripting User Contributed Interlinking
Scripting User Contributed Interlinkingwhalb
 
Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyondErnesto Reig
 
Linked data-tooling-xml
Linked data-tooling-xmlLinked data-tooling-xml
Linked data-tooling-xmlFelix Sasaki
 
Making social science more reproducible by encapsulating access to linked data
Making social science more reproducible by encapsulating access to linked dataMaking social science more reproducible by encapsulating access to linked data
Making social science more reproducible by encapsulating access to linked dataAlbert Meroño-Peñuela
 
NO SQL Databases, Big Data and the cloud
NO SQL Databases, Big Data and the cloudNO SQL Databases, Big Data and the cloud
NO SQL Databases, Big Data and the cloudManu Cohen-Yashar
 
Discovering Related Data Sources in Data Portals
Discovering Related Data Sources in Data PortalsDiscovering Related Data Sources in Data Portals
Discovering Related Data Sources in Data PortalsPeter Haase
 
Semantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologistsSemantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologistsdgarijo
 
Interaction with Linked Data
Interaction with Linked DataInteraction with Linked Data
Interaction with Linked DataEUCLID project
 
Improvement of no sql technology for relational databases v2
Improvement of no sql technology for relational databases v2Improvement of no sql technology for relational databases v2
Improvement of no sql technology for relational databases v2Tsendsuren Munkhdalai
 

Mais procurados (20)

Deriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF DataDeriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF Data
 
Graph database
Graph database Graph database
Graph database
 
Graph Database
Graph DatabaseGraph Database
Graph Database
 
Analytics and Access to the UK web archive
Analytics and Access to the UK web archiveAnalytics and Access to the UK web archive
Analytics and Access to the UK web archive
 
Linked Open Data and DANS
Linked Open Data and DANSLinked Open Data and DANS
Linked Open Data and DANS
 
20170501 Distributed Network of Digital Heritage Information
20170501  Distributed Network of Digital Heritage Information20170501  Distributed Network of Digital Heritage Information
20170501 Distributed Network of Digital Heritage Information
 
Open data easy, explicit and fast
Open data easy, explicit and fastOpen data easy, explicit and fast
Open data easy, explicit and fast
 
Scripting User Contributed Interlinking
Scripting User Contributed InterlinkingScripting User Contributed Interlinking
Scripting User Contributed Interlinking
 
Pandas
PandasPandas
Pandas
 
Publishing Linked Data using Schema.org
Publishing Linked Data using Schema.orgPublishing Linked Data using Schema.org
Publishing Linked Data using Schema.org
 
Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyond
 
Linked data-tooling-xml
Linked data-tooling-xmlLinked data-tooling-xml
Linked data-tooling-xml
 
Making social science more reproducible by encapsulating access to linked data
Making social science more reproducible by encapsulating access to linked dataMaking social science more reproducible by encapsulating access to linked data
Making social science more reproducible by encapsulating access to linked data
 
Providing Linked Data
Providing Linked DataProviding Linked Data
Providing Linked Data
 
DBPedia-past-present-future
DBPedia-past-present-futureDBPedia-past-present-future
DBPedia-past-present-future
 
NO SQL Databases, Big Data and the cloud
NO SQL Databases, Big Data and the cloudNO SQL Databases, Big Data and the cloud
NO SQL Databases, Big Data and the cloud
 
Discovering Related Data Sources in Data Portals
Discovering Related Data Sources in Data PortalsDiscovering Related Data Sources in Data Portals
Discovering Related Data Sources in Data Portals
 
Semantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologistsSemantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologists
 
Interaction with Linked Data
Interaction with Linked DataInteraction with Linked Data
Interaction with Linked Data
 
Improvement of no sql technology for relational databases v2
Improvement of no sql technology for relational databases v2Improvement of no sql technology for relational databases v2
Improvement of no sql technology for relational databases v2
 

Semelhante a Clustering output of Apache Nutch using Apache Spark

How to get started in Big Data for master's students
How to get started in Big Data for master's studentsHow to get started in Big Data for master's students
How to get started in Big Data for master's studentsMohamed Nadjib MAMI
 
OpenHPI - Parallel Programming Concepts - Week 6
OpenHPI - Parallel Programming Concepts - Week 6OpenHPI - Parallel Programming Concepts - Week 6
OpenHPI - Parallel Programming Concepts - Week 6Peter Tröger
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroDaniel Marcous
 
Brett Ragozzine - Graph Databases and Neo4j
Brett Ragozzine - Graph Databases and Neo4jBrett Ragozzine - Graph Databases and Neo4j
Brett Ragozzine - Graph Databases and Neo4jBrett Ragozzine
 
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkScalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkEvan Casey
 
Cassandra background-and-architecture
Cassandra background-and-architectureCassandra background-and-architecture
Cassandra background-and-architectureMarkus Klems
 
Distributed Decision Tree Induction
Distributed Decision Tree InductionDistributed Decision Tree Induction
Distributed Decision Tree Inductiongregoryg
 
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]Pramati Technologies
 
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Anant Corporation
 
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022ArangoDB Database
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriDemi Ben-Ari
 
Machine Learning + Graph Databases for Better Recommendations
Machine Learning + Graph Databases for Better RecommendationsMachine Learning + Graph Databases for Better Recommendations
Machine Learning + Graph Databases for Better RecommendationsChristopherWoodward16
 
Scalability broad strokes
Scalability   broad strokesScalability   broad strokes
Scalability broad strokesGagan Bajpai
 
Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022
Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022
Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022ArangoDB Database
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Milind Bhandarkar
 
Graph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingGraph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingSujit Pal
 
[ML]-Unsupervised-learning_Unit2.ppt.pdf
[ML]-Unsupervised-learning_Unit2.ppt.pdf[ML]-Unsupervised-learning_Unit2.ppt.pdf
[ML]-Unsupervised-learning_Unit2.ppt.pdf4NM20IS025BHUSHANNAY
 

Semelhante a Clustering output of Apache Nutch using Apache Spark (20)

How to get started in Big Data for master's students
How to get started in Big Data for master's studentsHow to get started in Big Data for master's students
How to get started in Big Data for master's students
 
OpenHPI - Parallel Programming Concepts - Week 6
OpenHPI - Parallel Programming Concepts - Week 6OpenHPI - Parallel Programming Concepts - Week 6
OpenHPI - Parallel Programming Concepts - Week 6
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
 
Brett Ragozzine - Graph Databases and Neo4j
Brett Ragozzine - Graph Databases and Neo4jBrett Ragozzine - Graph Databases and Neo4j
Brett Ragozzine - Graph Databases and Neo4j
 
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkScalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
 
Cassandra background-and-architecture
Cassandra background-and-architectureCassandra background-and-architecture
Cassandra background-and-architecture
 
Distributed Decision Tree Induction
Distributed Decision Tree InductionDistributed Decision Tree Induction
Distributed Decision Tree Induction
 
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
 
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
 
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
A Kaggle Talk
A Kaggle TalkA Kaggle Talk
A Kaggle Talk
 
Machine Learning + Graph Databases for Better Recommendations
Machine Learning + Graph Databases for Better RecommendationsMachine Learning + Graph Databases for Better Recommendations
Machine Learning + Graph Databases for Better Recommendations
 
Scalability broad strokes
Scalability   broad strokesScalability   broad strokes
Scalability broad strokes
 
Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022
Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022
Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
 
Data Science as Scale
Data Science as ScaleData Science as Scale
Data Science as Scale
 
Thesis presentation
Thesis presentationThesis presentation
Thesis presentation
 
Graph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingGraph Techniques for Natural Language Processing
Graph Techniques for Natural Language Processing
 
[ML]-Unsupervised-learning_Unit2.ppt.pdf
[ML]-Unsupervised-learning_Unit2.ppt.pdf[ML]-Unsupervised-learning_Unit2.ppt.pdf
[ML]-Unsupervised-learning_Unit2.ppt.pdf
 

Mais de Thamme Gowda

Thamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slidesThamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slidesThamme Gowda
 
Macro average: rare types are important too
Macro average: rare types are important tooMacro average: rare types are important too
Macro average: rare types are important tooThamme Gowda
 
500 languages to English Machine Translation Model
500 languages to English Machine Translation Model500 languages to English Machine Translation Model
500 languages to English Machine Translation ModelThamme Gowda
 
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]Thamme Gowda
 
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRGData Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRGThamme Gowda
 
Sparkler at spark summit east 2017
Sparkler at spark summit east 2017Sparkler at spark summit east 2017
Sparkler at spark summit east 2017Thamme Gowda
 
Thamme Gowda's Summer2016- NASA JPL Internship
Thamme Gowda's Summer2016- NASA JPL InternshipThamme Gowda's Summer2016- NASA JPL Internship
Thamme Gowda's Summer2016- NASA JPL InternshipThamme Gowda
 

Mais de Thamme Gowda (7)

Thamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slidesThamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slides
 
Macro average: rare types are important too
Macro average: rare types are important tooMacro average: rare types are important too
Macro average: rare types are important too
 
500 languages to English Machine Translation Model
500 languages to English Machine Translation Model500 languages to English Machine Translation Model
500 languages to English Machine Translation Model
 
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
 
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRGData Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
 
Sparkler at spark summit east 2017
Sparkler at spark summit east 2017Sparkler at spark summit east 2017
Sparkler at spark summit east 2017
 
Thamme Gowda's Summer2016- NASA JPL Internship
Thamme Gowda's Summer2016- NASA JPL InternshipThamme Gowda's Summer2016- NASA JPL Internship
Thamme Gowda's Summer2016- NASA JPL Internship
 

Último

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 

Último (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 

Clustering output of Apache Nutch using Apache Spark

  • 1. Clustering the output of Apache Nutch using Apache Spark Thamme Gowda N. Dr. Chris Mattmann May 12, 2016. Vancouver, Canada 1
  • 2. About ● ThammeGowda Narayanaswamy - TG in short - @thammegowda ○ Contributor to Apache Tika and Apache Nutch ○ Now - a grad student @ University of Southern California ○ Past - Technical Co-Founder @ Datoin - http://datoin.com ● Dr. Chris Mattmann @chrismattmann ○ Adj. Prof. and the director of IRDS group @ University of Southern California, Los Angeles ○ Director @ Apache Software Foundation ○ Chief Architect, NASA JPL 2
  • 3. Overview ● Problem Statement ● Clustering - a solution ● Structure and Style Similarity ● Shared Near Neighbor Clustering ● Scaling it up using Spark’s Distributed Matrices and GraphX ● A demo 3
  • 4. Audience ● Who crawls the web ● Who extracts data from web ● Who filters webpages ● likes to know - ○ web page structure and style similarity ○ shared near neighbor clustering 4
  • 5. Problem Statement ● Scraping data from online marketplaces ● Start with homepage → categories →listing pages → Actual stuff (Detail page) ● 5
  • 6. Sample set of web pages credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov 6
  • 7. Sample set of web pages credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov USELESS USELESS 7
  • 8. Sample set of web pages credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov USELESS USELESS REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS 8
  • 9. Sample set of web pages credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov USELESS USELESS REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS USEFUL FOR ANALYSIS USEFUL FOR ANALYSIS USEFUL FOR ANALYSIS 9
  • 10. Question : How do we solve this? Answer : Cluster the web pages 10
  • 11. Why Cluster? ● Separate the interesting web pages? ○ Drop uninteresting/noisy web pages ○ Categorical treatment of clusters ● Extract Structured data using XPath ○ Automated extraction using alignment 11
  • 12. Goal ● Group web pages that are similar ● Similar in terms of ○ CSS Styles ○ DOM Structure ● Toolkit for experimentation with various thresholds ○ % of similarity in style and/or structure ○ Nice visualizations 12
  • 13. How do we cluster? ● Based on similarity between pages ● Semantic similarity ○ meaning of the web pages ● Syntactic similarity ○ Web page structure, css styles ● This session has focus on syntactic aspect 13
  • 14. Structural similarity ● Web pages are built with HTML ● HTML Doc → DOM tree ● a labeled ordered tree ● Structural similarity using tree edit distance(TED) HTML HEAD BODY TITLE DIV P 14
  • 15. (Minimum) Tree Edit Distance ● Edit distance measure similar to strings, but on hierarchical data instead of sequences ● Number of editing operations required to transform one tree into another. ● Three basic editing operations: INSERT, REMOVE and REPLACE. ● An useful measure to quantify how similar (or dissimilar) two trees are. 15
  • 16. Example: Tree Edit Distance* ● Edit operations ● Normalized distance * Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM journal on computing,18(6), 1245-1262. 16
  • 17. Style Similarity ● Have you noticed ? ○ Similar web pages have similar css styles ● XPath : ”//*[@class]/@class” ● Simple measure - ○ Jaccard Similarity on CSS class names ○ 17
  • 18. Web pages consists of : ● HTML ✓ ● CSS ✓ ● JavaScript × 18
  • 19. Aggregating the Style and Structure ● StructuralSimilarity : Normalized Tree Edit Distance ● StyleSimilarity : Jaccard Distance ● Combine on a linear scale ○ Aggregated = k . Structural + (1-k) Style 19
  • 21. Implementation ● Read Nutch’s Segements ○ sparkContext.sequneceFile(...) ● Filter web pages ○ Robust content type detection -- Tika ● Structural Similarity ○ HTML to DOM Tree -- NeckoHtml ○ Tree Edit Distance -- Zhang Shasha’s algorithm 21
  • 22. Implementation … ● Style Similarity ○ Query CSS class names using Xpath ● Similarity Matrix ○ sparkContext.cartesian() to get nxn cells ○ Spark’s Distributed (Coordinate) Matrix ● Persist the matrix for later experimentation with multiple thresholds 22
  • 23. Clustering ● Shared Near Neighbor Clustering ○ Jarvis et al , 1973 ● With improvements ○ Graph based Implementation ■ Spark GraphX for the win! * Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. Computers, IEEE Transactions on, 100(11), 1025-1034. 23
  • 24. What’s good about this algorithm? ● What’s the difficulty with the most popular k-means? ○ Prior knowledge of clusters? ○ Mean/Average of documents in a cluster? ■ Average of DOM Trees? ■ Average of CSS styles? ○ Circular/Spherical/Globular shapes? ● Shared Near Neighbor Cluster ○ Similarity matrix - pluggable similarity measures - generic ○ Thresholds - numbers , percent of match 24
  • 25. Shared Near Neighbor Algorithm “If two data points share a threshold number of neighbors, then they must belong to the same cluster” 25
  • 26. Clustering Implementation ● Similarity Matrix to Graph ○ Clusters as nodes, similarity measure as edges ● Check for Similar neighbors ○ ○ Filter on threshold and Merge ■ Immutable! - new graph for next iteration ○ Repeat 26
  • 27. Shared Near Neighbor Clustering on Apache Spark GraphX 27
  • 28. Challenges ● Tree Edit Distance is very expensive 28
  • 29. What’s ahead on the road? ● Integrate to Apache Nutch ● Auto Extraction ○ Unsupervised learning on structure of pages and scrape the actual data of the web page ● Faster Tree Edit Distance ○ May be with approximation techniques 29
  • 31. Summary ● Example Scenario ● Similarity measures ● Clustering as a solution ● Demo 31
  • 32. Acknowledgements ● Dr. Chris Mattmann ○ My mentor ○ Professor, Director at IRDS @ USC - http://irds.usc.edu ○ Director, Apache Software Foundation ● DARPA Memex project 32
  • 33. Thank You! ● Source Code ● Tutorial ● Follow up ○ Thamme Gowda - @thammegowda ○ Chris Mattmann - @chrismattmann 33