SlideShare uma empresa Scribd logo
Information Retrieval
and Data Science
Thamme Gowda
@thammegowda
Karanjeet Singh
@_karanjeet
A	web-crawler	on	Apache	Spark
Feb	7-9,	2017Spark	Summit	East	2017,	Boston 1
SPARKLER
Dr. Chris Mattmann
@chrismattmann
https://github.com/USCDataScience/sparkler
Information Retrieval
and Data Science
ABOUT
2
Information	Retrieval	and	Data	Science	(IRDS)	Group
University	of	Southern	California,	Los	Angeles,	CA
Home	page:	https://irds.usc.edu Email:	irds-L@mymaillists.usc.edu
Thamme	Gowda Dr.	Chris	MattmannKaranjeet	Singh
Graduate	Student
@thammegowda
Graduate	Student
@karanjeet_tw
Director,	IRDS
@chrismattmann
Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
OVERVIEW
● About Sparkler
● Motivations for building Sparkler
● Sparkler technology stack, internals
● Features of Sparkler
● Dashboard
● Demo
● What’s Next ?
3Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
ABOUT:	SPARKLER
● New Open Source Web Crawler
•A bot program that can fetch resources from the web
● Name: Spark Crawler
● Inspired by Apache Nutch
● Like Nutch: Distributed crawler that can scale horizontally
● Unlike Nutch: Runs on top of Apache Spark
● Easy to deploy and easy to use
4Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
MOTIVATION	#1
● Challenges in DARPA MEMEX*
•MEMEX System has crawlers to fetch deep and dark web data
•ML based analysis to assist law keeping agencies
•Crawls are blackbox, we wanted real-time progress reports
● Dr. Chris Mattmann was considering an upgrade since 3 years
● Technology upgrade needed
5Feb	7-9,	2017Spark	Summit	East	2017,	Boston
* http://memex.jpl.nasa.gov/
Information Retrieval
and Data Science
WHY A NEW CRAWLER?
6
Modern	Hadoop	cluster	has	no	Hadoop	(Map-Reduce)	left	in	it!
https://twitter.com/cutting/status/796566255830503424
Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
MOTIVATION	#2
● Challenges at DATOIN
•Intro: Datoin.com is a distributed text analytics platform
•Late 2014 - migrated the infrastructure from Hadoop Map Reduce to
Apache Spark
•But the crawler component (powered by Apache Nutch) was left behind
● Met Dr. Chris Mattmann at USC in Web Search Engines class
•Enquired about his thoughts for running Nutch on Spark
7Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER: TECH STACK
● Batch crawling (similar to Apache Nutch)
● Apache Solr as crawl database
● Multi module Maven project with OSGi bundles
● Stream crawled content through Apache Kafka
● Parses everything using Apache Tika
● Crawl visualization - Banana
8Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER: INTERNALS & WORKFLOW
9Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER: CRAWLDB
10Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER: RDD
11Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER: LINKS PIPELINE
12Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER: OUTPUT CONSUMPTION
13Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER: FEATURES
14Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #1: Lucene/Solr powered Crawldb
● Crawldb needed indexing
•For real time analytics
•For instant visualizations
● This is internal data structure of sparkler
•Exposed over REST API
•Used by Sparkler-ui, the web application
● We chose Apache Solr
● Standalone Solr server or Solr cloud? Yes!
● Glued the crawldb and spark using CrawldbRDD
15Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #2: URL Partitioning
● Politeness
•Doesn’t hit same server too many times in distributed mode
● First version
•Group by: Host name
•Sort by: depth, score
● Customization is easy
•Write your own Solr query
•Take advantage of boosting to alter the ranking
● Partitions the dataset based on the above criteria
● Lazy evaluations and delay between the requests
•Performs parsing instead of waiting
•Inserts delay only when it is necessary
16Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #3: OSGI Plugins
● Plugins Interfaces are inspired by Nutch
● Plugins are developed as per Open Service Gateway Interface (OSGI)
● We chose Apache Felix implementation of OSGI
● Migrated a plugin from Nutch
•Regex URL Filter Plugin → The most used plugin in Nutch
● Added JavaScript plugin (described in the next slide)
● //TODO: Migrate more plugins from Nutch
•Mavenize nutch [NUTCH-2293]
17Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #4: JavaScript Rendering
● Java Script Execution* has first class support
•Allows Sparkler to crawl the Deep/Dark web too
● Distributable on Spark Cluster without pain
•Pure JVM based JavaScript engine
● This is an implementation of FetchFunction
● FetchFunction
•Stream<URL> → Stream<Content>
•Note: URLS are grouped by host
•Preserves cookies and reuses sessions for each iteration
18
Thanks to: Madhav Sharan
Member of USC IRDS* JBrowserDriver by MachinePublishers
Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #5: Output in Kafka Streams
● Crawler is sometimes input for the applications that does deeper analysis
•Can’t fit all those deeper analysis into crawler
● Integrating to such applications made easy via Queues
● We chose Apache Kafka
•Suits our need
•Distributable, Scalable, Fault Tolerant
● FIXME: Larger messages such as Videos
● This is optional, default output on Shared File System (such as HDFS),
compatible with Nutch
19
Thanks to: Rahul Palamuttam
MS CS @ Stanford University; Intern @ NASA JPL
Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #6: Tika, the universal parser
● Apache Tika
•Is a toolkit of parsers
•Detects and extracts metadata, text, and URLS
•Over a thousand different file types
● Main application is to discover outgoing links
● The default Implementation for our ParseFunction
20Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #7: Visual Analytics
● Charts and Graphs provides nice summary of crawl job
● Real time analytics
● Example:
•Distribution of URLS across hosts/domains
•Temporal activities
•Status reports
● Customizable in real time
● Using Banana Dashboard from Lucidworks
● Sparkler has a sub component named sparkler-ui
21
Thanks to: Manish Dwibedy
MS CS University of Southern California
Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #7 DASHBOARD
22Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #8: Deployment
● Docker
● Juju Charms
23
Thanks to: Tom Barber
Spicule Analytics & NASA-JPL
Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
SPARKLER #Next: What’s coming?
● Scoring Crawled Pages (Work in progress)
● Focused Crawling (Work in progress)
● Domain Discovery (Work in progress)
● Detailed documentation and tutorials on wiki (Work in progress)
● Interactive UI
● Crawl Graph Analysis
● Other useful plugins from Nutch
24Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Being used for Polar Deep Insights project
https://www.earthcube.org/group/polar-data-insights-search-analytics-deep-scientific-web
Information Retrieval
and Data Science
DEMO
25
https://github.com/USCDataScience/sparkler
Feb	7-9,	2017Spark	Summit	East	2017,	Boston
$ bin/dockler.sh
Information Retrieval
and Data Science
QUESTIONS?
26
https://github.com/USCDataScience/sparkler
Feb	7-9,	2017Spark	Summit	East	2017,	Boston
Information Retrieval
and Data Science
THANK YOU
27
https://github.com/USCDataScience/sparkler
Feb	7-9,	2017Spark	Summit	East	2017,	Boston

Mais conteúdo relacionado

Mais procurados

Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...తేజ దండిభట్ల
 
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with SparkSpark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with SparkDatabricks
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...Databricks
 
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...Modern Data Stack France
 
A compute infrastructure for data scientists
A compute infrastructure for data scientistsA compute infrastructure for data scientists
A compute infrastructure for data scientistsStitch Fix Algorithms
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkDatabricks
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Sparkler Presentation for Spark Summit East 2017
Sparkler Presentation for Spark Summit East 2017Sparkler Presentation for Spark Summit East 2017
Sparkler Presentation for Spark Summit East 2017Karanjeet Singh
 
Beginner Apache Spark Presentation
Beginner Apache Spark PresentationBeginner Apache Spark Presentation
Beginner Apache Spark PresentationNidhin Pattaniyil
 
Improving data interoperability in Python and R
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and RWes McKinney
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasWes McKinney
 
Integration of data ninja services with oracle spatial and graph
Integration of data ninja services with oracle spatial and graphIntegration of data ninja services with oracle spatial and graph
Integration of data ninja services with oracle spatial and graphData Ninja API
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dan Lynn
 
Linked Data, Ontologies and Inference
Linked Data, Ontologies and InferenceLinked Data, Ontologies and Inference
Linked Data, Ontologies and InferenceBarry Norton
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 

Mais procurados (18)

Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...
 
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with SparkSpark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
 
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
 
A compute infrastructure for data scientists
A compute infrastructure for data scientistsA compute infrastructure for data scientists
A compute infrastructure for data scientists
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with Spark
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Sparkler Presentation for Spark Summit East 2017
Sparkler Presentation for Spark Summit East 2017Sparkler Presentation for Spark Summit East 2017
Sparkler Presentation for Spark Summit East 2017
 
Beginner Apache Spark Presentation
Beginner Apache Spark PresentationBeginner Apache Spark Presentation
Beginner Apache Spark Presentation
 
Improving data interoperability in Python and R
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and R
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
 
Integration of data ninja services with oracle spatial and graph
Integration of data ninja services with oracle spatial and graphIntegration of data ninja services with oracle spatial and graph
Integration of data ninja services with oracle spatial and graph
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
 
Linked Data, Ontologies and Inference
Linked Data, Ontologies and InferenceLinked Data, Ontologies and Inference
Linked Data, Ontologies and Inference
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 

Destaque

Spark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object storesSpark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object storesSteve Loughran
 
Big Data Meets Learning Science: Keynote by Al Essa
Big Data Meets Learning Science: Keynote by Al EssaBig Data Meets Learning Science: Keynote by Al Essa
Big Data Meets Learning Science: Keynote by Al EssaSpark Summit
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Spark Summit
 
Real-time Supply Chain Analytics
Real-time Supply Chain AnalyticsReal-time Supply Chain Analytics
Real-time Supply Chain AnalyticsSigmoid
 
Building high scalable distributed framework on apache mesos
Building high scalable distributed framework on apache mesosBuilding high scalable distributed framework on apache mesos
Building high scalable distributed framework on apache mesosSigmoid
 
WEBSOCKETS AND WEBWORKERS
WEBSOCKETS AND WEBWORKERSWEBSOCKETS AND WEBWORKERS
WEBSOCKETS AND WEBWORKERSSigmoid
 
Equation solving-at-scale-using-apache-spark
Equation solving-at-scale-using-apache-sparkEquation solving-at-scale-using-apache-spark
Equation solving-at-scale-using-apache-sparkSigmoid
 
Angular js performance improvements
Angular js performance improvementsAngular js performance improvements
Angular js performance improvementsSigmoid
 
Failsafe Hadoop Infrastructure and the way they work
Failsafe Hadoop Infrastructure and the way they workFailsafe Hadoop Infrastructure and the way they work
Failsafe Hadoop Infrastructure and the way they workSigmoid
 
Graph computation
Graph computationGraph computation
Graph computationSigmoid
 
Productionizing spark
Productionizing sparkProductionizing spark
Productionizing sparkSigmoid
 
Spark and spark streaming internals
Spark and spark streaming internalsSpark and spark streaming internals
Spark and spark streaming internalsSigmoid
 
Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)Sigmoid
 
Composing and scaling data platforms
Composing and scaling data platformsComposing and scaling data platforms
Composing and scaling data platformsSigmoid
 
Introduction to apache nutch
Introduction to apache nutchIntroduction to apache nutch
Introduction to apache nutchSigmoid
 
Approaches to text analysis
Approaches to text analysisApproaches to text analysis
Approaches to text analysisSigmoid
 
Apache Spark and Object Stores
Apache Spark and Object StoresApache Spark and Object Stores
Apache Spark and Object StoresSteve Loughran
 
Joining Large data at Scale
Joining Large data at ScaleJoining Large data at Scale
Joining Large data at ScaleSigmoid
 
Tale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark StreamingTale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark StreamingSigmoid
 
Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Sigmoid
 

Destaque (20)

Spark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object storesSpark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object stores
 
Big Data Meets Learning Science: Keynote by Al Essa
Big Data Meets Learning Science: Keynote by Al EssaBig Data Meets Learning Science: Keynote by Al Essa
Big Data Meets Learning Science: Keynote by Al Essa
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
 
Real-time Supply Chain Analytics
Real-time Supply Chain AnalyticsReal-time Supply Chain Analytics
Real-time Supply Chain Analytics
 
Building high scalable distributed framework on apache mesos
Building high scalable distributed framework on apache mesosBuilding high scalable distributed framework on apache mesos
Building high scalable distributed framework on apache mesos
 
WEBSOCKETS AND WEBWORKERS
WEBSOCKETS AND WEBWORKERSWEBSOCKETS AND WEBWORKERS
WEBSOCKETS AND WEBWORKERS
 
Equation solving-at-scale-using-apache-spark
Equation solving-at-scale-using-apache-sparkEquation solving-at-scale-using-apache-spark
Equation solving-at-scale-using-apache-spark
 
Angular js performance improvements
Angular js performance improvementsAngular js performance improvements
Angular js performance improvements
 
Failsafe Hadoop Infrastructure and the way they work
Failsafe Hadoop Infrastructure and the way they workFailsafe Hadoop Infrastructure and the way they work
Failsafe Hadoop Infrastructure and the way they work
 
Graph computation
Graph computationGraph computation
Graph computation
 
Productionizing spark
Productionizing sparkProductionizing spark
Productionizing spark
 
Spark and spark streaming internals
Spark and spark streaming internalsSpark and spark streaming internals
Spark and spark streaming internals
 
Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)
 
Composing and scaling data platforms
Composing and scaling data platformsComposing and scaling data platforms
Composing and scaling data platforms
 
Introduction to apache nutch
Introduction to apache nutchIntroduction to apache nutch
Introduction to apache nutch
 
Approaches to text analysis
Approaches to text analysisApproaches to text analysis
Approaches to text analysis
 
Apache Spark and Object Stores
Apache Spark and Object StoresApache Spark and Object Stores
Apache Spark and Object Stores
 
Joining Large data at Scale
Joining Large data at ScaleJoining Large data at Scale
Joining Large data at Scale
 
Tale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark StreamingTale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark Streaming
 
Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith
 

Semelhante a Sparkler at spark summit east 2017

End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkBurak Yavuz
 
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...Jim Czuprynski
 
Incrementally streaming rdbms data to your data lake automagically
Incrementally streaming rdbms data to your data lake automagicallyIncrementally streaming rdbms data to your data lake automagically
Incrementally streaming rdbms data to your data lake automagicallyTimothy Spann
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksDatabricks
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark Summit
 
Secure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
Secure Streaming-as-a-Service with Kafka/Spark/Flink in HopsworksSecure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
Secure Streaming-as-a-Service with Kafka/Spark/Flink in HopsworksTheofilos Kakantousis
 
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...Evention
 
NEW LAUNCH! How to build graph applications with SPARQL and Gremlin using Ama...
NEW LAUNCH! How to build graph applications with SPARQL and Gremlin using Ama...NEW LAUNCH! How to build graph applications with SPARQL and Gremlin using Ama...
NEW LAUNCH! How to build graph applications with SPARQL and Gremlin using Ama...Amazon Web Services
 
Applying large scale text analytics with graph databases
Applying large scale text analytics with graph databasesApplying large scale text analytics with graph databases
Applying large scale text analytics with graph databasesData Ninja API
 
Processing genetic data at scale
Processing genetic data at scaleProcessing genetic data at scale
Processing genetic data at scaleMark Schroering
 
Devteach 2017 Store 2 million of audit a day into elasticsearch
Devteach 2017 Store 2 million of audit a day into elasticsearchDevteach 2017 Store 2 million of audit a day into elasticsearch
Devteach 2017 Store 2 million of audit a day into elasticsearchTaswar Bhatti
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
 
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Spark Summit
 
Machine Learning with SparkR
Machine Learning with SparkRMachine Learning with SparkR
Machine Learning with SparkROlgun Aydın
 
Python Web Scraper for ACM and Google Scholar.pptx
Python Web Scraper for ACM and Google Scholar.pptxPython Web Scraper for ACM and Google Scholar.pptx
Python Web Scraper for ACM and Google Scholar.pptxASIMKHAN840563
 
Hadoop databases for oracle DBAs
Hadoop databases for oracle DBAsHadoop databases for oracle DBAs
Hadoop databases for oracle DBAsMaxym Kharchenko
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...DataWorks Summit/Hadoop Summit
 
Analyzing Pwned Passwords with Spark - OWASP Meetup July 2018
Analyzing Pwned Passwords with Spark - OWASP Meetup July 2018Analyzing Pwned Passwords with Spark - OWASP Meetup July 2018
Analyzing Pwned Passwords with Spark - OWASP Meetup July 2018Kelley Robinson
 

Semelhante a Sparkler at spark summit east 2017 (20)

End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache Spark
 
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
 
Incrementally streaming rdbms data to your data lake automagically
Incrementally streaming rdbms data to your data lake automagicallyIncrementally streaming rdbms data to your data lake automagically
Incrementally streaming rdbms data to your data lake automagically
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on Databricks
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
 
Secure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
Secure Streaming-as-a-Service with Kafka/Spark/Flink in HopsworksSecure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
Secure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
 
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
 
NEW LAUNCH! How to build graph applications with SPARQL and Gremlin using Ama...
NEW LAUNCH! How to build graph applications with SPARQL and Gremlin using Ama...NEW LAUNCH! How to build graph applications with SPARQL and Gremlin using Ama...
NEW LAUNCH! How to build graph applications with SPARQL and Gremlin using Ama...
 
Applying large scale text analytics with graph databases
Applying large scale text analytics with graph databasesApplying large scale text analytics with graph databases
Applying large scale text analytics with graph databases
 
Processing genetic data at scale
Processing genetic data at scaleProcessing genetic data at scale
Processing genetic data at scale
 
Devteach 2017 Store 2 million of audit a day into elasticsearch
Devteach 2017 Store 2 million of audit a day into elasticsearchDevteach 2017 Store 2 million of audit a day into elasticsearch
Devteach 2017 Store 2 million of audit a day into elasticsearch
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
 
Machine Learning with SparkR
Machine Learning with SparkRMachine Learning with SparkR
Machine Learning with SparkR
 
Python Web Scraper for ACM and Google Scholar.pptx
Python Web Scraper for ACM and Google Scholar.pptxPython Web Scraper for ACM and Google Scholar.pptx
Python Web Scraper for ACM and Google Scholar.pptx
 
Hadoop databases for oracle DBAs
Hadoop databases for oracle DBAsHadoop databases for oracle DBAs
Hadoop databases for oracle DBAs
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Analyzing Pwned Passwords with Spark - OWASP Meetup July 2018
Analyzing Pwned Passwords with Spark - OWASP Meetup July 2018Analyzing Pwned Passwords with Spark - OWASP Meetup July 2018
Analyzing Pwned Passwords with Spark - OWASP Meetup July 2018
 

Último

OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024Shane Coughlan
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAlluxio, Inc.
 
Agnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in KrakówAgnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in Krakówbim.edu.pl
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAlluxio, Inc.
 
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems ApproachNeo4j
 
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...rajkumar669520
 
GraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisGraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisNeo4j
 
A Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data MigrationA Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data MigrationHelp Desk Migration
 
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)Gáspár Nagy
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowPeter Caitens
 
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...Abortion Clinic
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfOrtus Solutions, Corp
 
How to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabberHow to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabbereGrabber
 
Workforce Efficiency with Employee Time Tracking Software.pdf
Workforce Efficiency with Employee Time Tracking Software.pdfWorkforce Efficiency with Employee Time Tracking Software.pdf
Workforce Efficiency with Employee Time Tracking Software.pdfDeskTrack
 
10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdf10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdfkalichargn70th171
 
Studiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting softwareStudiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting softwareinfo611746
 
JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)Max Lee
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAlluxio, Inc.
 
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with StrimziStrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzisteffenkarlsson2
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar
 

Último (20)

OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning Framework
 
Agnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in KrakówAgnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in Kraków
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
 
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
 
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
 
GraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisGraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysis
 
A Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data MigrationA Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data Migration
 
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should Know
 
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
How to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabberHow to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabber
 
Workforce Efficiency with Employee Time Tracking Software.pdf
Workforce Efficiency with Employee Time Tracking Software.pdfWorkforce Efficiency with Employee Time Tracking Software.pdf
Workforce Efficiency with Employee Time Tracking Software.pdf
 
10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdf10 Essential Software Testing Tools You Need to Know About.pdf
10 Essential Software Testing Tools You Need to Know About.pdf
 
Studiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting softwareStudiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting software
 
JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
 
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with StrimziStrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 

Sparkler at spark summit east 2017

  • 1. Information Retrieval and Data Science Thamme Gowda @thammegowda Karanjeet Singh @_karanjeet A web-crawler on Apache Spark Feb 7-9, 2017Spark Summit East 2017, Boston 1 SPARKLER Dr. Chris Mattmann @chrismattmann https://github.com/USCDataScience/sparkler
  • 2. Information Retrieval and Data Science ABOUT 2 Information Retrieval and Data Science (IRDS) Group University of Southern California, Los Angeles, CA Home page: https://irds.usc.edu Email: irds-L@mymaillists.usc.edu Thamme Gowda Dr. Chris MattmannKaranjeet Singh Graduate Student @thammegowda Graduate Student @karanjeet_tw Director, IRDS @chrismattmann Feb 7-9, 2017Spark Summit East 2017, Boston
  • 3. Information Retrieval and Data Science OVERVIEW ● About Sparkler ● Motivations for building Sparkler ● Sparkler technology stack, internals ● Features of Sparkler ● Dashboard ● Demo ● What’s Next ? 3Feb 7-9, 2017Spark Summit East 2017, Boston
  • 4. Information Retrieval and Data Science ABOUT: SPARKLER ● New Open Source Web Crawler •A bot program that can fetch resources from the web ● Name: Spark Crawler ● Inspired by Apache Nutch ● Like Nutch: Distributed crawler that can scale horizontally ● Unlike Nutch: Runs on top of Apache Spark ● Easy to deploy and easy to use 4Feb 7-9, 2017Spark Summit East 2017, Boston
  • 5. Information Retrieval and Data Science MOTIVATION #1 ● Challenges in DARPA MEMEX* •MEMEX System has crawlers to fetch deep and dark web data •ML based analysis to assist law keeping agencies •Crawls are blackbox, we wanted real-time progress reports ● Dr. Chris Mattmann was considering an upgrade since 3 years ● Technology upgrade needed 5Feb 7-9, 2017Spark Summit East 2017, Boston * http://memex.jpl.nasa.gov/
  • 6. Information Retrieval and Data Science WHY A NEW CRAWLER? 6 Modern Hadoop cluster has no Hadoop (Map-Reduce) left in it! https://twitter.com/cutting/status/796566255830503424 Feb 7-9, 2017Spark Summit East 2017, Boston
  • 7. Information Retrieval and Data Science MOTIVATION #2 ● Challenges at DATOIN •Intro: Datoin.com is a distributed text analytics platform •Late 2014 - migrated the infrastructure from Hadoop Map Reduce to Apache Spark •But the crawler component (powered by Apache Nutch) was left behind ● Met Dr. Chris Mattmann at USC in Web Search Engines class •Enquired about his thoughts for running Nutch on Spark 7Feb 7-9, 2017Spark Summit East 2017, Boston
  • 8. Information Retrieval and Data Science SPARKLER: TECH STACK ● Batch crawling (similar to Apache Nutch) ● Apache Solr as crawl database ● Multi module Maven project with OSGi bundles ● Stream crawled content through Apache Kafka ● Parses everything using Apache Tika ● Crawl visualization - Banana 8Feb 7-9, 2017Spark Summit East 2017, Boston
  • 9. Information Retrieval and Data Science SPARKLER: INTERNALS & WORKFLOW 9Feb 7-9, 2017Spark Summit East 2017, Boston
  • 10. Information Retrieval and Data Science SPARKLER: CRAWLDB 10Feb 7-9, 2017Spark Summit East 2017, Boston
  • 11. Information Retrieval and Data Science SPARKLER: RDD 11Feb 7-9, 2017Spark Summit East 2017, Boston
  • 12. Information Retrieval and Data Science SPARKLER: LINKS PIPELINE 12Feb 7-9, 2017Spark Summit East 2017, Boston
  • 13. Information Retrieval and Data Science SPARKLER: OUTPUT CONSUMPTION 13Feb 7-9, 2017Spark Summit East 2017, Boston
  • 14. Information Retrieval and Data Science SPARKLER: FEATURES 14Feb 7-9, 2017Spark Summit East 2017, Boston
  • 15. Information Retrieval and Data Science SPARKLER #1: Lucene/Solr powered Crawldb ● Crawldb needed indexing •For real time analytics •For instant visualizations ● This is internal data structure of sparkler •Exposed over REST API •Used by Sparkler-ui, the web application ● We chose Apache Solr ● Standalone Solr server or Solr cloud? Yes! ● Glued the crawldb and spark using CrawldbRDD 15Feb 7-9, 2017Spark Summit East 2017, Boston
  • 16. Information Retrieval and Data Science SPARKLER #2: URL Partitioning ● Politeness •Doesn’t hit same server too many times in distributed mode ● First version •Group by: Host name •Sort by: depth, score ● Customization is easy •Write your own Solr query •Take advantage of boosting to alter the ranking ● Partitions the dataset based on the above criteria ● Lazy evaluations and delay between the requests •Performs parsing instead of waiting •Inserts delay only when it is necessary 16Feb 7-9, 2017Spark Summit East 2017, Boston
  • 17. Information Retrieval and Data Science SPARKLER #3: OSGI Plugins ● Plugins Interfaces are inspired by Nutch ● Plugins are developed as per Open Service Gateway Interface (OSGI) ● We chose Apache Felix implementation of OSGI ● Migrated a plugin from Nutch •Regex URL Filter Plugin → The most used plugin in Nutch ● Added JavaScript plugin (described in the next slide) ● //TODO: Migrate more plugins from Nutch •Mavenize nutch [NUTCH-2293] 17Feb 7-9, 2017Spark Summit East 2017, Boston
  • 18. Information Retrieval and Data Science SPARKLER #4: JavaScript Rendering ● Java Script Execution* has first class support •Allows Sparkler to crawl the Deep/Dark web too ● Distributable on Spark Cluster without pain •Pure JVM based JavaScript engine ● This is an implementation of FetchFunction ● FetchFunction •Stream<URL> → Stream<Content> •Note: URLS are grouped by host •Preserves cookies and reuses sessions for each iteration 18 Thanks to: Madhav Sharan Member of USC IRDS* JBrowserDriver by MachinePublishers Feb 7-9, 2017Spark Summit East 2017, Boston
  • 19. Information Retrieval and Data Science SPARKLER #5: Output in Kafka Streams ● Crawler is sometimes input for the applications that does deeper analysis •Can’t fit all those deeper analysis into crawler ● Integrating to such applications made easy via Queues ● We chose Apache Kafka •Suits our need •Distributable, Scalable, Fault Tolerant ● FIXME: Larger messages such as Videos ● This is optional, default output on Shared File System (such as HDFS), compatible with Nutch 19 Thanks to: Rahul Palamuttam MS CS @ Stanford University; Intern @ NASA JPL Feb 7-9, 2017Spark Summit East 2017, Boston
  • 20. Information Retrieval and Data Science SPARKLER #6: Tika, the universal parser ● Apache Tika •Is a toolkit of parsers •Detects and extracts metadata, text, and URLS •Over a thousand different file types ● Main application is to discover outgoing links ● The default Implementation for our ParseFunction 20Feb 7-9, 2017Spark Summit East 2017, Boston
  • 21. Information Retrieval and Data Science SPARKLER #7: Visual Analytics ● Charts and Graphs provides nice summary of crawl job ● Real time analytics ● Example: •Distribution of URLS across hosts/domains •Temporal activities •Status reports ● Customizable in real time ● Using Banana Dashboard from Lucidworks ● Sparkler has a sub component named sparkler-ui 21 Thanks to: Manish Dwibedy MS CS University of Southern California Feb 7-9, 2017Spark Summit East 2017, Boston
  • 22. Information Retrieval and Data Science SPARKLER #7 DASHBOARD 22Feb 7-9, 2017Spark Summit East 2017, Boston
  • 23. Information Retrieval and Data Science SPARKLER #8: Deployment ● Docker ● Juju Charms 23 Thanks to: Tom Barber Spicule Analytics & NASA-JPL Feb 7-9, 2017Spark Summit East 2017, Boston
  • 24. Information Retrieval and Data Science SPARKLER #Next: What’s coming? ● Scoring Crawled Pages (Work in progress) ● Focused Crawling (Work in progress) ● Domain Discovery (Work in progress) ● Detailed documentation and tutorials on wiki (Work in progress) ● Interactive UI ● Crawl Graph Analysis ● Other useful plugins from Nutch 24Feb 7-9, 2017Spark Summit East 2017, Boston Being used for Polar Deep Insights project https://www.earthcube.org/group/polar-data-insights-search-analytics-deep-scientific-web
  • 25. Information Retrieval and Data Science DEMO 25 https://github.com/USCDataScience/sparkler Feb 7-9, 2017Spark Summit East 2017, Boston $ bin/dockler.sh
  • 26. Information Retrieval and Data Science QUESTIONS? 26 https://github.com/USCDataScience/sparkler Feb 7-9, 2017Spark Summit East 2017, Boston
  • 27. Information Retrieval and Data Science THANK YOU 27 https://github.com/USCDataScience/sparkler Feb 7-9, 2017Spark Summit East 2017, Boston