SlideShare uma empresa Scribd logo
1 de 44
Baixar para ler offline
Building a Dataset Search
Engine with Spark and
Elasticsearch
Oscar Castañeda-Villagrán
Xoom a PayPal service
About
• Data Scientist at Xoom a PayPal service.
• Interests:
• Data Management,
• Dataset Search,
• Online Learning to Rank.
Spark cluster …
http://bit.ly/2em6RUK
Spark cluster with …
http://bit.ly/2ebM9HO
Spark cluster with Elasticsearch …
Spark cluster with Elasticsearch Inside
And …
Spark cluster with Elasticsearch Inside
And …
And Indexed …
Spark cluster with Elasticsearch Inside
And Indexed RDatasets
Spark cluster with Elasticsearch Inside
Agenda
• Problem Statement, Overview and Motivation.
• Elasticsearch Server inside Spark Cluster.
• Metadata extraction. (Write to ES + Snapshot index.)
• Demo: RDataset Search Engine with Spark and Elasticsearch.
• Q&A
Problem Statement
• Despite Datasets being a key corporate
asset they are generally not given the
importance they deserve in terms of
making them easy to find.
Questions
• How are Datasets produced?
• Can indexing be performed when
Dataset are produced?
Overview
Rdatasets
Take ES
snapshot
Restore ES snapshot
http://bit.ly/2e5H1jL
Overview: Metadata Extraction
Rdatasets
Data Pipelines:
•Extract: Filename, URLs, Column headers.
•Extract: Type information.
•Index Metadata.
Extract Filename
Extract Type information.
Index metadata.
Extract Column headers
Extract URLs
Overview
Data Lake
Overview
Data Lake
Motivation
• Organizing and indexing Datasets and Data Lake(s).
• Spark with Elasticsearch Inside can be used to index datasets
that are produced as part of Spark data pipelines.
• “Replaying Datasets” in Data Lake(s) can be leveraged to create
a dataset index (a posteriori vs. post hoc (Halevy et al., 2016)).
• ES snapshot is a useful means to deploy Dataset search index in
production (out of Spark with Elasticsearch Inside).
Organizing and Indexing Datasets
Index
dataset
features
Input data
Data Pipeline
Create dataset
profile pages.
Using Tableau Javascript APIES Cluster
Fetch dataset from Datalake and run Data Pipeline on demand.
Search
Fetch dataset
from Datalake.
Dataset profile Team Dashboard
Organizing and Indexing Datasets
Input data
Data Pipeline
Organizing and Indexing Datasets
Input data
ES Cluster
Search
Data Pipeline
Index
dataset
features
Organizing and Indexing Datasets
Input data
ES Cluster
Search
Create dataset
profile pages.
Dataset profile
Data Pipeline
Index
dataset
features
Organizing and Indexing Datasets
Input data
ES Cluster
Search
Create dataset
profile pages.
Using Tableau Javascript API
Dataset profile
Data Pipeline
Index
dataset
features
Organizing and Indexing Datasets
Input data
ES Cluster
Search
Create dataset
profile pages.
Using Tableau Javascript API
Dataset profile Team Dashboard
Data Pipeline
Index
dataset
features
Organizing and Indexing Datasets
Input data
ES Cluster
Search
Create dataset
profile pages.
Using Tableau Javascript API
Dataset profile Team Dashboard
Fetch dataset
from Datalake.
Data Pipeline
Index
dataset
features
Organizing and Indexing Datasets
Index
dataset
features
Input data
Data Pipeline
Create dataset
profile pages.
Using Tableau Javascript APIES Cluster
Fetch dataset from Datalake and run Data Pipeline on demand.
Dataset profile Team Dashboard
Search
Fetch dataset
from Datalake.
How do you run Elasticsearch
inside Spark Cluster?
Preliminary Notes
http://bit.ly/2ebM9HO
• Embedded Elasticsearch 5 not supported.
More information:
• https://www.elastic.co/blog/elasticsearch-the-server
• https://github.com/elastic/elasticsearch/issues/19903
Imports
http://bit.ly/2efaib4
http://bit.ly/2di0cFq
http://bit.ly/2ebM9HO
Setup Local ES
server.start()
Metadata Extraction
E.g. do not interpret first line as header
• Only first row
• Split by comma
• Remove additional characters
• Extract column-header names.
Result = Comma separated list of column-header names.
E.g. do interpret first line as header
• run ‘dtypes’ on Data Frame
• Split by comma
• Remove additional characters
• Extract column-header types.
Result = Comma separated list of column-header types.
Writing to local ES
saveToEs("data/set")
Check results on local ES
GET
getUrlAsString(“http://<YOUR-IP-ADDRESS>:9200/_cat/indicies?v”)
Snapshot to S3
Demo!
A posteriori vs. Post-hoc
• Halevy et al (2016) advocate finding data in a
post-hoc manner by collecting and aggregating
metadata after datasets are created or updated.
• I propose an a posteriori approach where
metadata is generated and indexed using Spark
as part of running pipelines.
Pros: A posteriori (1)
• Gathering metadata and recovering dataset
semantics does not require separate processing.
• Dataset normalization to a single “dataset” concept
becomes feasible.
• More granular metrics available to evaluate
metadata regeneration.
Pros: A posteriori (2)
• Granular (per-pipeline) access to recompute dataset
importance based on feedback from user interaction.
• E.g. Datasets with increased user interaction call
for more in-depth metadata extraction.
• Granular (per-pipeline) access to status metadata.
Cons: A posteriori (2)
• Looking at trees instead of the forest.
• Cluster-type (or higher order) analyses limited in-
pipeline (tunnel vision).
• Need to replay indexing pipeline when things
change (per data pipeline).
What have we seen?
• How to create ES Server inside Spark Cluster.
• How to write to Elasticsearch index + snapshot to S3.
• How to extract metadata inside data pipelines.
• Demo: RDataset Search Engine with Spark and
Elasticsearch.
Next Steps (1)
• Replicate on embedded Solr using spark-solr plugin to
read/write from/to Spark RDD’s instead of Elasticsearch.
• Naïvely implement Online Learning to Rank (OLTR)
on Dataset index using Solr Online Learning to Rank Plugin
(Hofmann, 2013) [1].
• Describe Datasets in a structured schema.org way
using Data Catalog Vocabulary [2].
[1] https://bitbucket.org/ilps/solr-online-learning-to-rank-plug-in
[2] http://www.w3.org/TR/vocab-dcat/
Next Steps (2)
• Develop methods to extract relations among Datasets and rank
those using OLTR (interleave (Hofmann, 2013) or multileave (Schuth,
2016)) in place of human relevance judgements (as with featureRank in
(Balakrishnan, et al. 2015)).
• Build a knowledge graph and use GraphX to extract insights. (Useful
e.g. for column concept determination (Deng et al. 2013)).
• Build topic models based on structured Datasets using Glint to
perform scalable topic model extraction in Spark (Jagerman and Eickhoff,
2016) [1].
[1] https://spark-summit.org/eu-2016/events/glint-an-asynchronous-parameter-server-for-spark/
References
• Alon Y. Halevy, Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang.
Goods: Organizing google’s datasets. In Fatmañzcan, Georgia Koutrika, and Sam Madden, editors, Proceedings of the 2016
International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01,
2016, pages 795–806. ACM, 2016. ISBN 978-1-4503-3531-7. doi: http://doi.acm.org/10.1145/2882903.2903730.
• Katja Hofmann. Fast and Reliable Online Learning to Rank for Information Retrieval. PhD thesis, Informatics Institute,
University of Amsterdam, May 2013.
• Rolf Jagerman and Carsten Eickhoff. Web-scale topic models in spark: An asynchronous parameter server. CoRR, abs/
1605.07422, 2016. URL http://arxiv.org/abs/1605. 07422.
• Dong Deng, Yu Jiang, Guoliang Li, Jian Li, and Cong Yu. Scalable column concept de- termination for web tables using large
knowledge bases. PVLDB, 6(13):1606–1617, 2013. doi: http://www.vldb.org/pvldb/vol6/p1606-li.pdf.
• Anne Schuth, Harrie Oosterhuis, Shimon Whiteson, and Maarten de Rijke. Multileave gradient descent for fast online learning to
rank. In WSDM 2016: The 9th International Conference on Web Search and Data Mining, pages 457-466. ACM, February 2016.
• Sreeram Balakrishnan, Alon Y. Halevy, Boulos Harb, Hongrae Lee, Jayant Madhavan, Afshin Rostamizadeh, Warren Shen,
Kenneth Wilder, Fei Wu 0003, and Cong Yu. Ap- plying webtables in practice. In CIDR 2015, Seventh Biennial Conference on
Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings. www.cidrdb.org, 2015.
Q&A
Thank You.
Email: ocastaneda@paypal.com
Twitter: @oscar_castaneda

Mais conteúdo relacionado

Mais procurados

The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDatabricks
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkKazuaki Ishizaki
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
PLPgSqL- Datatypes, Language structure.pptx
PLPgSqL- Datatypes, Language structure.pptxPLPgSqL- Datatypes, Language structure.pptx
PLPgSqL- Datatypes, Language structure.pptxjohnwick814916
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IODatabricks
 
Dynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationDynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationOri Reshef
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDatabricks
 
Oracle Database in-Memory Overivew
Oracle Database in-Memory OverivewOracle Database in-Memory Overivew
Oracle Database in-Memory OverivewMaria Colgan
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQLDatabricks
 
What to Expect From Oracle database 19c
What to Expect From Oracle database 19cWhat to Expect From Oracle database 19c
What to Expect From Oracle database 19cMaria Colgan
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseDatabricks
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan BlueDatabricks
 

Mais procurados (20)

The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
PLPgSqL- Datatypes, Language structure.pptx
PLPgSqL- Datatypes, Language structure.pptxPLPgSqL- Datatypes, Language structure.pptx
PLPgSqL- Datatypes, Language structure.pptx
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 
Dynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationDynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisation
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
 
Oracle Database in-Memory Overivew
Oracle Database in-Memory OverivewOracle Database in-Memory Overivew
Oracle Database in-Memory Overivew
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
What to Expect From Oracle database 19c
What to Expect From Oracle database 19cWhat to Expect From Oracle database 19c
What to Expect From Oracle database 19c
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming Analytics
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
 

Destaque

What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...Spark Summit
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkDatabricks
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Spark Summit
 
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...Spark Summit
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Spark Summit
 
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Spark Summit
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Spark Summit
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark Summit
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Spark Summit
 

Destaque (10)

What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
 
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
 
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
 

Semelhante a Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit East talk by Oscar Castaneda Villagran

Learning to Rank Datasets for Search with Oscar Castaneda
Learning to Rank Datasets for Search with Oscar CastanedaLearning to Rank Datasets for Search with Oscar Castaneda
Learning to Rank Datasets for Search with Oscar CastanedaDatabricks
 
Solr and ElasticSearch demo and speaker feb 2014
Solr  and ElasticSearch demo and speaker feb 2014Solr  and ElasticSearch demo and speaker feb 2014
Solr and ElasticSearch demo and speaker feb 2014nkabra
 
Meetup070416 Presentations
Meetup070416 PresentationsMeetup070416 Presentations
Meetup070416 PresentationsAna Rebelo
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Tech Triveni
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkBurak Yavuz
 
Getting Started with Elasticsearch
Getting Started with ElasticsearchGetting Started with Elasticsearch
Getting Started with ElasticsearchAlibaba Cloud
 
Bioschemas Workshop
Bioschemas WorkshopBioschemas Workshop
Bioschemas WorkshopNiall Beard
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesDataWorks Summit
 
Semantic framework for web scraping.
Semantic framework for web scraping.Semantic framework for web scraping.
Semantic framework for web scraping.Shyjal Raazi
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSyed Hadoop
 
Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks Nishant Gandhi
 
Configuring elasticsearch for performance and scale
Configuring elasticsearch for performance and scaleConfiguring elasticsearch for performance and scale
Configuring elasticsearch for performance and scaleBharvi Dixit
 
Analyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The CloudAnalyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The CloudRobert Dempsey
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark Summit
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceeRic Choo
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014ALTER WAY
 
Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyondErnesto Reig
 
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...Michael Rys
 
Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...Ken Karapetyan
 

Semelhante a Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit East talk by Oscar Castaneda Villagran (20)

Learning to Rank Datasets for Search with Oscar Castaneda
Learning to Rank Datasets for Search with Oscar CastanedaLearning to Rank Datasets for Search with Oscar Castaneda
Learning to Rank Datasets for Search with Oscar Castaneda
 
Solr and ElasticSearch demo and speaker feb 2014
Solr  and ElasticSearch demo and speaker feb 2014Solr  and ElasticSearch demo and speaker feb 2014
Solr and ElasticSearch demo and speaker feb 2014
 
Meetup070416 Presentations
Meetup070416 PresentationsMeetup070416 Presentations
Meetup070416 Presentations
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache Spark
 
Getting Started with Elasticsearch
Getting Started with ElasticsearchGetting Started with Elasticsearch
Getting Started with Elasticsearch
 
Bioschemas Workshop
Bioschemas WorkshopBioschemas Workshop
Bioschemas Workshop
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companies
 
Semantic framework for web scraping.
Semantic framework for web scraping.Semantic framework for web scraping.
Semantic framework for web scraping.
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.com
 
Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks
 
Configuring elasticsearch for performance and scale
Configuring elasticsearch for performance and scaleConfiguring elasticsearch for performance and scale
Configuring elasticsearch for performance and scale
 
Analyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The CloudAnalyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The Cloud
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
 
Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyond
 
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
 
Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...
 

Mais de Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Mais de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Último

Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx9to5mart
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...amitlee9823
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 

Último (20)

Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 

Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit East talk by Oscar Castaneda Villagran

  • 1. Building a Dataset Search Engine with Spark and Elasticsearch Oscar Castañeda-Villagrán Xoom a PayPal service
  • 2. About • Data Scientist at Xoom a PayPal service. • Interests: • Data Management, • Dataset Search, • Online Learning to Rank.
  • 4. Spark cluster with … http://bit.ly/2ebM9HO
  • 5. Spark cluster with Elasticsearch …
  • 6. Spark cluster with Elasticsearch Inside And …
  • 7. Spark cluster with Elasticsearch Inside And …
  • 8. And Indexed … Spark cluster with Elasticsearch Inside
  • 9. And Indexed RDatasets Spark cluster with Elasticsearch Inside
  • 10. Agenda • Problem Statement, Overview and Motivation. • Elasticsearch Server inside Spark Cluster. • Metadata extraction. (Write to ES + Snapshot index.) • Demo: RDataset Search Engine with Spark and Elasticsearch. • Q&A
  • 11. Problem Statement • Despite Datasets being a key corporate asset they are generally not given the importance they deserve in terms of making them easy to find.
  • 12. Questions • How are Datasets produced? • Can indexing be performed when Dataset are produced?
  • 13. Overview Rdatasets Take ES snapshot Restore ES snapshot http://bit.ly/2e5H1jL
  • 14. Overview: Metadata Extraction Rdatasets Data Pipelines: •Extract: Filename, URLs, Column headers. •Extract: Type information. •Index Metadata. Extract Filename Extract Type information. Index metadata. Extract Column headers Extract URLs
  • 17. Motivation • Organizing and indexing Datasets and Data Lake(s). • Spark with Elasticsearch Inside can be used to index datasets that are produced as part of Spark data pipelines. • “Replaying Datasets” in Data Lake(s) can be leveraged to create a dataset index (a posteriori vs. post hoc (Halevy et al., 2016)). • ES snapshot is a useful means to deploy Dataset search index in production (out of Spark with Elasticsearch Inside).
  • 18. Organizing and Indexing Datasets Index dataset features Input data Data Pipeline Create dataset profile pages. Using Tableau Javascript APIES Cluster Fetch dataset from Datalake and run Data Pipeline on demand. Search Fetch dataset from Datalake. Dataset profile Team Dashboard
  • 19. Organizing and Indexing Datasets Input data Data Pipeline
  • 20. Organizing and Indexing Datasets Input data ES Cluster Search Data Pipeline Index dataset features
  • 21. Organizing and Indexing Datasets Input data ES Cluster Search Create dataset profile pages. Dataset profile Data Pipeline Index dataset features
  • 22. Organizing and Indexing Datasets Input data ES Cluster Search Create dataset profile pages. Using Tableau Javascript API Dataset profile Data Pipeline Index dataset features
  • 23. Organizing and Indexing Datasets Input data ES Cluster Search Create dataset profile pages. Using Tableau Javascript API Dataset profile Team Dashboard Data Pipeline Index dataset features
  • 24. Organizing and Indexing Datasets Input data ES Cluster Search Create dataset profile pages. Using Tableau Javascript API Dataset profile Team Dashboard Fetch dataset from Datalake. Data Pipeline Index dataset features
  • 25. Organizing and Indexing Datasets Index dataset features Input data Data Pipeline Create dataset profile pages. Using Tableau Javascript APIES Cluster Fetch dataset from Datalake and run Data Pipeline on demand. Dataset profile Team Dashboard Search Fetch dataset from Datalake.
  • 26. How do you run Elasticsearch inside Spark Cluster?
  • 27. Preliminary Notes http://bit.ly/2ebM9HO • Embedded Elasticsearch 5 not supported. More information: • https://www.elastic.co/blog/elasticsearch-the-server • https://github.com/elastic/elasticsearch/issues/19903
  • 30. Metadata Extraction E.g. do not interpret first line as header • Only first row • Split by comma • Remove additional characters • Extract column-header names. Result = Comma separated list of column-header names. E.g. do interpret first line as header • run ‘dtypes’ on Data Frame • Split by comma • Remove additional characters • Extract column-header types. Result = Comma separated list of column-header types.
  • 31. Writing to local ES saveToEs("data/set")
  • 32. Check results on local ES GET getUrlAsString(“http://<YOUR-IP-ADDRESS>:9200/_cat/indicies?v”)
  • 34. Demo!
  • 35. A posteriori vs. Post-hoc • Halevy et al (2016) advocate finding data in a post-hoc manner by collecting and aggregating metadata after datasets are created or updated. • I propose an a posteriori approach where metadata is generated and indexed using Spark as part of running pipelines.
  • 36. Pros: A posteriori (1) • Gathering metadata and recovering dataset semantics does not require separate processing. • Dataset normalization to a single “dataset” concept becomes feasible. • More granular metrics available to evaluate metadata regeneration.
  • 37. Pros: A posteriori (2) • Granular (per-pipeline) access to recompute dataset importance based on feedback from user interaction. • E.g. Datasets with increased user interaction call for more in-depth metadata extraction. • Granular (per-pipeline) access to status metadata.
  • 38. Cons: A posteriori (2) • Looking at trees instead of the forest. • Cluster-type (or higher order) analyses limited in- pipeline (tunnel vision). • Need to replay indexing pipeline when things change (per data pipeline).
  • 39. What have we seen? • How to create ES Server inside Spark Cluster. • How to write to Elasticsearch index + snapshot to S3. • How to extract metadata inside data pipelines. • Demo: RDataset Search Engine with Spark and Elasticsearch.
  • 40. Next Steps (1) • Replicate on embedded Solr using spark-solr plugin to read/write from/to Spark RDD’s instead of Elasticsearch. • Naïvely implement Online Learning to Rank (OLTR) on Dataset index using Solr Online Learning to Rank Plugin (Hofmann, 2013) [1]. • Describe Datasets in a structured schema.org way using Data Catalog Vocabulary [2]. [1] https://bitbucket.org/ilps/solr-online-learning-to-rank-plug-in [2] http://www.w3.org/TR/vocab-dcat/
  • 41. Next Steps (2) • Develop methods to extract relations among Datasets and rank those using OLTR (interleave (Hofmann, 2013) or multileave (Schuth, 2016)) in place of human relevance judgements (as with featureRank in (Balakrishnan, et al. 2015)). • Build a knowledge graph and use GraphX to extract insights. (Useful e.g. for column concept determination (Deng et al. 2013)). • Build topic models based on structured Datasets using Glint to perform scalable topic model extraction in Spark (Jagerman and Eickhoff, 2016) [1]. [1] https://spark-summit.org/eu-2016/events/glint-an-asynchronous-parameter-server-for-spark/
  • 42. References • Alon Y. Halevy, Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. Goods: Organizing google’s datasets. In Fatmañzcan, Georgia Koutrika, and Sam Madden, editors, Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 795–806. ACM, 2016. ISBN 978-1-4503-3531-7. doi: http://doi.acm.org/10.1145/2882903.2903730. • Katja Hofmann. Fast and Reliable Online Learning to Rank for Information Retrieval. PhD thesis, Informatics Institute, University of Amsterdam, May 2013. • Rolf Jagerman and Carsten Eickhoff. Web-scale topic models in spark: An asynchronous parameter server. CoRR, abs/ 1605.07422, 2016. URL http://arxiv.org/abs/1605. 07422. • Dong Deng, Yu Jiang, Guoliang Li, Jian Li, and Cong Yu. Scalable column concept de- termination for web tables using large knowledge bases. PVLDB, 6(13):1606–1617, 2013. doi: http://www.vldb.org/pvldb/vol6/p1606-li.pdf. • Anne Schuth, Harrie Oosterhuis, Shimon Whiteson, and Maarten de Rijke. Multileave gradient descent for fast online learning to rank. In WSDM 2016: The 9th International Conference on Web Search and Data Mining, pages 457-466. ACM, February 2016. • Sreeram Balakrishnan, Alon Y. Halevy, Boulos Harb, Hongrae Lee, Jayant Madhavan, Afshin Rostamizadeh, Warren Shen, Kenneth Wilder, Fei Wu 0003, and Cong Yu. Ap- plying webtables in practice. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings. www.cidrdb.org, 2015.
  • 43. Q&A