Oscar Castañeda, Xoom a PayPal Service
Learning to Rank Datasets
for Search
#SAISDS8
About
• Data Scientist at Xoom a PayPal service.
• Interests:
• Data Management,
• Dataset Search,
• Learning to Rank.
Spark cluster with Elasticsearch
http://bit.ly/2em6RUK
http://bit.ly/2ebM9HO
And Indexed RDatasets
Spark cluster with Elasticsearch Inside
Learning to Rank Datasets
for Search!
Agenda
• Problem Statement and Motivation
• Elasticsearch Learning to Rank
• Data Pipeline: metadata extraction, judgement list extraction
• Demo: Beginnings of a Dataset Search Engine with machine-learned
relevance ranking.
• Q&A
Problem Statement (1)
• Despite being a key corporate asset,
datasets are generally not given the
importance they deserve and, as a result,
they are hard to find.
Problem Statement (2)
• Specifically, teams within organizations
have a hard time finding datasets
relevant to their function.
Topics
• Indexing (Spark Summit East 2017).
• Ranking => today’s topic!
Questions
• How are datasets ranked?
• Can judgement lists (useful for ranking)
be generated at dataset production
time?
Overview
Rdatasets
Take ES snapshot
Restore ES snapshot
http://bit.ly/2e5H1jL
Overview
Rdatasets
Data Pipelines:
• Extract filename, format, and field description.
• Extract type information.
• Index CSV files.
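The pipeline steps listed above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the pipeline from the talk: the function names (`infer_type`, `extract_metadata`), the three-way type classification, and the sample CSV are all assumptions made for this example. Field descriptions are not recoverable from a bare CSV file, so the sketch only records field names and inferred types.

```python
import csv
import io
from pathlib import PurePosixPath


def _parses_as(cast, values):
    """Return True if every value can be converted with `cast`."""
    try:
        for value in values:
            cast(value)
    except ValueError:
        return False
    return True


def infer_type(values):
    """Classify a column of raw strings as integer, float, or string."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    if _parses_as(int, non_empty):
        return "integer"
    if _parses_as(float, non_empty):
        return "float"
    return "string"


def extract_metadata(path, csv_text):
    """Build an indexable metadata document for one CSV dataset (hypothetical schema)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    # Transpose rows into columns; fall back to empty columns for header-only files.
    columns = list(zip(*body)) if body else [() for _ in header]
    p = PurePosixPath(path)
    return {
        "filename": p.name,
        "format": p.suffix.lstrip(".").lower(),
        "fields": [
            {"name": name, "type": infer_type(list(col))}
            for name, col in zip(header, columns)
        ],
    }


# Illustrative input only; not taken from the talk.
sample = "model,mpg,cyl\nMazda RX4,21.0,6\nDatsun 710,22.8,4\n"
doc = extract_metadata("datasets/mtcars.csv", sample)
```

A document shaped like `doc` is the kind of record that could then be indexed into Elasticsearch alongside the CSV contents.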
Overview
Data Lake
Motivation (1)
• Organizing, indexing and ranking Datasets:
• Produced by individual data pipelines
• On Data Lake(s)
Motivation (2)
• Produce a ranking function for datasets that are generated as
part of running data pipelines.
• Extract “relevance judgements” and use them to bootstrap a
dataset rank model (a posteriori vs. post hoc (Halevy et al.,
2016)).
• In a feedback loop leveraging click-through data on dataset
profile pages.
Organizing, Indexing and Ranking Datasets
[Architecture diagram, built up step by step across several slides. Components: Input data; Data Pipeline; Index dataset features; ES Cluster; Search; Feedback logging; Relevance judgements; Create dataset profile pages (using Tableau JavaScript API); Dataset profile; Team Dashboard; Fetch dataset from Datalake and run Data Pipeline on demand.]
How do you rank datasets?
Ranking datasets
• Extraction of “relevance judgements” can be built into the data pipeline
for specific datasets, immediately after they are generated.
• The judgements are used to produce a ranking function for datasets.
• They are leveraged for training
• and to bootstrap a dataset rank model
• (a posteriori vs. post hoc; Halevy et al., 2016).
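One way to picture judgement extraction as a pipeline step is sketched below. Everything here is an assumption for illustration: the `Judgement` record, the 0 to 4 grade scale, and the keyword-overlap heuristic are hypothetical stand-ins, since the slides do not specify how grades are derived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    grade: int        # 0 (irrelevant) .. 4 (highly relevant); the scale is an assumption
    query: str
    dataset_id: str


def extract_judgements(dataset_id, dataset_terms, team_queries, max_grade=4):
    """Emit one judgement per team query, right after the dataset is produced.

    Hypothetical heuristic: the grade is the number of query words that
    appear among the dataset's metadata terms, capped at `max_grade`.
    """
    terms = {t.lower() for t in dataset_terms}
    judgements = []
    for query in team_queries:
        hits = sum(1 for word in query.lower().split() if word in terms)
        judgements.append(Judgement(min(hits, max_grade), query, dataset_id))
    return judgements


def run_pipeline_step(dataset_id, dataset_terms, team_queries):
    """Stand-in for a pipeline stage: produce the dataset, then judge it."""
    # Dataset generation would happen here (omitted in this sketch).
    return extract_judgements(dataset_id, dataset_terms, team_queries)


# Illustrative identifiers and terms only.
judged = run_pipeline_step(
    "sales_2017",
    ["sales", "region", "revenue"],
    ["revenue region", "war movies"],
)
```

The point of the sketch is the placement: judgements are emitted in the same step that produces the dataset, rather than being collected later by a separate crawl.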
Ranking datasets
• Click-through data provides implicit feedback useful to adjust initial
relevance judgements.
• Leveraging click-through data on dataset profile pages.
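A small sketch of how click-through data might adjust an initial judgement follows. The rule and all thresholds (`high_ctr`, `low_ctr`, `min_impressions`) are hypothetical choices for illustration; the talk does not prescribe a specific adjustment rule.

```python
def adjust_grade(initial_grade, clicks, impressions,
                 high_ctr=0.5, low_ctr=0.05, min_impressions=20, max_grade=4):
    """Nudge an initial grade by one step based on observed click-through rate."""
    if impressions < min_impressions:
        # Too little evidence: keep the bootstrap grade unchanged.
        return initial_grade
    ctr = clicks / impressions
    if ctr >= high_ctr:
        return min(initial_grade + 1, max_grade)
    if ctr <= low_ctr:
        return max(initial_grade - 1, 0)
    return initial_grade


# A dataset profile page clicked in 30 of 40 impressions is promoted one grade.
promoted = adjust_grade(3, clicks=30, impressions=40)
# One that is almost never clicked is demoted one grade.
demoted = adjust_grade(3, clicks=1, impressions=40)
```

Moving the grade by at most one step per update is a deliberately conservative choice here, so that a noisy batch of clicks cannot overturn the bootstrap judgement in one pass.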
[Diagram: Create dataset profile pages; Dataset profile]
A posteriori vs. Post-hoc
• Halevy et al. (2016) advocate finding data in a post-hoc
manner by collecting and aggregating
metadata after datasets are created or updated.
• We propose a so-called “a posteriori” approach
where metadata is generated as part of running
pipelines using Spark.
• In short: Halevy et al. (2016) advocate indexing after the fact;
we prefer indexing immediately after the fact.
Pros (indexing immediately after the fact)
• “Relevance judgements” can be extracted and
leveraged to bootstrap a ranked dataset index.
• In a feedback loop leveraging click-through data on
dataset profile pages.
• More granular metrics available to evaluate
metadata regeneration.
Cons (indexing immediately after the fact)
• Offline model development is disconnected and only
indirectly part of feedback using click-through data.
• Looking at trees instead of the forest.
• Need to replay indexing pipeline when things change
(per data pipeline).
Demo!
Demo Scenario
• Movies represent datasets
• TMDB movie pages represent dataset profile
pages.
• Marketing team (also called WAR team) interested
in War movies (datasets).
Movie 1
https://www.themoviedb.org/movie/7555-rambo
Movie 2
https://www.themoviedb.org/movie/1370-rambo-iii
Movie 3
https://www.themoviedb.org/movie/1369-rambo-first-blood-part-ii
judgement file
Rambo
Rambo III
Rambo: First Blood Part II
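The slide does not show the file's contents, so the sketch below only illustrates what a judgement file for these three movies might look like. It loosely follows the `grade qid:N # document` convention of RankLib-style judgement lists. The document IDs are taken from the TMDB URLs above; the grades, the query id, and the helper name `format_judgements` are assumptions made for this example.

```python
def format_judgements(qid, keywords, graded_docs):
    """Render graded (grade, doc_id, title) tuples as judgement-list text."""
    lines = [f"# qid:{qid}: {keywords}"]
    for grade, doc_id, title in graded_docs:
        lines.append(f"{grade}\tqid:{qid} #\t{doc_id}\t{title}")
    return "\n".join(lines) + "\n"


# Grades are illustrative assumptions; IDs come from the TMDB URLs in the demo slides.
rambo = [
    (4, 7555, "Rambo"),
    (3, 1370, "Rambo III"),
    (3, 1369, "Rambo: First Blood Part II"),
]
judgement_text = format_judgements(1, "rambo", rambo)
```

In the demo scenario each line would correspond to one dataset (movie) judged against one query issued by the team.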
What have we seen?
• How to rank datasets on Elasticsearch using LTR.
• Extract relevance judgements immediately after
datasets are generated in Spark.
• Demo: Dataset Search with Spark and
Elasticsearch LTR.
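For readers who have not used the Elasticsearch Learning to Rank plugin, a search that rescores results with a trained model might be built as sketched below. The plugin's `sltr` query inside a `rescore` block is the mechanism assumed here; the field names (`title`, `description`), the parameter key `keywords`, the model name `dataset_rank_model`, and the helper `build_ltr_search` are hypothetical, and the model would have to be trained and uploaded to the plugin beforehand.

```python
import json


def build_ltr_search(keywords, model_name, window_size=100):
    """Build a search body that rescores the top hits with a stored LTR model."""
    return {
        # First pass: an ordinary text query selects candidate datasets.
        "query": {
            "multi_match": {"query": keywords, "fields": ["title", "description"]}
        },
        # Second pass: the top `window_size` hits are rescored by the model.
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "sltr": {
                        "params": {"keywords": keywords},
                        "model": model_name,
                    }
                }
            },
        },
    }


body = build_ltr_search("war", "dataset_rank_model")
request_json = json.dumps(body, indent=2)
```

Rescoring only a window of top hits keeps the cost of model evaluation bounded, which is why the sketch separates candidate selection from ranking.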
Next Steps (1)
• Describe Datasets in a structured schema.org way
using Data Catalog Vocabulary [2].
• Build a knowledge graph and use GraphX to extract insights.
(Useful e.g. for column concept determination (Deng et al. 2013)).
• Build topic models based on structured Datasets using
Glint to perform scalable topic model extraction in Spark
(Jagerman and Eickhoff, 2016) [1].
[1] https://spark-summit.org/eu-2016/events/glint-an-asynchronous-parameter-server-for-spark/
References
• Alon Y. Halevy, Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang.
Goods: Organizing Google’s datasets. In Fatma Özcan, Georgia Koutrika, and Sam Madden, editors, Proceedings of the 2016
International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01,
2016, pages 795–806. ACM, 2016. ISBN 978-1-4503-3531-7. doi: http://doi.acm.org/10.1145/2882903.2903730.
• Katja Hofmann. Fast and Reliable Online Learning to Rank for Information Retrieval. PhD thesis, Informatics Institute,
University of Amsterdam, May 2013.
• Rolf Jagerman and Carsten Eickhoff. Web-scale topic models in Spark: An asynchronous parameter server. CoRR,
abs/1605.07422, 2016. URL http://arxiv.org/abs/1605.07422.
• Dong Deng, Yu Jiang, Guoliang Li, Jian Li, and Cong Yu. Scalable column concept determination for web tables using large
knowledge bases. PVLDB, 6(13):1606–1617, 2013. URL http://www.vldb.org/pvldb/vol6/p1606-li.pdf.
• Anne Schuth, Harrie Oosterhuis, Shimon Whiteson, and Maarten de Rijke. Multileave gradient descent for fast online learning to
rank. In WSDM 2016: The 9th International Conference on Web Search and Data Mining, pages 457–466. ACM, February 2016.
• Sreeram Balakrishnan, Alon Y. Halevy, Boulos Harb, Hongrae Lee, Jayant Madhavan, Afshin Rostamizadeh, Warren Shen,
Kenneth Wilder, Fei Wu, and Cong Yu. Applying WebTables in practice. In CIDR 2015, Seventh Biennial Conference on
Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings. www.cidrdb.org, 2015.
Q&A
Thank You.
Email: ocastaneda@paypal.com
Twitter: @oscar_castaneda

Mais conteúdo relacionado

Mais procurados

Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and Spotify
Chris Johnson
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Sease
 
[QCon.ai 2019] People You May Know: Fast Recommendations Over Massive Data
[QCon.ai 2019] People You May Know: Fast Recommendations Over Massive Data[QCon.ai 2019] People You May Know: Fast Recommendations Over Massive Data
[QCon.ai 2019] People You May Know: Fast Recommendations Over Massive Data
Sumit Rangwala
 

Mais procurados (20)

ML Infrastracture @ Dropbox
ML Infrastracture @ Dropbox ML Infrastracture @ Dropbox
ML Infrastracture @ Dropbox
 
Learn to Rank search results
Learn to Rank search resultsLearn to Rank search results
Learn to Rank search results
 
Integrating Clickstream Data into Solr for Ranking and Dynamic Facet Optimiza...
Integrating Clickstream Data into Solr for Ranking and Dynamic Facet Optimiza...Integrating Clickstream Data into Solr for Ranking and Dynamic Facet Optimiza...
Integrating Clickstream Data into Solr for Ranking and Dynamic Facet Optimiza...
 
Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and Spotify
 
Learning a Personalized Homepage
Learning a Personalized HomepageLearning a Personalized Homepage
Learning a Personalized Homepage
 
Recent Trends in Personalization at Netflix
Recent Trends in Personalization at NetflixRecent Trends in Personalization at Netflix
Recent Trends in Personalization at Netflix
 
Learning to rank
Learning to rankLearning to rank
Learning to rank
 
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
 
Deep Learning for Semantic Search in E-commerce​
Deep Learning for Semantic Search in E-commerce​Deep Learning for Semantic Search in E-commerce​
Deep Learning for Semantic Search in E-commerce​
 
Deeper Things: How Netflix Leverages Deep Learning in Recommendations and Se...
 Deeper Things: How Netflix Leverages Deep Learning in Recommendations and Se... Deeper Things: How Netflix Leverages Deep Learning in Recommendations and Se...
Deeper Things: How Netflix Leverages Deep Learning in Recommendations and Se...
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
 
Anatomy of an eCommerce Search Engine by Mayur Datar
Anatomy of an eCommerce Search Engine by Mayur DatarAnatomy of an eCommerce Search Engine by Mayur Datar
Anatomy of an eCommerce Search Engine by Mayur Datar
 
Shallow and Deep Latent Models for Recommender System
Shallow and Deep Latent Models for Recommender SystemShallow and Deep Latent Models for Recommender System
Shallow and Deep Latent Models for Recommender System
 
Recent Trends in Personalization at Netflix
Recent Trends in Personalization at NetflixRecent Trends in Personalization at Netflix
Recent Trends in Personalization at Netflix
 
What’s next for deep learning for Search?
What’s next for deep learning for Search?What’s next for deep learning for Search?
What’s next for deep learning for Search?
 
[QCon.ai 2019] People You May Know: Fast Recommendations Over Massive Data
[QCon.ai 2019] People You May Know: Fast Recommendations Over Massive Data[QCon.ai 2019] People You May Know: Fast Recommendations Over Massive Data
[QCon.ai 2019] People You May Know: Fast Recommendations Over Massive Data
 
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
 
Near RealTime search @Flipkart
Near RealTime search @FlipkartNear RealTime search @Flipkart
Near RealTime search @Flipkart
 
Tutorial on sequence aware recommender systems - UMAP 2018
Tutorial on sequence aware recommender systems - UMAP 2018Tutorial on sequence aware recommender systems - UMAP 2018
Tutorial on sequence aware recommender systems - UMAP 2018
 
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018
 

Semelhante a Learning to Rank Datasets for Search with Oscar Castaneda

Semelhante a Learning to Rank Datasets for Search with Oscar Castaneda (20)

Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
DBtrends Semantics 2016
DBtrends Semantics 2016DBtrends Semantics 2016
DBtrends Semantics 2016
 
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
Edgard Marx, Amrapali Zaveri, Diego Moussallem and Sandro Rautenberg | DBtren...
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
Structured data and metadata evaluation methodology for organizations looking...
Structured data and metadata evaluation methodology for organizations looking...Structured data and metadata evaluation methodology for organizations looking...
Structured data and metadata evaluation methodology for organizations looking...
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Changes in Structured Data at Google (SEO Camp 'us in Paris)
Changes in Structured Data at Google (SEO Camp 'us in Paris)Changes in Structured Data at Google (SEO Camp 'us in Paris)
Changes in Structured Data at Google (SEO Camp 'us in Paris)
 
EpiServer find Macaw
EpiServer find MacawEpiServer find Macaw
EpiServer find Macaw
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
Database novelty detection
Database novelty detectionDatabase novelty detection
Database novelty detection
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...
Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...
Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise D...
 
Uncover Your Data Journey: End-To-End Data Lineage For SAP BOBJ And SAP Data ...
Uncover Your Data Journey: End-To-End Data Lineage For SAP BOBJ And SAP Data ...Uncover Your Data Journey: End-To-End Data Lineage For SAP BOBJ And SAP Data ...
Uncover Your Data Journey: End-To-End Data Lineage For SAP BOBJ And SAP Data ...
 
The power of polyglot searching
The power of polyglot searchingThe power of polyglot searching
The power of polyglot searching
 
Power of Polyglot Search
Power of Polyglot SearchPower of Polyglot Search
Power of Polyglot Search
 
Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks
 
Accelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time AnalyticsAccelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time Analytics
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
Big Data Management: What's New, What's Different, and What You Need To Know
Big Data Management: What's New, What's Different, and What You Need To KnowBig Data Management: What's New, What's Different, and What You Need To Know
Big Data Management: What's New, What's Different, and What You Need To Know
 
The BI Sandbox
The BI SandboxThe BI Sandbox
The BI Sandbox
 

Mais de Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Mais de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 

Último (20)

Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...

Learning to Rank Datasets for Search with Oscar Castaneda

  • 1. Oscar Castañeda, Xoom a PayPal Service Learning to Rank Datasets for Search #SAISDS8
  • 2. About • Data Scientist at Xoom, a PayPal service. • Interests: Data Management, Dataset Search, Learning to Rank.
  • 3. Spark cluster with Elasticsearch http://bit.ly/2em6RUK http://bit.ly/2ebM9HO
  • 4. And Indexed RDatasets Spark cluster with Elasticsearch Inside
  • 6. Learning to Rank Datasets for Search!
  • 7. Agenda • Problem Statement and Motivation • Elasticsearch Learning to Rank • Data Pipeline: metadata extraction, judgement list extraction • Demo: Beginnings of a Dataset Search Engine with machine-learned relevance ranking. • Q&A
  • 8. Problem Statement (1) • Although datasets are a key corporate asset, they are generally not given the importance they deserve and, as a result, they are hard to find.
  • 9. Problem Statement (2) • Specifically, teams within organizations have a hard time finding datasets relevant to their function.
  • 13. Topics • Indexing (Spark Summit East 2017). • Ranking => today’s topic!
  • 14. Questions • How are datasets ranked? • Can judgement lists (useful for ranking) be generated at dataset production time?
  • 15. Overview • Rdatasets • Take ES snapshot • Restore ES snapshot http://bit.ly/2e5H1jL
  • 16. Overview • Rdatasets • Data Pipelines: extract filename, format and field description; extract type information; index CSV files.
  • 20. Motivation (1) • Organizing, indexing and ranking datasets: • Produced by individual data pipelines • On Data Lake(s)
  • 23. Motivation (2) • Produce a ranking function for datasets that are generated as part of running data pipelines. • Extract “relevance judgements” and use them to bootstrap a dataset rank model (a posteriori vs. post hoc (Halevy et al., 2016)). • In a feedback loop leveraging click-through data on dataset profile pages.
  • 24-33. Organizing, Indexing and Ranking Datasets [Diagram, built up step by step over slides 24 to 33. Labels: Input data; Data Pipeline; Index dataset features; ES Cluster; Search; Feedback logging; Relevance judgements; Create dataset profile pages; Using Tableau JavaScript API; Dataset profile; Team Dashboard; Fetch dataset from Datalake; Fetch dataset from Datalake and run Data Pipeline on demand.]
  • 34. How do you rank datasets?
  • 39. Ranking datasets • Extraction of “relevance judgements” can be built into the data pipeline for specific datasets immediately after they are generated. • Used to produce a ranking function for datasets. • Leveraged for training • And to bootstrap a dataset rank model • (a posteriori vs. post hoc (Halevy et al., 2016)).
  • 42. Ranking datasets • Click-through data provides implicit feedback useful to adjust initial relevance judgements. • Leveraging click-through data on dataset profile pages. [Diagram labels: Create dataset profile pages; Dataset profile.]
  • 43. A posteriori vs. Post-hoc • Halevy et al. (2016) advocate finding data in a post-hoc manner by collecting and aggregating metadata after datasets are created or updated. • We propose a so-called “a posteriori” approach where metadata is generated as part of running pipelines using Spark.
  • 44. A posteriori vs. Post-hoc • Halevy et al. (2016) advocate indexing after the fact. • We prefer indexing immediately after the fact.
  • 45. Pros (indexing immediately after the fact) • “Relevance judgements” can be extracted and leveraged to bootstrap a ranked dataset index. • In a feedback loop leveraging click-through data on dataset profile pages. • More granular metrics available to evaluate metadata regeneration.
  • 46. Cons (indexing immediately after the fact) • Offline model development is disconnected and only indirectly part of feedback using click-through data. • Looking at trees instead of the forest. • Need to replay indexing pipeline when things change (per data pipeline).
  • 48. Demo Scenario • Movies represent datasets. • TMDB movie pages represent dataset profile pages. • Marketing team (also called WAR team) interested in War movies (datasets).
  • 53. What have we seen? • How to rank datasets on Elasticsearch using LTR. • Extract relevance judgements immediately after datasets are generated in Spark. • Demo: Dataset Search with Spark and Elasticsearch LTR.
  • 54. Next Steps (1) • Describe datasets in a structured schema.org way using Data Catalog Vocabulary [2]. • Build a knowledge graph and use GraphX to extract insights (useful e.g. for column concept determination (Deng et al., 2013)). • Build topic models based on structured datasets using Glint to perform scalable topic model extraction in Spark (Jagerman and Eickhoff, 2016) [1]. [1] https://spark-summit.org/eu-2016/events/glint-an-asynchronous-parameter-server-for-spark/
  • 55. References • Alon Y. Halevy, Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. Goods: Organizing Google’s datasets. In Fatma Özcan, Georgia Koutrika, and Sam Madden, editors, Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 795–806. ACM, 2016. ISBN 978-1-4503-3531-7. doi: http://doi.acm.org/10.1145/2882903.2903730. • Katja Hofmann. Fast and Reliable Online Learning to Rank for Information Retrieval. PhD thesis, Informatics Institute, University of Amsterdam, May 2013. • Rolf Jagerman and Carsten Eickhoff. Web-scale topic models in Spark: An asynchronous parameter server. CoRR, abs/1605.07422, 2016. URL http://arxiv.org/abs/1605.07422. • Dong Deng, Yu Jiang, Guoliang Li, Jian Li, and Cong Yu. Scalable column concept determination for web tables using large knowledge bases. PVLDB, 6(13):1606–1617, 2013. URL http://www.vldb.org/pvldb/vol6/p1606-li.pdf. • Anne Schuth, Harrie Oosterhuis, Shimon Whiteson, and Maarten de Rijke. Multileave gradient descent for fast online learning to rank. In WSDM 2016: The 9th International Conference on Web Search and Data Mining, pages 457–466. ACM, February 2016. • Sreeram Balakrishnan, Alon Y. Halevy, Boulos Harb, Hongrae Lee, Jayant Madhavan, Afshin Rostamizadeh, Warren Shen, Kenneth Wilder, Fei Wu, and Cong Yu. Applying WebTables in practice. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings. www.cidrdb.org, 2015.
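
The "Ranking datasets" slides describe extracting relevance judgements and then adjusting them with click-through data from dataset profile pages. The talk's own extraction code is not part of this transcript, so what follows is only a minimal sketch of the idea in Python. The event layout `(query, dataset_id, clicked)`, the click-through-rate-to-grade mapping, and the function name `build_judgements` are all illustrative assumptions. The output line layout (`grade qid:N # doc`) is modelled on the RankLib-style judgement files commonly used with Elasticsearch Learning to Rank.

```python
from collections import defaultdict


def build_judgements(events, max_grade=4):
    """Turn click-through events into graded judgement lines.

    events: iterable of (query, dataset_id, clicked) tuples, where
    clicked is True if the dataset profile page was opened from the
    result list. This layout is an assumption made for illustration.
    """
    shown = defaultdict(int)   # impressions per (query, dataset)
    clicks = defaultdict(int)  # clicks per (query, dataset)
    for query, dataset_id, clicked in events:
        shown[(query, dataset_id)] += 1
        if clicked:
            clicks[(query, dataset_id)] += 1

    qids = {}   # query text -> numeric query id, assigned in sorted order
    lines = []
    for query, dataset_id in sorted(shown):
        qid = qids.setdefault(query, len(qids) + 1)
        ctr = clicks[(query, dataset_id)] / shown[(query, dataset_id)]
        # Hypothetical mapping: scale the click-through rate to 0..max_grade.
        grade = min(max_grade, int(ctr * max_grade + 0.5))
        lines.append(f"{grade} qid:{qid} # {dataset_id} {query}")
    return lines


if __name__ == "__main__":
    demo_events = [
        ("war", "rambo", True),
        ("war", "rambo", True),
        ("war", "bambi", False),
        ("war", "bambi", False),
    ]
    for line in build_judgements(demo_events):
        print(line)
```

In the demo scenario, where movies stand in for datasets, a War-team query that always leads to a click on one title and never on another would yield the highest and lowest grades respectively. A real pipeline would likely also need to correct for position bias, which this sketch ignores.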