SlideShare uma empresa Scribd logo
1 de 74
Baixar para ler offline
Making Big Data Easy
Data Discovery Lineage Data Like Code
May 15th 2019
Phil Mizrahi | @philippemizrahi | Associate Product Manager, Lyft
Jin Hyuk Chang | @jinhyukchang | Software Engineer, Lyft
go.lyft.com/datadiscoveryslides
Disrupting Data Discovery
Agenda
• Data at Lyft
• Challenges with Data Discovery
• Data Discovery at Lyft
• Amundsen’s Architecture
• What’s next?
3
Data at Lyft
4
5
Core Infra high level architecture
Custom apps
Challenges with Data
Discovery
6
Data is used to make informed decisions
7
Analysts Data Scientists General
Managers
Engineers ExperimentersProduct
Managers
Data-based decision making process:
1. Search & find data
2. Understand the data
3. Perform an analysis/visualisation
4. Share insights and/or take a decision
Make data the heart of every decision
• My first project is to predict the Data Meetup Attendance
• Goal: help the office team make a decision on # of chairs to provide
Hi! I am a n00b Data Scientist!
8
• Ask a friend/boss/coworker
• Ask in a wider slack channel
• Search in the Github repos
Step 1: Search & find data
9
We end up finding a table: hosted_events
that seems to be the right one
• What does this field mean?
‒ Does hosted_events.attendance data include employees?
‒ What does hosted_events.costs include?
• Dig deeper: explore using SQL queries
Step 2: Understand the data
10
SELECT *
FROM default.hosted_events
WHERE ds=’2019-05-15’
LIMIT 100;
Data Scientists spend upto 1/3rd time in Data Discovery
11
Data Discovery
• Data discovery is a problem
because of the lack of understanding
of what data exists, where, who owns
it, how to use it,...
• It is not what our data scientist
should focus on: they should focus
on Analysis work
Data-based decision making process:
1. Search & find data
2. Understand the data
3. Perform an analysis/visualisation
4. Share insights and/or take a decision
Audience for data
discovery
12
User Personas - (1/2)
13
Analysts Data Scientists General
Managers
ExperimentersEngineersProduct
Managers
• Frequent use of data
• Deep to very deep analysis
• Exposure to new datasets
• Creating insights & developing
models
User Personas - (2/2)
14
Power user
- Has been at Lyft for a long
time
- Knows the data environment
well: where to find data, what
it means, how to use it
Pain points:
- Needs to spend a fair amount
of his time sharing his
knowledge with the noob user
- Could become noob user if he
switches teams
Noob user
- Recently joined Lyft
- Needs to ramp up on a lot of
things, wants to start having
impact soon
Pain points:
- Doesn’t know where to start.
Spends his time asking
questions and cmd+F on
github
- Makes mistakes by mis-using
some datasets
3 complementary ways to do Data Discovery
15
Search based
I am looking for a table with data on “cancel rates”
- Where is the table?
- What does it contain?
- Has the analysis I want to perform already been done?
Lineage based
If this event is down, what datasets are going to be impacted?
- Upstream/downstream lineage
- Incidents, SLA misses, Data quality
Network based
I want to check what tables my manager uses
- Ownership information
- Bookmarking
- Usage through query logs
Data discovery at Lyft
16
First person to discover the South Pole -
Norwegian explorer, Roald Amundsen
Landing page optimized for search
Search results ranked on relevance and query activity
How does search work?
19
Relevance - search for “apple” on Google
20
Low relevance High relevance
Popularity - search for “apple” on Google
21
Low popularity High popularity
Striking the balance
22
Relevance Popularity
● Names, Descriptions, Tags, [owners, frequent
users]
● Querying activity
● Dashboarding
● Different weights for automated vs adhoc
querying
Demo
23
Amundsen’s architecture
24
25
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
1. Metadata Service
26
27
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
Amundsen table detail page
Challenge #1
Why choose Graph
database?
29
30
Why Graph database? (1/2)
31
Why Graph database? (2/2)
32
2. Metadata Service
• A thin proxy layer to interact with graph database
‒ Currently Neo4j is the default option for graph backend engine
‒ Work with the community to support Apache Atlas
• Support Rest API for other services pushing / pulling metadata directly
Challenge #2
Why not propagate the
metadata back to source?
33
Why not propagate the metadata back to source
34
2. Databuilder
35
36
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Other
Services
Other Microservices
Metadata Sources
Challenge #3
Various forms of metadata
37
38
Metadata Sources @ Lyft
Metadata - Challenges
• No Standardization: No single data model that fits for all data
resources
‒ A data resource could be a table, an Airflow DAG or a dashboard
• Different Extraction: Each data set metadata is stored and fetched
differently
‒ Hive Table: Stored in Hive metastore
‒ RDBMS(postgres etc): Fetched through DBAPI interface
‒ Github source code: Fetched through git hook
‒ Mode dashboard: Fetched through Mode API
‒ …
39
Databuilder
40
Databuilder in action
41
How is the databuilder orchestrated?
42
Amundsen uses Apache Airflow to orchestrate Databuilder jobs
3. Search service
43
44
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
3. Search Service
• A thin proxy layer to interact with the search backend
‒ Currently it supports Elasticsearch as the search backend.
• Support different search patterns
‒ Normal Search: match records based on relevancy
‒ Category Search: match records first based on data type, then
relevancy
‒ Wildcard Search
45
Challenge #4
How to make the search
result more relevant?
46
How to make the search result more relevant?
47
• Define a search quality metric
‒ Click-Through-Rate (CTR) over top 5 results
• Search behaviour instrumentation is key
• Couple of improvements:
‒ Boost the exact table ranking
‒ Support wildcard search (e.g. event_*)
‒ Support category search (e.g. column: is_line_ride)
4. Frontend service
48
49
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
Search results ranked on relevance and query activity
Amundsen table detail page
What’s next?
52
Amundsen’s impact
• Tremendous success at Lyft
‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service!
‒ 90% penetration among Data Scientists
‒ +30% productivity for the Data science org.
53
Amundsen is Open Source!
• Many organizations have similar problems
‒ github.com/lyft/amundsenfrontendlibrary
• Collaboration with outside companies has started
‒ Used in Production by ING
‒ Working with WeWork, Square, Bang & Olufsen & more
54
Roadmap
PeopleDashboards
Data sets
Phase 1
(Complete)
Phase 2
(In development)
Phase 3
(In Scoping)
Streams Schemas Workflows
More
Metadata
Deeper integration with other
tools (e.g. Superset)
Privacy Governance
Summary
56
Summary
• A metadata-powered discovery tool - like Amundsen - is key to the next
wave of big data applications
• Blog post with more details: go.lyft.com/datadiscoveryblog
• Join the community: github.com/lyft/amundsenfrontendlibrary
57
Phil Mizrahi | @philippemizrahi
Jin Hyuk Chang | @jinhyukchang
Repo at github.com/lyft/amundsenfrontendlibrary
Blog post at go.lyft.com/datadiscoveryblog
Icons under Creative Commons License from https://thenounproject.com/ 58
Backup
59
60
Lyft Data Team
Lyft Data Team
Core Data Infra Streaming Infra Visualization Experimentation BI and Logging ML Infra
All about metadata
61
Serving more metadata about existing resources
Application Context
Existence, description, semantics, etc.
Behavior
How data is created and used over time
Change
How data is changing over time
Metadata: a set of data that describes and gives information about other data
Ground, Joe Hellerstein, Vikram Sreekanti et al.
RISE Lab, UC Berkeley
Buy vs. Build vs. Adopt
63
Compared various existing solutions/open source projects
Criteria / Products Alation Where
Hows
Airbnb
Data
Portal
Cloudera
Navigator
Apache
Atlas
Search based
Lineage based
Network based
Hive/Presto support
Redshift support
Open source (pref.)
Back to mocks...
65
Search results ranked on relevance and query activity
Detailed description and metadata about data resources
Data Preview within the tool
Computed stats about column metadata
Disclaimer: these stats are arbitrary.
Built-in user feedback
Challenge #2
Pull model vs Push model
71
Pull model and Push model
72
Pull Model Push Model
● Amundsen goes to the source and
periodically update the index by pulling from
the source (e.g. database).
● Very efficient on popular platform. Hard to
be extended.
● The external source (e.g. database) pushes
metadata to a message bus which
Amundsen subscribes to.
● Flexible and extensible. Lots of work from
external source.
Crawler
Database Data graph
Scheduler
Database Message
queue
Data graph
We’re Hiring! Apply at www.lyft.com/careers
or email data-recruiting@lyft.com
Data Engineering
Engineering Manager
San Francisco
Software Engineer
San Francisco, Seattle, &
New York City
Data Infrastructure
Engineering Manager
San Francisco
Software Engineer
San Francisco & Seattle
Experimentation
Software Engineer
San Francisco
Streaming
Software Engineer
San Francisco
Observability
Software Engineer
San Francisco
Strata SF 2019
Rate this session
session page on conference website
O’Reilly Events App

Mais conteúdo relacionado

Mais procurados

A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Data Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenDatabricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Better Search Through Query Understanding
Better Search Through Query UnderstandingBetter Search Through Query Understanding
Better Search Through Query UnderstandingDaniel Tunkelang
 
Graph Computing with JanusGraph
Graph Computing with JanusGraphGraph Computing with JanusGraph
Graph Computing with JanusGraphJason Plurad
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowDatabricks
 
DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...
DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...
DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...aiuy
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
State of the Trino Project
State of the Trino ProjectState of the Trino Project
State of the Trino ProjectMartin Traverso
 
Real-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache PinotReal-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache PinotXiang Fu
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineeringThang Bui (Bob)
 
Apache Spark's Built-in File Sources in Depth
Apache Spark's Built-in File Sources in DepthApache Spark's Built-in File Sources in Depth
Apache Spark's Built-in File Sources in DepthDatabricks
 
Intro to Neo4j and Graph Databases
Intro to Neo4j and Graph DatabasesIntro to Neo4j and Graph Databases
Intro to Neo4j and Graph DatabasesNeo4j
 

Mais procurados (20)

A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Reliable and Scalable Data Ingestion at Airbnb
Reliable and Scalable Data Ingestion at AirbnbReliable and Scalable Data Ingestion at Airbnb
Reliable and Scalable Data Ingestion at Airbnb
 
Data Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with Amundsen
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Better Search Through Query Understanding
Better Search Through Query UnderstandingBetter Search Through Query Understanding
Better Search Through Query Understanding
 
Graph Computing with JanusGraph
Graph Computing with JanusGraphGraph Computing with JanusGraph
Graph Computing with JanusGraph
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
 
DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...
DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...
DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
State of the Trino Project
State of the Trino ProjectState of the Trino Project
State of the Trino Project
 
Learn to Rank search results
Learn to Rank search resultsLearn to Rank search results
Learn to Rank search results
 
Data lake ppt
Data lake pptData lake ppt
Data lake ppt
 
Real-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache PinotReal-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache Pinot
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
Apache Spark's Built-in File Sources in Depth
Apache Spark's Built-in File Sources in DepthApache Spark's Built-in File Sources in Depth
Apache Spark's Built-in File Sources in Depth
 
Intro to Neo4j and Graph Databases
Intro to Neo4j and Graph DatabasesIntro to Neo4j and Graph Databases
Intro to Neo4j and Graph Databases
 

Semelhante a Meetup SF - Amundsen

How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryMark Grover
 
Large scale computing
Large scale computing Large scale computing
Large scale computing Bhupesh Bansal
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo
 
Building Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the CloudBuilding Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the CloudPeter Haase
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data ManagementeXascale Infolab
 
Neo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j
 
The original vision of Nutch, 14 years later: Building an open source search ...
The original vision of Nutch, 14 years later: Building an open source search ...The original vision of Nutch, 14 years later: Building an open source search ...
The original vision of Nutch, 14 years later: Building an open source search ...Sylvain Zimmer
 
Ordering the chaos: Creating websites with imperfect data
Ordering the chaos: Creating websites with imperfect dataOrdering the chaos: Creating websites with imperfect data
Ordering the chaos: Creating websites with imperfect dataAndy Stretton
 
O'Reilly Strata: Distilling Data Exhaust
O'Reilly Strata: Distilling Data ExhaustO'Reilly Strata: Distilling Data Exhaust
O'Reilly Strata: Distilling Data ExhaustPeter Skomoroch
 
DMPTool for IMLS #WebWise14
DMPTool for IMLS #WebWise14DMPTool for IMLS #WebWise14
DMPTool for IMLS #WebWise14Carly Strasser
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media Analysisikanow
 
Demystifying Systems for Interactive and Real-time Analytics
Demystifying Systems for Interactive and Real-time AnalyticsDemystifying Systems for Interactive and Real-time Analytics
Demystifying Systems for Interactive and Real-time AnalyticsDataWorks Summit
 
How Celtra Optimizes its Advertising Platform with Databricks
How Celtra Optimizes its Advertising Platformwith DatabricksHow Celtra Optimizes its Advertising Platformwith Databricks
How Celtra Optimizes its Advertising Platform with DatabricksGrega Kespret
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisOpen Analytics
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 CareerBuilder.com
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopNikolai Avteniev
 

Semelhante a Meetup SF - Amundsen (20)

How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data Discovery
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
 
Building Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the CloudBuilding Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the Cloud
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
 
Neo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperative
 
The original vision of Nutch, 14 years later: Building an open source search ...
The original vision of Nutch, 14 years later: Building an open source search ...The original vision of Nutch, 14 years later: Building an open source search ...
The original vision of Nutch, 14 years later: Building an open source search ...
 
Ordering the chaos: Creating websites with imperfect data
Ordering the chaos: Creating websites with imperfect dataOrdering the chaos: Creating websites with imperfect data
Ordering the chaos: Creating websites with imperfect data
 
O'Reilly Strata: Distilling Data Exhaust
O'Reilly Strata: Distilling Data ExhaustO'Reilly Strata: Distilling Data Exhaust
O'Reilly Strata: Distilling Data Exhaust
 
DMPTool for IMLS #WebWise14
DMPTool for IMLS #WebWise14DMPTool for IMLS #WebWise14
DMPTool for IMLS #WebWise14
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media Analysis
 
Demystifying Systems for Interactive and Real-time Analytics
Demystifying Systems for Interactive and Real-time AnalyticsDemystifying Systems for Interactive and Real-time Analytics
Demystifying Systems for Interactive and Real-time Analytics
 
How Celtra Optimizes its Advertising Platform with Databricks
How Celtra Optimizes its Advertising Platformwith DatabricksHow Celtra Optimizes its Advertising Platformwith Databricks
How Celtra Optimizes its Advertising Platform with Databricks
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media Analysis
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 
Neo4j in Depth
Neo4j in DepthNeo4j in Depth
Neo4j in Depth
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On Hadoop
 

Último

12. Stairs by U Nyi Hla ngae from Myanmar.pdf
12. Stairs by U Nyi Hla ngae from Myanmar.pdf12. Stairs by U Nyi Hla ngae from Myanmar.pdf
12. Stairs by U Nyi Hla ngae from Myanmar.pdftpo482247
 
NIPORT Home Economics Questions Solution 2024.pdf
NIPORT Home Economics Questions Solution 2024.pdfNIPORT Home Economics Questions Solution 2024.pdf
NIPORT Home Economics Questions Solution 2024.pdfMohonDas
 
zomato data mining datasets for quality prefernece and conntrol.pptx
zomato data mining  datasets for quality prefernece and conntrol.pptxzomato data mining  datasets for quality prefernece and conntrol.pptx
zomato data mining datasets for quality prefernece and conntrol.pptxPratikMhatre39
 
Searching and Sorting Algorithms
Searching and Sorting AlgorithmsSearching and Sorting Algorithms
Searching and Sorting AlgorithmsAshutosh Satapathy
 
Field Report on present condition of Ward 1 and Ward 2 of Pabna Municipality
Field Report on present condition of Ward 1 and Ward 2 of Pabna MunicipalityField Report on present condition of Ward 1 and Ward 2 of Pabna Municipality
Field Report on present condition of Ward 1 and Ward 2 of Pabna MunicipalityMorshed Ahmed Rahath
 
Introduction to Machine Learning Unit-2 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-2 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-2 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-2 Notes for II-II Mechanical EngineeringC Sai Kiran
 
0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx
0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx
0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptxssuser886c55
 
First Review Group 1 PPT.pptx with slide
First Review Group 1 PPT.pptx with slideFirst Review Group 1 PPT.pptx with slide
First Review Group 1 PPT.pptx with slideMonika860882
 
A brief about Jeypore Sub-station Presentation
A brief about Jeypore Sub-station PresentationA brief about Jeypore Sub-station Presentation
A brief about Jeypore Sub-station PresentationJeyporess2021
 
Support nodes for large-span coal storage structures
Support nodes for large-span coal storage structuresSupport nodes for large-span coal storage structures
Support nodes for large-span coal storage structureswendy cai
 
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024California Asphalt Pavement Association
 
Injection Power Cycle - The most efficient power cycle
Injection Power Cycle - The most efficient power cycleInjection Power Cycle - The most efficient power cycle
Injection Power Cycle - The most efficient power cyclemarijomiljkovic1
 
autonomous_vehicle_working_paper_01072020-_508_compliant.pdf
autonomous_vehicle_working_paper_01072020-_508_compliant.pdfautonomous_vehicle_working_paper_01072020-_508_compliant.pdf
autonomous_vehicle_working_paper_01072020-_508_compliant.pdfPandurangGurakhe
 
presentation by faizan[1] [Read-Only].pptx
presentation by faizan[1] [Read-Only].pptxpresentation by faizan[1] [Read-Only].pptx
presentation by faizan[1] [Read-Only].pptxkhfaizan534
 
Road to GDSC (Become the next GDSC lead)
Road to GDSC (Become the next GDSC lead)Road to GDSC (Become the next GDSC lead)
Road to GDSC (Become the next GDSC lead)GDSCNiT
 
Investigating the Efficiency of Drinking Water Treatment Sludge and Iron-Base...
Investigating the Efficiency of Drinking Water Treatment Sludge and Iron-Base...Investigating the Efficiency of Drinking Water Treatment Sludge and Iron-Base...
Investigating the Efficiency of Drinking Water Treatment Sludge and Iron-Base...J. Agricultural Machinery
 
Advanced Additive Manufacturing by Sumanth A.pptx
Advanced Additive Manufacturing by Sumanth A.pptxAdvanced Additive Manufacturing by Sumanth A.pptx
Advanced Additive Manufacturing by Sumanth A.pptxSumanth A
 
Navigating Process Safety through Automation and Digitalization in the Oil an...
Navigating Process Safety through Automation and Digitalization in the Oil an...Navigating Process Safety through Automation and Digitalization in the Oil an...
Navigating Process Safety through Automation and Digitalization in the Oil an...soginsider
 

Último (20)

12. Stairs by U Nyi Hla ngae from Myanmar.pdf
12. Stairs by U Nyi Hla ngae from Myanmar.pdf12. Stairs by U Nyi Hla ngae from Myanmar.pdf
12. Stairs by U Nyi Hla ngae from Myanmar.pdf
 
NIPORT Home Economics Questions Solution 2024.pdf
NIPORT Home Economics Questions Solution 2024.pdfNIPORT Home Economics Questions Solution 2024.pdf
NIPORT Home Economics Questions Solution 2024.pdf
 
zomato data mining datasets for quality prefernece and conntrol.pptx
zomato data mining  datasets for quality prefernece and conntrol.pptxzomato data mining  datasets for quality prefernece and conntrol.pptx
zomato data mining datasets for quality prefernece and conntrol.pptx
 
Searching and Sorting Algorithms
Searching and Sorting AlgorithmsSearching and Sorting Algorithms
Searching and Sorting Algorithms
 
Field Report on present condition of Ward 1 and Ward 2 of Pabna Municipality
Field Report on present condition of Ward 1 and Ward 2 of Pabna MunicipalityField Report on present condition of Ward 1 and Ward 2 of Pabna Municipality
Field Report on present condition of Ward 1 and Ward 2 of Pabna Municipality
 
Introduction to Machine Learning Unit-2 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-2 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-2 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-2 Notes for II-II Mechanical Engineering
 
0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx
0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx
0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx
 
First Review Group 1 PPT.pptx with slide
First Review Group 1 PPT.pptx with slideFirst Review Group 1 PPT.pptx with slide
First Review Group 1 PPT.pptx with slide
 
A brief about Jeypore Sub-station Presentation
A brief about Jeypore Sub-station PresentationA brief about Jeypore Sub-station Presentation
A brief about Jeypore Sub-station Presentation
 
Support nodes for large-span coal storage structures
Support nodes for large-span coal storage structuresSupport nodes for large-span coal storage structures
Support nodes for large-span coal storage structures
 
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
 
Injection Power Cycle - The most efficient power cycle
Injection Power Cycle - The most efficient power cycleInjection Power Cycle - The most efficient power cycle
Injection Power Cycle - The most efficient power cycle
 
Update on the latest research with regard to RAP
Update on the latest research with regard to RAPUpdate on the latest research with regard to RAP
Update on the latest research with regard to RAP
 
autonomous_vehicle_working_paper_01072020-_508_compliant.pdf
autonomous_vehicle_working_paper_01072020-_508_compliant.pdfautonomous_vehicle_working_paper_01072020-_508_compliant.pdf
autonomous_vehicle_working_paper_01072020-_508_compliant.pdf
 
presentation by faizan[1] [Read-Only].pptx
presentation by faizan[1] [Read-Only].pptxpresentation by faizan[1] [Read-Only].pptx
presentation by faizan[1] [Read-Only].pptx
 
Road to GDSC (Become the next GDSC lead)
Road to GDSC (Become the next GDSC lead)Road to GDSC (Become the next GDSC lead)
Road to GDSC (Become the next GDSC lead)
 
Hydraulic Loading System - Neometrix Defence
Hydraulic Loading System - Neometrix DefenceHydraulic Loading System - Neometrix Defence
Hydraulic Loading System - Neometrix Defence
 
Investigating the Efficiency of Drinking Water Treatment Sludge and Iron-Base...
Investigating the Efficiency of Drinking Water Treatment Sludge and Iron-Base...Investigating the Efficiency of Drinking Water Treatment Sludge and Iron-Base...
Investigating the Efficiency of Drinking Water Treatment Sludge and Iron-Base...
 
Advanced Additive Manufacturing by Sumanth A.pptx
Advanced Additive Manufacturing by Sumanth A.pptxAdvanced Additive Manufacturing by Sumanth A.pptx
Advanced Additive Manufacturing by Sumanth A.pptx
 
Navigating Process Safety through Automation and Digitalization in the Oil an...
Navigating Process Safety through Automation and Digitalization in the Oil an...Navigating Process Safety through Automation and Digitalization in the Oil an...
Navigating Process Safety through Automation and Digitalization in the Oil an...
 

Meetup SF - Amundsen

  • 1. Making Big Data Easy Data Discovery Lineage Data Like Code
  • 2. May 15th 2019 Phil Mizrahi | @philippemizrahi | Associate Product Manager, Lyft Jin Hyuk Chang | @jinhyukchang | Software Engineer, Lyft go.lyft.com/datadiscoveryslides Disrupting Data Discovery
  • 3. Agenda • Data at Lyft • Challenges with Data Discovery • Data Discovery at Lyft • Amundsen’s Architecture • What’s next? 3
  • 5. 5 Core Infra high level architecture Custom apps
  • 7. Data is used to make informed decisions 7 Analysts Data Scientists General Managers Engineers ExperimentersProduct Managers Data-based decision making process: 1. Search & find data 2. Understand the data 3. Perform an analysis/visualisation 4. Share insights and/or take a decision Make data the heart of every decision
  • 8. • My first project is to predict the Data Meetup Attendance • Goal: help the office team make a decision on # of chairs to provide Hi! I am a n00b Data Scientist! 8
  • 9. • Ask a friend/boss/coworker • Ask in a wider slack channel • Search in the Github repos Step 1: Search & find data 9 We end up finding a table: hosted_events that seems to be the right one
  • 10. • What does this field mean? ‒ Does hosted_events.attendance data include employees? ‒ What does hosted_events.costs include? • Dig deeper: explore using SQL queries Step 2: Understand the data 10 SELECT * FROM default.hosted_events WHERE ds=’2019-05-15’ LIMIT 100;
  • 11. Data Scientists spend upto 1/3rd time in Data Discovery 11 Data Discovery • Data discovery is a problem because of the lack of understanding of what data exists, where, who owns it, how to use it,... • It is not what our data scientist should focus on: they should focus on Analysis work Data-based decision making process: 1. Search & find data 2. Understand the data 3. Perform an analysis/visualisation 4. Share insights and/or take a decision
  • 13. User Personas - (1/2) 13 Analysts Data Scientists General Managers ExperimentersEngineersProduct Managers • Frequent use of data • Deep to very deep analysis • Exposure to new datasets • Creating insights & developing models
  • 14. User Personas - (2/2) 14 Power user - Has been at Lyft for a long time - Knows the data environment well: where to find data, what it means, how to use it Pain points: - Needs to spend a fair amount of his time sharing his knowledge with the noob user - Could become noob user if he switches teams Noob user - Recently joined Lyft - Needs to ramp up on a lot of things, wants to start having impact soon Pain points: - Doesn’t know where to start. Spends his time asking questions and cmd+F on github - Makes mistakes by mis-using some datasets
  • 15. 3 complementary ways to do Data Discovery 15 Search based I am looking for a table with data on “cancel rates” - Where is the table? - What does it contain? - Has the analysis I want to perform already been done? Lineage based If this event is down, what datasets are going to be impacted? - Upstream/downstream lineage - Incidents, SLA misses, Data quality Network based I want to check what tables my manager uses - Ownership information - Bookmarking - Usage through query logs
  • 16. Data discovery at Lyft 16 First person to discover the South Pole - Norwegian explorer, Roald Amundsen
  • 18. Search results ranked on relevance and query activity
  • 19. How does search work? 19
  • 20. Relevance - search for “apple” on Google 20 Low relevance High relevance
  • 21. Popularity - search for “apple” on Google 21 Low popularity High popularity
  • 22. Striking the balance 22 Relevance Popularity ● Names, Descriptions, Tags, [owners, frequent users] ● Querying activity ● Dashboarding ● Different weights for automated vs adhoc querying
  • 25. 25 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 27. 27 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 29. Challenge #1 Why choose Graph database? 29
  • 32. 32 2. Metadata Service • A thin proxy layer to interact with graph database ‒ Currently Neo4j is the default option for graph backend engine ‒ Work with the community to support Apache Atlas • Support Rest API for other services pushing / pulling metadata directly
  • 33. Challenge #2 Why not propagate the metadata back to source? 33
  • 34. Why not propagate the metadata back to source 34
  • 36. 36 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Other Services Other Microservices Metadata Sources
  • 37. Challenge #3 Various forms of metadata 37
  • 39. Metadata - Challenges • No Standardization: No single data model that fits for all data resources ‒ A data resource could be a table, an Airflow DAG or a dashboard • Different Extraction: Each data set metadata is stored and fetched differently ‒ Hive Table: Stored in Hive metastore ‒ RDBMS(postgres etc): Fetched through DBAPI interface ‒ Github source code: Fetched through git hook ‒ Mode dashboard: Fetched through Mode API ‒ … 39
  • 42. How is the databuilder orchestrated? 42 Amundsen uses Apache Airflow to orchestrate Databuilder jobs
  • 44. 44 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 45. 3. Search Service • A thin proxy layer to interact with the search backend ‒ Currently it supports Elasticsearch as the search backend. • Support different search patterns ‒ Normal Search: match records based on relevancy ‒ Category Search: match records first based on data type, then relevancy ‒ Wildcard Search 45
  • 46. Challenge #4 How to make the search result more relevant? 46
  • 47. How to make the search result more relevant? 47 • Define a search quality metric ‒ Click-Through-Rate (CTR) over top 5 results • Search behaviour instrumentation is key • Couple of improvements: ‒ Boost the exact table ranking ‒ Support wildcard search (e.g. event_*) ‒ Support category search (e.g. column: is_line_ride)
  • 49. 49 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 50. Search results ranked on relevance and query activity
  • 53. Amundsen’s impact • Tremendous success at Lyft ‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service! ‒ 90% penetration among Data Scientists ‒ +30% productivity for the Data science org. 53
  • 54. Amundsen is Open Source! • Many organizations have similar problems ‒ github.com/lyft/amundsenfrontendlibrary • Collaboration with outside companies has started ‒ Used in Production by ING ‒ Working with WeWork, Square, Bang & Olufsen & more 54
  • 55. Roadmap PeopleDashboards Data sets Phase 1 (Complete) Phase 2 (In development) Phase 3 (In Scoping) Streams Schemas Workflows More Metadata Deeper integration with other tools (e.g. Superset) Privacy Governance
  • 57. Summary • A metadata-powered discovery tool - like Amundsen - is key to the next wave of big data applications • Blog post with more details: go.lyft.com/datadiscoveryblog • Join the community: github.com/lyft/amundsenfrontendlibrary 57
  • 58. Phil Mizrahi | @philippemizrahi Jin Hyuk Chang | @jinhyukchang Repo at github.com/lyft/amundsenfrontendlibrary Blog post at go.lyft.com/datadiscoveryblog Icons under Creative Commons License from https://thenounproject.com/ 58
  • 60. 60 Lyft Data Team Lyft Data Team Core Data Infra Streaming Infra Visualization Experimentation BI and Logging ML Infra
  • 62. Serving more metadata about existing resources Application Context Existence, description, semantics, etc. Behavior How data is created and used over time Change How data is changing over time Metadata: a set of data that describes and gives information about other data Ground, Joe Hellerstein, Vikram Sreekanti et al. RISE Lab, UC Berkeley
  • 63. Buy vs. Build vs. Adopt 63
  • 64. Compared various existing solutions/open source projects Criteria / Products Alation Where Hows Airbnb Data Portal Cloudera Navigator Apache Atlas Search based Lineage based Network based Hive/Presto support Redshift support Open source (pref.)
  • 66. Search results ranked on relevance and query activity
  • 67. Detailed description and metadata about data resources
  • 69. Computed stats about column metadata Disclaimer: these stats are arbitrary.
  • 71. Challenge #2 Pull model vs Push model 71
  • 72. Pull model and Push model 72 Pull Model Push Model ● Amundsen goes to the source and periodically update the index by pulling from the source (e.g. database). ● Very efficient on popular platform. Hard to be extended. ● The external source (e.g. database) pushes metadata to a message bus which Amundsen subscribes to. ● Flexible and extensible. Lots of work from external source. Crawler Database Data graph Scheduler Database Message queue Data graph
  • 73. We’re Hiring! Apply at www.lyft.com/careers or email data-recruiting@lyft.com Data Engineering Engineering Manager San Francisco Software Engineer San Francisco, Seattle, & New York City Data Infrastructure Engineering Manager San Francisco Software Engineer San Francisco & Seattle Experimentation Software Engineer San Francisco Streaming Software Engineer San Francisco Observability Software Engineer San Francisco
  • 74. Strata SF 2019 Rate this session session page on conference website O’Reilly Events App