SlideShare uma empresa Scribd logo
1 de 69
Baixar para ler offline
March 2019
Mark Grover | @mark_grover | Product Management, Lyft
Tao Feng | @feng-tao | Software Engineer, Lyft
Disrupting Data Discovery
Agenda
• Data at Lyft
• Challenges with Data Discovery
• Data Discovery at Lyft
• Architecture
• Summary
2
Data platform users
3
Data Modelers Analysts Data Scientists General
Managers
Data Platform
Engineers ExperimentersProduct
Managers
5
Core Infra high level architecture
Custom apps
Data Discovery
6
• My first project is to analyze and predict Strata Attendance
• Where is the data?
• What does it mean?
Hi! I am a n00b Data Scientist!
7
• Option 1: Phone a friend!
• Option 2: Github search
Status quo
8
• What does this field mean?
‒ Does attendance data include employees?
‒ Does it include revenue?
• Let me dig in and understand
Understand the context
9
Explore
SELECT
*
FROM
default.my_table
WHERE ds=’2018-01-01’
LIMIT 100;
Exploring with SELECT * is EVIL
1. Lack of productivity for data scientists
2. Increased load on the databases
11
Data Scientists spend upto 1/3rd time in Data Discovery...
12
• Data discovery
‒ Lack of
understanding of
what data exists,
where, who owns it,
who uses it, and how
to request access.
Audience for data
discovery
13
Data Discovery - User personas
14
Data Modelers Analysts Data Scientists General
Managers
Data Platform
Engineers ExperimentersProduct
Managers
3 Data Scientist personas
Power user
● All info in their head
● Get interrupted a lot
due to questions
● Lost
● Ask “power users” a
lot of questions
● Dependencies
landing on time
● Communicating with
stakeholders
Noob user Manager
Search based Lineage based Network based
Where is the
table/dashboard for X?
What does it contain?
I am changing a data
model, who are the owner
and most common users?
I want to follow a power
user in my team.
Does this analysis already
exist?
This table’s delivery was
delayed today, I want to
notify everyone
downstream.
I want to bookmark tables of
interest and get a feed of
data delay, schema change,
incidents.
Data Discovery answers 3 kinds of questions
Buy vs. Build vs. Adopt
17
Compared various existing solutions/open source projects
Criteria / Products Alation Where
Hows
Airbnb
Data
Portal
Cloudera
Navigator
Apache
Atlas
Search based
Lineage based
Network based
Hive/Presto support
Redshift support
Open source (pref.)
Meet Amundsen
19
First person to discover the South Pole -
Norwegian explorer, Roald Amundsen
Landing page optimized for search
Search results ranked on relevance and query activity
How does search work?
22
Relevance - search for “apple” on Google
23
Low relevance High relevance
Popularity - search for “apple” on Google
24
Low popularity High popularity
Striking the balance
25
Relevance Popularity
● Names, Descriptions, Tags, [owners, frequent
users]
● Querying activity
● Dashboarding
● Different weights for automated vs adhoc
querying
Back to mocks...
26
Search results ranked on relevance and query activity
Detailed description and metadata about data resources
Data Preview within the tool
Computed stats about column metadata
Disclaimer: these stats are arbitrary.
Built-in user feedback
Amundsen’s architecture
32
33
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
1. Frontend Service
34
35
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
Detailed description and metadata about data resources
2. Metadata Service
37
38
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
39
2. Metadata Service
• A thin proxy layer to interact with graph database
‒ Currently Neo4j is the default option for graph backend engine.
‒ Work with the community to support Apache Atlas
• Support Rest API for other services pushing / pulling metadata directly
Trade Off #1
Why choose Graph
database
40
Why Graph database?
Why Graph database?
Trade Off #2
Why not propagate the
metadata back to source
43
Why not propagate the metadata back to source
44
Why not propagate the metadata back to source
45
?
?
3. Search Service
46
47
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
3. Search Service
• Support REST API for building indexes
• A thin proxy layer to interact with the search backend
‒ Currently it supports Elasticsearch as backend.
• Support different search patterns
‒ Normal Search: match records based on relevancy
‒ Category Search: match records first based on data type, then relevancy
‒ Wildcard Search
48
Challenge #1
How to make the search
result more relevant?
49
How to make the search result more relevant?
50
• Define a search quality metric
‒ Click-Through-Rate (CTR) over top 5 results
• Search behaviour instrumentation is key
• Couple of improvements:
‒ Boost the exact table ranking
‒ Support wildcard search
‒ Support category search (e.g. “column: is_line_ride”)
4. Data Builder
51
52
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Other
Services
Other Microservices
Metadata Sources
Challenge #1
Various forms of metadata
53
54
Metadata Sources @ Lyft
Metadata - Challenges
• Standardization: No single data model that fits for all data resources
‒ A data resource could be a table, an Airflow DAG or a dashboard
• Extraction: Each data set metadata is stored and fetched differently,
‒ Hive Table: Stored in Hive metastore
‒ RDBMS(postgres etc): Fetched through DBAPI interface
‒ Github source code: Fetched through git hook
‒ Mode dashboard: Fetched through Mode API
‒ …
55
Challenge #2
Pull model vs Push model
56
Pull model vs. Push model
57
Pull Model Push Model
● Periodically update the index by pulling from
the system (e.g. database) via crawlers.
● The system (e.g. database) pushes
metadata to a message bus which
downstream subscribes to.
Crawler
Database Data graph
Scheduler
Database Message
queue
Data graph
4. Databuilder
Databuilder in action
How are we building data? Databuilder
How is databuilder orchestrated?
Amundsen uses Apache Airflow to orchestrate
Databuilder jobs
What’s next?
64
Amundsen seems to be more useful than what we thought
• Tremendous success at Lyft
‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service!
• Many organizations have similar problems
‒ Collaborating with ING, WeWork and more
‒ We plan to announce open source soon
65
Impact - Amundsen at Lyft
66
Beta release
(internal)
Generally Available
(GA) release
Alpha release
Adding more kinds of data resources
PeopleDashboardsData sets
Phase 1
(Complete)
Phase 2
(In development)
Phase 3
(In Scoping)
Streams Schemas Workflows
Serving more metadata about existing resources
Application Context
Existence, description, semantics, etc.
Behavior
How data is created and used over time
Change
How data is changing over time
Summary
69
Summary
• Data Discovery making data scientists unproductive
• 3 types of data discovery - search, lineage and network based
• Amundsen: Data graph for all data
• Blog post with more details: go.lyft.com/datadiscoveryblog
70
Mark Grover | @mark_grover
Tao Feng | @feng-tao
Blog post at go.lyft.com/datadiscoveryblog
Icons under Creative Commons License from https://thenounproject.com/
71
Backup
72

Mais conteúdo relacionado

Mais procurados

Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Presto best practices for Cluster admins, data engineers and analysts
Presto best practices for Cluster admins, data engineers and analystsPresto best practices for Cluster admins, data engineers and analysts
Presto best practices for Cluster admins, data engineers and analystsShubham Tagra
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
stackconf 2022: Introduction to Vector Search with Weaviate
stackconf 2022: Introduction to Vector Search with Weaviatestackconf 2022: Introduction to Vector Search with Weaviate
stackconf 2022: Introduction to Vector Search with WeaviateNETWAYS
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overviewJames Serra
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshJeffrey T. Pollock
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Image models infrastructure at OLX
Image models infrastructure at OLXImage models infrastructure at OLX
Image models infrastructure at OLXAlexey Grigorev
 
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security ArchitectureOwen O'Malley
 
Airflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceAirflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceTao Feng
 
MLflow Model Serving
MLflow Model ServingMLflow Model Serving
MLflow Model ServingDatabricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Guido Schmutz
 

Mais procurados (20)

Big data on aws
Big data on awsBig data on aws
Big data on aws
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Presto best practices for Cluster admins, data engineers and analysts
Presto best practices for Cluster admins, data engineers and analystsPresto best practices for Cluster admins, data engineers and analysts
Presto best practices for Cluster admins, data engineers and analysts
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
Druid
DruidDruid
Druid
 
stackconf 2022: Introduction to Vector Search with Weaviate
stackconf 2022: Introduction to Vector Search with Weaviatestackconf 2022: Introduction to Vector Search with Weaviate
stackconf 2022: Introduction to Vector Search with Weaviate
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Image models infrastructure at OLX
Image models infrastructure at OLXImage models infrastructure at OLX
Image models infrastructure at OLX
 
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security Architecture
 
Airflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceAirflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conference
 
MLflow Model Serving
MLflow Model ServingMLflow Model Serving
MLflow Model Serving
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?
 

Semelhante a Strata sf - Amundsen presentation

Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryMark Grover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Large scale computing
Large scale computing Large scale computing
Large scale computing Bhupesh Bansal
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Perficient, Inc.
 
Building Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the CloudBuilding Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the CloudPeter Haase
 
Neo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 CareerBuilder.com
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of MusicLars Albertsson
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data ManagementeXascale Infolab
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewAbhishek Roy
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about SparkGiivee The
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About Jesus Rodriguez
 
Neo4j GraphDay Seattle- Sept19- in the enterprise
Neo4j GraphDay Seattle- Sept19-  in the enterpriseNeo4j GraphDay Seattle- Sept19-  in the enterprise
Neo4j GraphDay Seattle- Sept19- in the enterpriseNeo4j
 
Relevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search TechnologiesRelevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search Technologiesenterprisesearchmeetup
 

Semelhante a Strata sf - Amundsen presentation (20)

Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data Discovery
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
 
Building Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the CloudBuilding Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the Cloud
 
Neo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperative
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of Music
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about Spark
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Neo4j GraphDay Seattle- Sept19- in the enterprise
Neo4j GraphDay Seattle- Sept19-  in the enterpriseNeo4j GraphDay Seattle- Sept19-  in the enterprise
Neo4j GraphDay Seattle- Sept19- in the enterprise
 
Neo4j in Depth
Neo4j in DepthNeo4j in Depth
Neo4j in Depth
 
Relevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search TechnologiesRelevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search Technologies
 

Mais de Tao Feng

Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyftTao Feng
 
Odp - On demand profiler (ICPE 2018)
Odp - On demand profiler (ICPE 2018)Odp - On demand profiler (ICPE 2018)
Odp - On demand profiler (ICPE 2018)Tao Feng
 
Effective Multi-stream Joining in Apache Samza Framework
Effective Multi-stream Joining in Apache Samza FrameworkEffective Multi-stream Joining in Apache Samza Framework
Effective Multi-stream Joining in Apache Samza FrameworkTao Feng
 
A memory capacity model for high performing data-filtering applications in Sa...
A memory capacity model for high performing data-filtering applications in Sa...A memory capacity model for high performing data-filtering applications in Sa...
A memory capacity model for high performing data-filtering applications in Sa...Tao Feng
 
Samza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshopSamza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshopTao Feng
 
Benchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per nodeBenchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per nodeTao Feng
 

Mais de Tao Feng (6)

Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyft
 
Odp - On demand profiler (ICPE 2018)
Odp - On demand profiler (ICPE 2018)Odp - On demand profiler (ICPE 2018)
Odp - On demand profiler (ICPE 2018)
 
Effective Multi-stream Joining in Apache Samza Framework
Effective Multi-stream Joining in Apache Samza FrameworkEffective Multi-stream Joining in Apache Samza Framework
Effective Multi-stream Joining in Apache Samza Framework
 
A memory capacity model for high performing data-filtering applications in Sa...
A memory capacity model for high performing data-filtering applications in Sa...A memory capacity model for high performing data-filtering applications in Sa...
A memory capacity model for high performing data-filtering applications in Sa...
 
Samza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshopSamza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshop
 
Benchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per nodeBenchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per node
 

Último

introduction to python, fundamentals and basics
introduction to python, fundamentals and basicsintroduction to python, fundamentals and basics
introduction to python, fundamentals and basicsKNaveenKumarECE
 
Introduction to Machine Learning Unit-2 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-2 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-2 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-2 Notes for II-II Mechanical EngineeringC Sai Kiran
 
عناصر نباتية PDF.pdf architecture engineering
عناصر نباتية PDF.pdf architecture engineeringعناصر نباتية PDF.pdf architecture engineering
عناصر نباتية PDF.pdf architecture engineeringmennamohamed200y
 
Conventional vs Modern method (Philosophies) of Tunneling-re.pptx
Conventional vs Modern method (Philosophies) of Tunneling-re.pptxConventional vs Modern method (Philosophies) of Tunneling-re.pptx
Conventional vs Modern method (Philosophies) of Tunneling-re.pptxSAQIB KHURSHEED WANI
 
Fabrics Finishing Manual ( Arkey Knit Dyeing Mills Ltd)
Fabrics Finishing Manual ( Arkey Knit Dyeing Mills Ltd)Fabrics Finishing Manual ( Arkey Knit Dyeing Mills Ltd)
Fabrics Finishing Manual ( Arkey Knit Dyeing Mills Ltd)Mizan Rahman
 
Navigating Process Safety through Automation and Digitalization in the Oil an...
Navigating Process Safety through Automation and Digitalization in the Oil an...Navigating Process Safety through Automation and Digitalization in the Oil an...
Navigating Process Safety through Automation and Digitalization in the Oil an...soginsider
 
Injection Power Cycle - The most efficient power cycle
Injection Power Cycle - The most efficient power cycleInjection Power Cycle - The most efficient power cycle
Injection Power Cycle - The most efficient power cyclemarijomiljkovic1
 
Tekom Netherlands | The evolving landscape of Simplified Technical English b...
Tekom Netherlands | The evolving landscape of Simplified Technical English  b...Tekom Netherlands | The evolving landscape of Simplified Technical English  b...
Tekom Netherlands | The evolving landscape of Simplified Technical English b...Shumin Chen
 
Evaluation Methods for Social XR Experiences
Evaluation Methods for Social XR ExperiencesEvaluation Methods for Social XR Experiences
Evaluation Methods for Social XR ExperiencesMark Billinghurst
 
Artificial organ courses Hussein L1-C2.pptx
Artificial organ courses Hussein  L1-C2.pptxArtificial organ courses Hussein  L1-C2.pptx
Artificial organ courses Hussein L1-C2.pptxHusseinMishbak
 
Searching and Sorting Algorithms
Searching and Sorting AlgorithmsSearching and Sorting Algorithms
Searching and Sorting AlgorithmsAshutosh Satapathy
 
0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx
0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx
0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptxssuser886c55
 
The Art of Cloud Native Defense on Kubernetes
The Art of Cloud Native Defense on KubernetesThe Art of Cloud Native Defense on Kubernetes
The Art of Cloud Native Defense on KubernetesJacopo Nardiello
 
Research paper publications: Meaning of Q1 Q2 Q3 Q4 Journal
Research paper publications: Meaning of Q1 Q2 Q3 Q4 JournalResearch paper publications: Meaning of Q1 Q2 Q3 Q4 Journal
Research paper publications: Meaning of Q1 Q2 Q3 Q4 JournalDr. Manjunatha. P
 
Introduction to Data Structures .
Introduction to Data Structures        .Introduction to Data Structures        .
Introduction to Data Structures .Ashutosh Satapathy
 
Chapter 2 Canal Falls at Mnnit Allahabad .pptx
Chapter 2 Canal Falls at Mnnit Allahabad .pptxChapter 2 Canal Falls at Mnnit Allahabad .pptx
Chapter 2 Canal Falls at Mnnit Allahabad .pptxButcher771
 

Último (20)

introduction to python, fundamentals and basics
introduction to python, fundamentals and basicsintroduction to python, fundamentals and basics
introduction to python, fundamentals and basics
 
Introduction to Machine Learning Unit-2 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-2 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-2 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-2 Notes for II-II Mechanical Engineering
 
عناصر نباتية PDF.pdf architecture engineering
عناصر نباتية PDF.pdf architecture engineeringعناصر نباتية PDF.pdf architecture engineering
عناصر نباتية PDF.pdf architecture engineering
 
Conventional vs Modern method (Philosophies) of Tunneling-re.pptx
Conventional vs Modern method (Philosophies) of Tunneling-re.pptxConventional vs Modern method (Philosophies) of Tunneling-re.pptx
Conventional vs Modern method (Philosophies) of Tunneling-re.pptx
 
Fabrics Finishing Manual ( Arkey Knit Dyeing Mills Ltd)
Fabrics Finishing Manual ( Arkey Knit Dyeing Mills Ltd)Fabrics Finishing Manual ( Arkey Knit Dyeing Mills Ltd)
Fabrics Finishing Manual ( Arkey Knit Dyeing Mills Ltd)
 
Industry perspective on cold in-place recycling
Industry perspective on cold in-place recyclingIndustry perspective on cold in-place recycling
Industry perspective on cold in-place recycling
 
Navigating Process Safety through Automation and Digitalization in the Oil an...
Navigating Process Safety through Automation and Digitalization in the Oil an...Navigating Process Safety through Automation and Digitalization in the Oil an...
Navigating Process Safety through Automation and Digitalization in the Oil an...
 
Caltrans view on recycling of in-place asphalt pavements
Caltrans view on recycling of in-place asphalt pavementsCaltrans view on recycling of in-place asphalt pavements
Caltrans view on recycling of in-place asphalt pavements
 
Injection Power Cycle - The most efficient power cycle
Injection Power Cycle - The most efficient power cycleInjection Power Cycle - The most efficient power cycle
Injection Power Cycle - The most efficient power cycle
 
Tekom Netherlands | The evolving landscape of Simplified Technical English b...
Tekom Netherlands | The evolving landscape of Simplified Technical English  b...Tekom Netherlands | The evolving landscape of Simplified Technical English  b...
Tekom Netherlands | The evolving landscape of Simplified Technical English b...
 
Evaluation Methods for Social XR Experiences
Evaluation Methods for Social XR ExperiencesEvaluation Methods for Social XR Experiences
Evaluation Methods for Social XR Experiences
 
Artificial organ courses Hussein L1-C2.pptx
Artificial organ courses Hussein  L1-C2.pptxArtificial organ courses Hussein  L1-C2.pptx
Artificial organ courses Hussein L1-C2.pptx
 
Searching and Sorting Algorithms
Searching and Sorting AlgorithmsSearching and Sorting Algorithms
Searching and Sorting Algorithms
 
0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx
0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx
0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx
 
The Art of Cloud Native Defense on Kubernetes
The Art of Cloud Native Defense on KubernetesThe Art of Cloud Native Defense on Kubernetes
The Art of Cloud Native Defense on Kubernetes
 
FOREST FIRE USING IoT-A Visual to UG students
FOREST FIRE USING IoT-A Visual to UG studentsFOREST FIRE USING IoT-A Visual to UG students
FOREST FIRE USING IoT-A Visual to UG students
 
Research paper publications: Meaning of Q1 Q2 Q3 Q4 Journal
Research paper publications: Meaning of Q1 Q2 Q3 Q4 JournalResearch paper publications: Meaning of Q1 Q2 Q3 Q4 Journal
Research paper publications: Meaning of Q1 Q2 Q3 Q4 Journal
 
Introduction to Data Structures .
Introduction to Data Structures        .Introduction to Data Structures        .
Introduction to Data Structures .
 
Update on the latest research with regard to RAP
Update on the latest research with regard to RAPUpdate on the latest research with regard to RAP
Update on the latest research with regard to RAP
 
Chapter 2 Canal Falls at Mnnit Allahabad .pptx
Chapter 2 Canal Falls at Mnnit Allahabad .pptxChapter 2 Canal Falls at Mnnit Allahabad .pptx
Chapter 2 Canal Falls at Mnnit Allahabad .pptx
 

Strata sf - Amundsen presentation

  • 1. March 2019 Mark Grover | @mark_grover | Product Management, Lyft Tao Feng | @feng-tao | Software Engineer, Lyft Disrupting Data Discovery
  • 2. Agenda • Data at Lyft • Challenges with Data Discovery • Data Discovery at Lyft • Architecture • Summary 2
  • 3. Data platform users 3 Data Modelers Analysts Data Scientists General Managers Data Platform Engineers ExperimentersProduct Managers
  • 4. 5 Core Infra high level architecture Custom apps
  • 6. • My first project is to analyze and predict Strata Attendance • Where is the data? • What does it mean? Hi! I am a n00b Data Scientist! 7
  • 7. • Option 1: Phone a friend! • Option 2: Github search Status quo 8
  • 8. • What does this field mean? ‒ Does attendance data include employees? ‒ Does it include revenue? • Let me dig in and understand Understand the context 9
  • 10. Exploring with SELECT * is EVIL 1. Lack of productivity for data scientists 2. Increased load on the databases 11
  • 11. Data Scientists spend upto 1/3rd time in Data Discovery... 12 • Data discovery ‒ Lack of understanding of what data exists, where, who owns it, who uses it, and how to request access.
  • 13. Data Discovery - User personas 14 Data Modelers Analysts Data Scientists General Managers Data Platform Engineers ExperimentersProduct Managers
  • 14. 3 Data Scientist personas Power user ● All info in their head ● Get interrupted a lot due to questions ● Lost ● Ask “power users” a lot of questions ● Dependencies landing on time ● Communicating with stakeholders Noob user Manager
  • 15. Search based Lineage based Network based Where is the table/dashboard for X? What does it contain? I am changing a data model, who are the owner and most common users? I want to follow a power user in my team. Does this analysis already exist? This table’s delivery was delayed today, I want to notify everyone downstream. I want to bookmark tables of interest and get a feed of data delay, schema change, incidents. Data Discovery answers 3 kinds of questions
  • 16. Buy vs. Build vs. Adopt 17
  • 17. Compared various existing solutions/open source projects Criteria / Products Alation Where Hows Airbnb Data Portal Cloudera Navigator Apache Atlas Search based Lineage based Network based Hive/Presto support Redshift support Open source (pref.)
  • 18. Meet Amundsen 19 First person to discover the South Pole - Norwegian explorer, Roald Amundsen
  • 20. Search results ranked on relevance and query activity
  • 21. How does search work? 22
  • 22. Relevance - search for “apple” on Google 23 Low relevance High relevance
  • 23. Popularity - search for “apple” on Google 24 Low popularity High popularity
  • 24. Striking the balance 25 Relevance Popularity ● Names, Descriptions, Tags, [owners, frequent users] ● Querying activity ● Dashboarding ● Different weights for automated vs adhoc querying
  • 26. Search results ranked on relevance and query activity
  • 27. Detailed description and metadata about data resources
  • 29. Computed stats about column metadata Disclaimer: these stats are arbitrary.
  • 32. 33 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 34. 35 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 35. Detailed description and metadata about data resources
  • 37. 38 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 38. 39 2. Metadata Service • A thin proxy layer to interact with graph database ‒ Currently Neo4j is the default option for graph backend engine. ‒ Work with the community to support Apache Atlas • Support Rest API for other services pushing / pulling metadata directly
  • 39. Trade Off #1 Why choose Graph database 40
  • 42. Trade Off #2 Why not propagate the metadata back to source 43
  • 43. Why not propagate the metadata back to source 44
  • 44. Why not propagate the metadata back to source 45 ? ?
  • 46. 47 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 47. 3. Search Service • Support REST API for building indexes • A thin proxy layer to interact with the search backend ‒ Currently it supports Elasticsearch as backend. • Support different search patterns ‒ Normal Search: match records based on relevancy ‒ Category Search: match records first based on data type, then relevancy ‒ Wildcard Search 48
  • 48. Challenge #1 How to make the search result more relevant? 49
  • 49. How to make the search result more relevant? 50 • Define a search quality metric ‒ Click-Through-Rate (CTR) over top 5 results • Search behaviour instrumentation is key • Couple of improvements: ‒ Boost the exact table ranking ‒ Support wildcard search ‒ Support category search (e.g. “column: is_line_ride”)
  • 51. 52 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Other Services Other Microservices Metadata Sources
  • 52. Challenge #1 Various forms of metadata 53
  • 54. Metadata - Challenges • Standardization: No single data model that fits for all data resources ‒ A data resource could be a table, an Airflow DAG or a dashboard • Extraction: Each data set metadata is stored and fetched differently, ‒ Hive Table: Stored in Hive metastore ‒ RDBMS(postgres etc): Fetched through DBAPI interface ‒ Github source code: Fetched through git hook ‒ Mode dashboard: Fetched through Mode API ‒ … 55
  • 55. Challenge #2 Pull model vs Push model 56
  • 56. Pull model vs. Push model 57 Pull Model Push Model ● Periodically update the index by pulling from the system (e.g. database) via crawlers. ● The system (e.g. database) pushes metadata to a message bus which downstream subscribes to. Crawler Database Data graph Scheduler Database Message queue Data graph
  • 59. How are we building data? Databuilder
  • 60. How is databuilder orchestrated? Amundsen uses Apache Airflow to orchestrate Databuilder jobs
  • 62. Amundsen seems to be more useful than what we thought • Tremendous success at Lyft ‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service! • Many organizations have similar problems ‒ Collaborating with ING, WeWork and more ‒ We plan to announce open source soon 65
  • 63. Impact - Amundsen at Lyft 66 Beta release (internal) Generally Available (GA) release Alpha release
  • 64. Adding more kinds of data resources PeopleDashboardsData sets Phase 1 (Complete) Phase 2 (In development) Phase 3 (In Scoping) Streams Schemas Workflows
  • 65. Serving more metadata about existing resources Application Context Existence, description, semantics, etc. Behavior How data is created and used over time Change How data is changing over time
  • 67. Summary • Data Discovery making data scientists unproductive • 3 types of data discovery - search, lineage and network based • Amundsen: Data graph for all data • Blog post with more details: go.lyft.com/datadiscoveryblog 70
  • 68. Mark Grover | @mark_grover Tao Feng | @feng-tao Blog post at go.lyft.com/datadiscoveryblog Icons under Creative Commons License from https://thenounproject.com/ 71

Notas do Editor

  1. Today’s agenda: Why empowering with data is important… What are we doing in the data team at Lyft (context)... What challenges we are facing and have seen other companies face… How are we solving the problem... At the core of it, we will primarily talk about the Data Discovery solution we are building and how we thought about the use case, solution, and the architecture.
  2. Who is our audience: everyone who works at Lyft… Power users: Data Scientists, Research Scientists, Product Managers… Next: Engineers, GMs, Ops, etc.
  3. What do we do: Wide scope of work… de-mystify and democratize data at Lyft… Not mentioned here: Operational / transactional databases...
  4. What does the architecture for our core infra look like? Mobile application primarily… Raw events can come either from the client… or from the back end events triggered in the server… the data comes to our message bus… Kinesis/Kafka and then with light ELTing written to S3 where it persists… today we keep all the data in archival… then we develop data models and transform raw events to tables in Hive. We use Hive from long running queries and Presto for interactive queries… People build dashboards on top of Hive and visualize for exploratory analysis in Presto...
  5. Mark
  6. Mark
  7. Mark
  8. Mark
  9. Data Discovery: How much of a challenge is it? Significant challenge… Data Scientists spend up to 1/3rd of their time in Data Discovery while doing exploratory analysis… We surveyed users at Lyft and a few other companies: You’d want to spend most of the time on analysis… But we have ~10PBs of data, thousands of tables… so it is hard to find what is there and what is the source of truth… We can significantly increase productivity and impact if we can reduce this time...
  10. Mark
  11. Mark
  12. Popularity is not click through rate but through query access patterns.
  13. Amundsen architecture at Lyft: 3 micro-services(FE, metadata, search) and one generic data ingestion framework Will discuss each of the component in details High level walkthrough….how CCPA compilance works
  14. What options we have (graph, rdbms,
  15. ML features (one sentence on what is feature service) Add a logo of neo4j ? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
  16. Why graph db is the best option? Why not rdbms, nosql etc Why choose neo4j vs other graph db? (most popular graph db) Amundsen needs to provide metadata which includes table, column, column statistics, usage information etc. Along with that, Amundsen also needs to provide lineage information where it need to be able to provide producer, consumer relationships within the life cycle of data. Lineage could be simplified as a graph of entities and edges. E.g in the graph blahblah There are other options: NoSQL(no join support), RDBMS(performance of join is not good)
  17. Above graph data model shows our use case to show table and column metadata with usage to the column level. Querying a table detail from this Neo4j graph would be like asking Neo4j to search for a table node as a starting node and traverse it. In other words, there’s only one search operation needed to find anchor node which makes Neo4j performant -- no join operation at all. We model the graph as bi-direction relationships.
  18. ? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
  19. ? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
  20. ? challenge How to improve search relevance?
  21. Think of search algorithm… Event_ride_table -> event ride table
  22. Talk about how we measure it: instrument empty results from user, % of click through rate What we did? Boost the rank of exact table name match
  23. Walk through pros and cons
  24. Pull approach is basically extract data from the source periodically. Amundsen databuilder will be responsible for extraction, transformation, and load. This naturally gives us three abstracted construct Extractor, Transformer, Loader and, optionally Publisher. The design principle follows Apache Gobblin Extractor extracts record from source one record at a time. For example in Hive, we would need a column metadata extractor for a table where each record represents a column of a table. Transformer transforms a record. Any use case that we may have to transform (e.g: remove special character) or decorate the record (e.g: make a service call to enrich data). This is a place for that. Loader writes data into either sink (destination) or into staging area. Publisher assumes that loader loaded into staging area and publishes it to destination. Atomicity is a desired behavior but it’s up to the limitation of sink itself’s support on Atomicity.
  25. We currently index 2 a day, dependency Why we index twice a day?
  26. Today’s agenda: Why empowering with data is important… What are we doing in the data team at Lyft (context)... What challenges we are facing and have seen other companies face… How are we solving the problem... At the core of it, we will primarily talk about the Data Discovery solution we are building and how we thought about the use case, solution, and the architecture.
  27. A slide on Amundsen @ Lyft? ? how long it has been in prod How many datasets Users WAU Usage
  28. Kafka topic Schema registry ML workflow and Airflow DAGs
  29. Three long term type of metadata(A.B.C) We want to use the index A,B,C for the datasources we mentioned in last slide
  30. Mark