SlideShare uma empresa Scribd logo
1 de 66
Baixar para ler offline
Wednesday, September 18th 2019
Tamika Tannis | @ttannis | Software Engineer, Lyft
go.lyft.com/datadiscoveryslides
Disrupting Data Discovery
Agenda
• Data Ecosystem at Lyft
• Challenges with Data Discovery
• Data Discovery at Lyft
• Amundsen’s Architecture
• What’s Next?
2
Data Ecosystem at Lyft
3
4
Core Data Infrastructure (High Level)
Custom
Applications
Architecture Applications
Mobile App
Services
Services
Data Streaming
Frameworks
(Kafka / Kinesis)
Flink
Challenges with Data
Discovery
5
Data is used to make informed decisions
6
Analysts Data Scientists General
Managers
Engineers ExperimentersProduct
Managers
Data-driven decision making process:
1. Search & find data
2. Understand the data
3. Perform an analysis/visualisation
4. Share insights and/or make a decision
Make data the heart of every decision
• Goal: What new data-driven policies can we enact to reduce driver
insurance fraud?
• Idea: Let’s take a deeper look into insurance claims from drivers who
have given less than 𝑥 rides.
• Next Step: I’ll first get all drivers who have given less than 𝑥 rides...but
where do I look?
Hi! I’m a new Analyst in the Fraud Department !
7
• Ask a friend/manager/coworker
• Ask in a wider Slack channel
• Search in the Github repos
Step 1: Search & find data
8
We end up finding tables: driver_rides
& rides_driver_total
• What is the difference: driver_rides vs. rides_driver_total
• What do the different fields mean?
‒ Is driver_rides.completed different from
rides_driver_total.lifetime_completed?
‒ What period of time does the data in each table cover?
• Dig deeper: explore using SQL queries
Step 2: Understand the data
9
SELECT * FROM schema.driver_rides
WHERE ds=’2019-05-15’
LIMIT 100;
SELECT * FROM schema.rides_driver_total
WHERE ds=’2019-05-15’
LIMIT 100;
Data Scientists spend upto 1/3rd time in Data Discovery
10
Data Discovery
• Data discovery is a problem
because of the lack of understanding
of what data exists, where, who owns
it, & how to use it.
• It is not what our data scientist
should focus on: they should focus
on Analysis work
Data-based decision making process:
1. Search & find data
2. Understand the data
3. Perform an analysis/visualisation
4. Share insights and/or take a decision
Audience for data
discovery
11
User Personas - (1/2)
12
Analysts Data Scientists General
Managers
ExperimentersEngineersProduct
Managers
• Frequent use of data
• Deep to very deep analysis
• Exposure to new datasets
• Creating insights & developing
models
User Personas - (2/2)
13
Power User
- Has been at Lyft for a long
time
- Knows the data environment
well: where to find data, what
it means, how to use it
Pain points:
- Needs to spend a fair amount
of their time sharing their
knowledge with the new user
- Could become “New user” if
they switch teams
New User
- Recently joined Lyft or
switched to a new team
- Needs to ramp up on a lot of
things, wants to start having
impact soon
Pain points:
- Doesn’t know where to start.
Spends their time asking
questions and cmd+F on
github
- Makes mistakes by mis-using
some datasets
3 complementary ways to do Data Discovery
14
Search based
I am looking for a table with data on “cancel rates”
- Where is the table?
- What does it contain?
- Has the analysis I want to perform already been done?
Lineage based
If this event is down, what datasets are going to be impacted?
- Upstream/downstream lineage
- Incidents, SLA misses, Data quality
Network based
I want to check what tables my manager uses
- Ownership information
- Bookmarking
- Usage through query logs
Data Discovery at Lyft
15
Product named after Roald Amundsen
● First expedition to reach the South Pole
● First to explore both North & South Poles
Landing Page - Optimized for search
Search Results - Ranked on relevance & popularity
Relevance - search for “apple” on Google
18
Low relevance High relevance
Popularity - search for “apple” on Google
19
Low popularity High popularity
Search Results - Striking the balance
20
Relevance Popularity
● Names, Descriptions, Tags, [owners, frequent
users]
● Different weights for different metadata, e.g.
resource name
● Querying activity
● Dashboarding
● Lower weight for automated querying
● Higher weight for adhoc querying
View Resource Metadata
Data Preview
22
View Resource Metadata
Computed Column Metadata Statistics
Disclaimer: these stats are arbitrary.
View Resource Metadata
In-Application User Feedback
Amundsen’s Architecture
27
28
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
1. Metadata Service
29
30
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
View Resource Metadata
Why choose a graph
database?
32
33
Why Graph database? (1/2)
34
Why Graph database? (2/2)
35
2. Metadata Service
• A thin proxy layer to interact with graph database
‒ Currently Neo4j is the default option for graph backend engine
‒ Work with the community to support Apache Atlas
• Support Rest API for other services pushing / pulling metadata directly
Neo4j is the source of truth for
editable metadata
36
Why not propagate the editabled metadata back to source
37
Why not propagate the editabled metadata back to source
38
Why not propagate the editabled metadata back to source
39
Why not propagate the editabled metadata back to source
40
2. Databuilder
41
42
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Other
Services
Other Microservices
Metadata Sources
43
Metadata Sources @ Lyft
Metadata - Challenges
• No Standardization: No single data model that fits for all data
resources
‒ A data resource could be a table, an Airflow DAG or a dashboard
• Different Extraction: Each data set metadata is stored and fetched
differently
‒ Hive Table: Stored in Hive metastore
‒ RDBMS(postgres etc): Fetched through DBAPI interface
‒ Github source code: Fetched through git hook
‒ Mode dashboard: Fetched through Mode API
‒ …
44
Databuilder
45
Databuilder in action
46
How is the databuilder orchestrated?
47
Amundsen uses Apache Airflow to orchestrate Databuilder jobs
3. Search Service
48
49
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
3. Search Service
• A thin proxy layer to interact with the search backend
‒ Currently it supports Elasticsearch as the search backend.
• Support different search patterns
‒ Normal Search: match records based on relevancy
‒ Category Search: match records first based on data type, then
relevancy
‒ Wildcard Search
50
How to make the search result more relevant?
51
• Experiment with different weights, e.g boost the exact table ranking
• Collect metrics
‒ Instrumentation for search behavior
‒ Measure click-through-rate (CTR) over top 5 results
• Advanced search:
‒ Support wildcard search (e.g. event_*)
‒ Support category search (e.g. column: is_line_ride)
‒ Future: Filtering, Autosuggest
4. Frontend Service
52
53
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
Web Application
Web Technologies
55
Develop Build Test
What’s Next?
56
Amundsen’s Impact
• Tremendous success at Lyft
‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service!
‒ 90% penetration among Data Scientists
‒ +30% productivity for the Data science org.
57
Amundsen is Open Source!
• github.com/lyft/amundsen
• Growing and active community
‒ c.150 github stars, 10+ companies contributing back
‒ Slack w/ 30+ companies and c.100 people
‒ Presented at conferences in San Francisco, Barcelona, Vilnius, Moscow by Lyft
employees and community
‒ Featured in blog posts and interviews
• Net positive impact for Lyft through external community contributing
‒ Integration with open source backend
‒ Integration with new data sources (BigQuery, Redshift, Postgres), lifting them from
our roadmap
58
Community Overview
59
ContributorsActivecommunity
Roadmap
PeopleDashboards
Data sets
Phase 1
(Complete)
Phase 2
(In Progress)
Phase 3
(In Scoping)
Streams Schemas Workflows
More
Metadata
Deeper integration with other
tools (e.g. Mode)
Privacy Governance
Amundsen People
61
Amundsen People
62
Roadmap
PeopleDashboards
Data sets
Phase 1
(Complete)
Phase 2
(In Progress)
Phase 3
(In Scoping)
Streams Schemas Workflows
More
Metadata
Deeper integration with other
tools (e.g. Mode)
Privacy Governance
Roadmap
64
Roadmap
65
Tamika Tannis | @ttannis | /in/tamika-tannis
Project Code @ github.com/lyft/amundsen
Blog Post @ go.lyft.com/datadiscoveryblog
Icons under Creative Commons License from https://thenounproject.com/
66

Mais conteúdo relacionado

Mais procurados

Some Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdfSome Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdfMichael Kogan
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialDaniel Abadi
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkDongwon Kim
 
SplunkSummit 2015 - A Quick Guide to Search Optimization
SplunkSummit 2015 - A Quick Guide to Search OptimizationSplunkSummit 2015 - A Quick Guide to Search Optimization
SplunkSummit 2015 - A Quick Guide to Search OptimizationSplunk
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Databricks
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...Chester Chen
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure DataTaro L. Saito
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingDatabricks
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark Aakashdata
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
How Expedia’s Entity Graph Powers Global Travel
How Expedia’s Entity Graph Powers Global TravelHow Expedia’s Entity Graph Powers Global Travel
How Expedia’s Entity Graph Powers Global TravelNeo4j
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache RangerDataWorks Summit
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Edureka!
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 

Mais procurados (20)

Some Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdfSome Iceberg Basics for Beginners (CDP).pdf
Some Iceberg Basics for Beginners (CDP).pdf
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmark
 
SplunkSummit 2015 - A Quick Guide to Search Optimization
SplunkSummit 2015 - A Quick Guide to Search OptimizationSplunkSummit 2015 - A Quick Guide to Search Optimization
SplunkSummit 2015 - A Quick Guide to Search Optimization
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure Data
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
How Expedia’s Entity Graph Powers Global Travel
How Expedia’s Entity Graph Powers Global TravelHow Expedia’s Entity Graph Powers Global Travel
How Expedia’s Entity Graph Powers Global Travel
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 

Semelhante a Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation

How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryMark Grover
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo
 
Using Data Science for Cybersecurity
Using Data Science for CybersecurityUsing Data Science for Cybersecurity
Using Data Science for CybersecurityVMware Tanzu
 
Large scale computing
Large scale computing Large scale computing
Large scale computing Bhupesh Bansal
 
Citi Global T4I Accelerator Data and Analytics Presentation
Citi Global T4I Accelerator Data and Analytics PresentationCiti Global T4I Accelerator Data and Analytics Presentation
Citi Global T4I Accelerator Data and Analytics PresentationMarquis Cabrera
 
Business Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptxBusiness Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptxRupaRani28
 
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...inventionjournals
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media Analysisikanow
 
Neo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data ManagementeXascale Infolab
 
Advanced Use Cases for Analytics Breakout Session
Advanced Use Cases for Analytics Breakout SessionAdvanced Use Cases for Analytics Breakout Session
Advanced Use Cases for Analytics Breakout SessionSplunk
 
Demystifying Systems for Interactive and Real-time Analytics
Demystifying Systems for Interactive and Real-time AnalyticsDemystifying Systems for Interactive and Real-time Analytics
Demystifying Systems for Interactive and Real-time AnalyticsDataWorks Summit
 
How Celtra Optimizes its Advertising Platform with Databricks
How Celtra Optimizes its Advertising Platformwith DatabricksHow Celtra Optimizes its Advertising Platformwith Databricks
How Celtra Optimizes its Advertising Platform with DatabricksGrega Kespret
 

Semelhante a Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation (20)

How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data Discovery
 
unit 1 big data.pptx
unit 1 big data.pptxunit 1 big data.pptx
unit 1 big data.pptx
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
 
Using Data Science for Cybersecurity
Using Data Science for CybersecurityUsing Data Science for Cybersecurity
Using Data Science for Cybersecurity
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Citi Global T4I Accelerator Data and Analytics Presentation
Citi Global T4I Accelerator Data and Analytics PresentationCiti Global T4I Accelerator Data and Analytics Presentation
Citi Global T4I Accelerator Data and Analytics Presentation
 
Business Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptxBusiness Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptx
 
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ...
 
BAS 250 Lecture 1
BAS 250 Lecture 1BAS 250 Lecture 1
BAS 250 Lecture 1
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media Analysis
 
Neo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperative
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
 
Advanced Use Cases for Analytics Breakout Session
Advanced Use Cases for Analytics Breakout SessionAdvanced Use Cases for Analytics Breakout Session
Advanced Use Cases for Analytics Breakout Session
 
Demystifying Systems for Interactive and Real-time Analytics
Demystifying Systems for Interactive and Real-time AnalyticsDemystifying Systems for Interactive and Real-time Analytics
Demystifying Systems for Interactive and Real-time Analytics
 
How Celtra Optimizes its Advertising Platform with Databricks
How Celtra Optimizes its Advertising Platformwith DatabricksHow Celtra Optimizes its Advertising Platformwith Databricks
How Celtra Optimizes its Advertising Platform with Databricks
 

Último

VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 

Último (20)

VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 

Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation

  • 1. Wednesday, September 18th 2019 Tamika Tannis | @ttannis | Software Engineer, Lyft go.lyft.com/datadiscoveryslides Disrupting Data Discovery
  • 2. Agenda • Data Ecosystem at Lyft • Challenges with Data Discovery • Data Discovery at Lyft • Amundsen’s Architecture • What’s Next? 2
  • 4. 4 Core Data Infrastructure (High Level) Custom Applications Architecture Applications Mobile App Services Services Data Streaming Frameworks (Kafka / Kinesis) Flink
  • 6. Data is used to make informed decisions 6 Analysts Data Scientists General Managers Engineers ExperimentersProduct Managers Data-driven decision making process: 1. Search & find data 2. Understand the data 3. Perform an analysis/visualisation 4. Share insights and/or make a decision Make data the heart of every decision
  • 7. • Goal: What new data-driven policies can we enact to reduce driver insurance fraud? • Idea: Let’s take a deeper look into insurance claims from drivers who have given less than 𝑥 rides. • Next Step: I’ll first get all drivers who have given less than 𝑥 rides...but where do I look? Hi! I’m a new Analyst in the Fraud Department ! 7
  • 8. • Ask a friend/manager/coworker • Ask in a wider Slack channel • Search in the Github repos Step 1: Search & find data 8 We end up finding tables: driver_rides & rides_driver_total
  • 9. • What is the difference: driver_rides vs. rides_driver_total • What do the different fields mean? ‒ Is driver_rides.completed different from rides_driver_total.lifetime_completed? ‒ What period of time does the data in each table cover? • Dig deeper: explore using SQL queries Step 2: Understand the data 9 SELECT * FROM schema.driver_rides WHERE ds=’2019-05-15’ LIMIT 100; SELECT * FROM schema.rides_driver_total WHERE ds=’2019-05-15’ LIMIT 100;
  • 10. Data Scientists spend upto 1/3rd time in Data Discovery 10 Data Discovery • Data discovery is a problem because of the lack of understanding of what data exists, where, who owns it, & how to use it. • It is not what our data scientist should focus on: they should focus on Analysis work Data-based decision making process: 1. Search & find data 2. Understand the data 3. Perform an analysis/visualisation 4. Share insights and/or take a decision
  • 12. User Personas - (1/2) 12 Analysts Data Scientists General Managers ExperimentersEngineersProduct Managers • Frequent use of data • Deep to very deep analysis • Exposure to new datasets • Creating insights & developing models
  • 13. User Personas - (2/2) 13 Power User - Has been at Lyft for a long time - Knows the data environment well: where to find data, what it means, how to use it Pain points: - Needs to spend a fair amount of their time sharing their knowledge with the new user - Could become “New user” if they switch teams New User - Recently joined Lyft or switched to a new team - Needs to ramp up on a lot of things, wants to start having impact soon Pain points: - Doesn’t know where to start. Spends their time asking questions and cmd+F on github - Makes mistakes by mis-using some datasets
  • 14. 3 complementary ways to do Data Discovery 14 Search based I am looking for a table with data on “cancel rates” - Where is the table? - What does it contain? - Has the analysis I want to perform already been done? Lineage based If this event is down, what datasets are going to be impacted? - Upstream/downstream lineage - Incidents, SLA misses, Data quality Network based I want to check what tables my manager uses - Ownership information - Bookmarking - Usage through query logs
  • 15. Data Discovery at Lyft 15 Product named after Roald Amundsen ● First expedition to reach the South Pole ● First to explore both North & South Poles
  • 16. Landing Page - Optimized for search
  • 17. Search Results - Ranked on relevance & popularity
  • 18. Relevance - search for “apple” on Google 18 Low relevance High relevance
  • 19. Popularity - search for “apple” on Google 19 Low popularity High popularity
  • 20. Search Results - Striking the balance 20 Relevance Popularity ● Names, Descriptions, Tags, [owners, frequent users] ● Different weights for different metadata, e.g. resource name ● Querying activity ● Dashboarding ● Lower weight for automated querying ● Higher weight for adhoc querying
  • 24. Computed Column Metadata Statistics Disclaimer: these stats are arbitrary.
  • 28. 28 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 30. 30 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 32. Why choose a graph database? 32
  • 35. 35 2. Metadata Service • A thin proxy layer to interact with graph database ‒ Currently Neo4j is the default option for graph backend engine ‒ Work with the community to support Apache Atlas • Support Rest API for other services pushing / pulling metadata directly
  • 36. Neo4j is the source of truth for editable metadata 36
  • 37. Why not propagate the editabled metadata back to source 37
  • 38. Why not propagate the editabled metadata back to source 38
  • 39. Why not propagate the editabled metadata back to source 39
  • 40. Why not propagate the editabled metadata back to source 40
  • 42. 42 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Other Services Other Microservices Metadata Sources
  • 44. Metadata - Challenges • No Standardization: No single data model that fits for all data resources ‒ A data resource could be a table, an Airflow DAG or a dashboard • Different Extraction: Each data set metadata is stored and fetched differently ‒ Hive Table: Stored in Hive metastore ‒ RDBMS(postgres etc): Fetched through DBAPI interface ‒ Github source code: Fetched through git hook ‒ Mode dashboard: Fetched through Mode API ‒ … 44
  • 47. How is the databuilder orchestrated? 47 Amundsen uses Apache Airflow to orchestrate Databuilder jobs
  • 49. 49 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 50. 3. Search Service • A thin proxy layer to interact with the search backend ‒ Currently it supports Elasticsearch as the search backend. • Support different search patterns ‒ Normal Search: match records based on relevancy ‒ Category Search: match records first based on data type, then relevancy ‒ Wildcard Search 50
  • 51. How to make the search result more relevant? 51 • Experiment with different weights, e.g boost the exact table ranking • Collect metrics ‒ Instrumentation for search behavior ‒ Measure click-through-rate (CTR) over top 5 results • Advanced search: ‒ Support wildcard search (e.g. event_*) ‒ Support category search (e.g. column: is_line_ride) ‒ Future: Filtering, Autosuggest
  • 53. 53 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 57. Amundsen’s Impact • Tremendous success at Lyft ‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service! ‒ 90% penetration among Data Scientists ‒ +30% productivity for the Data science org. 57
  • 58. Amundsen is Open Source! • github.com/lyft/amundsen • Growing and active community ‒ c.150 github stars, 10+ companies contributing back ‒ Slack w/ 30+ companies and c.100 people ‒ Presented at conferences in San Francisco, Barcelona, Vilnius, Moscow by Lyft employees and community ‒ Featured in blog posts and interviews • Net positive impact for Lyft through external community contributing ‒ Integration with open source backend ‒ Integration with new data sources (BigQuery, Redshift, Postgres), lifting them from our roadmap 58
  • 60. Roadmap PeopleDashboards Data sets Phase 1 (Complete) Phase 2 (In Progress) Phase 3 (In Scoping) Streams Schemas Workflows More Metadata Deeper integration with other tools (e.g. Mode) Privacy Governance
  • 63. Roadmap PeopleDashboards Data sets Phase 1 (Complete) Phase 2 (In Progress) Phase 3 (In Scoping) Streams Schemas Workflows More Metadata Deeper integration with other tools (e.g. Mode) Privacy Governance
  • 66. Tamika Tannis | @ttannis | /in/tamika-tannis Project Code @ github.com/lyft/amundsen Blog Post @ go.lyft.com/datadiscoveryblog Icons under Creative Commons License from https://thenounproject.com/ 66