Democratizing Data within your Organization
Mark Grover | @mark_grover | Product Management, Lyft
Deepak Tiwari | @_deepaktiwari_ | Product Management, Lyft
May 2018
go.lyft.com/strata18
Agenda
• Empowering with Data
• Data at Lyft
• Challenges with Data Discovery
• Data Discovery at Lyft
Data democratization is important...
democracy
noun de·moc·ra·cy  di-ˈmä-krə-sē 
: the absence of hereditary or arbitrary class distinctions or privileges
There are several challenges to data democratization...
• Data discovery
‒ Lack of understanding of what data exists, where, who owns it, who uses it, and how to request access.
• Data tools
‒ Creation: Productivity and technical knowledge (e.g. ETL)
‒ Consumption: Tools for exploration and analysis (e.g. visualization, attribution, etc.)
Data scientists spend up to a third of their time on data discovery...
• Data discovery
‒ Lack of understanding of what data exists, where, who owns it, who uses it, and how to request access.
Lyft: Fastest growing ride hailing service in North America
Lyft Data Team
• Core Data Infra
• Streaming Infra
• Visualization
• Experimentation
• BI and Logging
• ML Infra
Data platform users
• Data Modelers
• Data Scientists
• Research Scientists
• General Managers
• Data Platform Engineers
• Experimenters
• Product Managers
Core Infra high level architecture
Custom apps
Life of an event
[Diagram: the "golden path" of an event. Labeled components: Client (iOS / Android), events stored locally; Server; Call Ingest; Pub/Sub (Kafka); Stream Ingest; Storage (S3); ETLs; Core Infra (Hadoop: Hive / Presto); Visualization / Query Layer (read); Monitoring / Anodot; Streamcheck]
Data Discovery
Hi! I am a n00b Data Scientist!
• My first project is to analyze and predict Strata attendance
First questions
• Where is the data?
• What does it mean?
Option #1 - GitHub search for “strata attendance”
● But which one do I use?
Option #2 - Ask a power user
• Doesn’t scale
• Sometimes YOU are the first one!
Understand the context
• What does this field mean?
‒ Does attendance data include employees?
‒ Does it include revenue?
• Let me dig in and understand
Explore
SELECT *
FROM default.my_table
WHERE ds = '2018-01-01'
LIMIT 100;
Exploring with SELECT * is EVIL
1. Lack of productivity for data scientists
2. Increased load on the databases
But!
The comment is out of date...
Now what?
GitHub PR hell!
Goal: Productive and effective tool for data discovery
Have we met our goal?
Audience for data discovery
Data Discovery - User personas
• Data Modelers
• Data Scientists
• Research Scientists
• General Managers
• Data Platform Engineers
• Experimenters
• Product Managers
3 Data Scientist personas
Power user
● All info in their head
● Get interrupted a lot due to questions
Noob user
● Lost
● Ask “power users” a lot of questions
Manager
● Dependencies landing on time
● Communicating with stakeholders
Data Discovery answers 3 kinds of questions
Search based
● Where is the table/dashboard for X? What does it contain?
● Does this analysis already exist?
Lineage based
● I am changing a data model, who are the owners and most common users?
● This table’s delivery was delayed today, I want to notify everyone downstream.
Network based
● I want to follow a power user in my team.
● I want to bookmark tables of interest and get a feed of data delays, schema changes, and incidents.
Data graph - 3 kinds of nodes
• Data sets
• Analysis
• People
Summary
• Primarily for data scientists
• Index information about data sets, analysis and people
• Answer search based, lineage based and network based questions
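The three kinds of nodes and the three kinds of questions can be sketched with a toy in-memory graph. This is only an illustration under stated assumptions: the class `DataGraph`, the relation names `feeds` and `uses`, and the table, dashboard, and user names are all hypothetical and are not taken from Amundsen.

```python
from collections import deque


class DataGraph:
    """Toy data graph with three kinds of nodes: dataset, analysis, person."""

    def __init__(self):
        self.kinds = {}   # node name -> "dataset" | "analysis" | "person"
        self.edges = {}   # (source, relation) -> set of target names

    def add_node(self, name, kind):
        self.kinds[name] = kind

    def add_edge(self, source, relation, target):
        self.edges.setdefault((source, relation), set()).add(target)

    def search(self, term):
        """Search based: which data sets or analyses mention the term?"""
        return sorted(
            name for name, kind in self.kinds.items()
            if kind != "person" and term.lower() in name.lower()
        )

    def downstream(self, dataset):
        """Lineage based: everything derived from this data set (breadth-first)."""
        seen, pending = set(), deque([dataset])
        while pending:
            node = pending.popleft()
            for child in self.edges.get((node, "feeds"), ()):
                if child not in seen:
                    seen.add(child)
                    pending.append(child)
        return sorted(seen)

    def users_of(self, dataset):
        """Network based: which people use this data set?"""
        return sorted(
            name for name, kind in self.kinds.items()
            if kind == "person" and dataset in self.edges.get((name, "uses"), ())
        )


# Hypothetical example data.
graph = DataGraph()
graph.add_node("events.strata_attendance", "dataset")
graph.add_node("reports.attendance_daily", "dataset")
graph.add_node("Attendance dashboard", "analysis")
graph.add_node("power_user", "person")
graph.add_edge("events.strata_attendance", "feeds", "reports.attendance_daily")
graph.add_edge("reports.attendance_daily", "feeds", "Attendance dashboard")
graph.add_edge("power_user", "uses", "events.strata_attendance")

print(graph.search("attendance"))
print(graph.downstream("events.strata_attendance"))
print(graph.users_of("events.strata_attendance"))
```

A production data graph would more likely live in a graph database than in Python dictionaries; the point here is only that one graph can serve all three question types.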
Buy vs. Build vs. Adopt
[Comparison table: products vs. criteria]
Products: Alation, WhereHows, Airbnb Data Portal, Cloudera Navigator, Apache Atlas
Criteria:
• 3 kinds of questions: search based, lineage based, network based
• Hive/Presto support
• Redshift support
• Open source (pref.)
Meet Amundsen
First person to discover the South Pole -
Norwegian explorer, Roald Amundsen
Amundsen - landing page
Amundsen - table detail page
Amundsen - column details
Amundsen - column details
Amundsen Architecture
Pillar #1
Building a data graph
[Architecture diagram: Front end service, Search service, Graph service, PostgreSQL service; labeled flows: "Update description", "Update metadata"]
Pillar #2
Push model vs. Pull Model
Pull model vs. push model
Pull Model
● Periodically update the index by pulling from the system (e.g. database) via crawlers.
[Diagram: Scheduler, Crawler, Database, Data graph]
Push Model
● The system (e.g. database) pushes metadata to a message bus that downstream systems subscribe to.
[Diagram: Database, Message queue, Data graph]
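The difference between the two models can be sketched in a few lines. This is a minimal illustration, not Amundsen's implementation: `Database`, `DataGraphIndex`, `crawl`, and `consume` are hypothetical names, and the standard-library `queue.Queue` stands in for a real message bus.

```python
import queue


class DataGraphIndex:
    """Hypothetical stand-in for the data graph's metadata index."""

    def __init__(self):
        self.tables = {}

    def upsert(self, table, metadata):
        self.tables[table] = metadata


class Database:
    """Hypothetical source system that holds table metadata."""

    def __init__(self, bus=None):
        self.metadata = {}
        self.bus = bus  # message queue, only set in the push model

    def update_table(self, table, metadata):
        self.metadata[table] = metadata
        if self.bus is not None:
            # Push model: the source publishes every change as a message.
            self.bus.put({"table": table, "metadata": metadata})


def crawl(database, index):
    """Pull model: a scheduler would call this crawler periodically."""
    for table, metadata in database.metadata.items():
        index.upsert(table, metadata)


def consume(bus, index):
    """Push model: the data graph drains the messages it subscribed to."""
    while True:
        try:
            message = bus.get_nowait()
        except queue.Empty:
            break
        index.upsert(message["table"], message["metadata"])


# Pull: the index stays stale until the next crawl runs.
pull_db, pull_index = Database(), DataGraphIndex()
pull_db.update_table("default.my_table", {"owner": "data-eng"})
stale_snapshot = dict(pull_index.tables)  # nothing indexed yet
crawl(pull_db, pull_index)

# Push: the change is already waiting on the bus when the consumer runs.
bus = queue.Queue()
push_db, push_index = Database(bus), DataGraphIndex()
push_db.update_table("default.my_table", {"owner": "data-eng"})
consume(bus, push_index)
```

In the pull sketch the crawler has to know how to read the database, while in the push sketch the only shared contract is the message shape (`table`, `metadata`).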
Pull model vs. push model
Pull Model
● Onus of integration lies on data graph
● No interface to prescribe, hard to maintain crawlers
[Diagram: Crawler, Database, Data graph]
Preferred if
● Waiting for indexing is ok
● Working with “strapped” teams
● There’s already an interface
Push Model
● Onus of integration lies on database
● Message format serves as the interface
● Allows for near-real time indexing
[Diagram: Database, Message queue, Data graph]
Preferred if
● Near-real time indexing is important
● Clean interface doesn’t exist
● Other tools like WhereHows are moving towards Push Model
Pillar #3
Relevance vs. Popularity
Relevance - search for “apple” on Google
Low relevance High relevance
Popularity - search for “apple” on Google
Low popularity High popularity
Striking the balance
Relevance
● Descriptions, owners, frequent users
Popularity
● Querying activity
● Dashboarding
● Different weights for automated vs ad hoc querying
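One way to picture the balance is a score that blends a text-match relevance with a weighted query count. The sketch below is an assumption-laden illustration, not Amundsen's ranking: the weights, the 50/50 blend, the `TableStats` fields, and the sample tables are all made up, and relevance here looks only at names and descriptions.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    name: str
    description: str
    adhoc_queries: int      # queries written by people
    automated_queries: int  # queries issued by scheduled jobs


# Assumed weights: an ad hoc query signals more human interest than an automated one.
ADHOC_WEIGHT = 1.0
AUTOMATED_WEIGHT = 0.1


def relevance(table, term):
    """Fraction of the text fields (name, description) that mention the term."""
    term = term.lower()
    fields = (table.name.lower(), table.description.lower())
    return sum(term in text for text in fields) / len(fields)


def popularity(table):
    """Weighted query count, with automated queries discounted."""
    return ADHOC_WEIGHT * table.adhoc_queries + AUTOMATED_WEIGHT * table.automated_queries


def rank(tables, term, popularity_share=0.5):
    """Return names of matching tables, best first, blending both signals."""
    max_popularity = max((popularity(t) for t in tables), default=0.0) or 1.0

    def score(table):
        return ((1 - popularity_share) * relevance(table, term)
                + popularity_share * popularity(table) / max_popularity)

    matches = [t for t in tables if relevance(t, term) > 0]
    return [t.name for t in sorted(matches, key=score, reverse=True)]


# Hypothetical tables: rides_raw is hit mostly by scheduled jobs.
tables = [
    TableStats("rides_raw", "raw rides events", adhoc_queries=5, automated_queries=1000),
    TableStats("rides_summary", "daily rides per city", adhoc_queries=200, automated_queries=10),
    TableStats("drivers", "driver profiles", adhoc_queries=50, automated_queries=0),
]

print(rank(tables, "rides"))
```

With these made-up numbers, counting every query equally would put `rides_raw` first because of its scheduled jobs; discounting automated queries lets the table that people actually query rank higher.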
Summary
• High level architecture & user personas
• Data discovery is making data scientists unproductive
• 3 types of data discovery - search, lineage and network based
• 3 types of data graph nodes - data sets, analysis and users
• 3 pillars of Amundsen architecture
‒ Building a data graph
‒ Push vs. pull model
‒ Relevance vs. popularity
• Lyft’s work in progress for data discovery - Amundsen
Mark Grover | @mark_grover
Deepak Tiwari | @_deepaktiwari_
go.lyft.com/strata18
Icons under Creative Commons License from https://thenounproject.com/
