IDEAS Amundsen Presentation

Daniel Won

Slides for IDEAS 2019 Amundsen Presentation

Data & Analytics

Saturday, October 26th 2019
Alagappan Sethuraman | Engineering Manager, Lyft
Daniel Won | Software Engineer, Lyft
Disrupting Data Discovery

Agenda
• What is Data Discovery?
• Challenges in Data Discovery
• Introducing Amundsen
• Amundsen Architecture
• Impact and Future Work

Data is used to make informed decisions
Analysts, Data Scientists, General Managers, Engineers, Experimenters, Product Managers
Data-driven decision making process:
1. Search & find data
2. Understand the data
3. Perform an analysis/visualization
4. Share insights and/or make a decision
Make data the heart of every decision

What is Data Discovery?
Consider a data-driven decision making process:
1. Search & find data
2. Understand the data
3. Perform an analysis/create a visualization
4. Share insights and/or make a decision
Steps 1 and 2 together make up Data Discovery.

Hi! I’m a new Analyst!
• My first project is to predict the attendance for the IDEAS conference
• Goal: Help the office team decide on the number of chairs to provide
• Idea: Let’s take a look at attendance from previous conferences… but where do I look?

Step 1: Search & find data
• Ask a friend/manager/coworker
• Ask in a wider Slack channel
• Search in the GitHub repos
We end up finding a table, hosted_events, that seems to be the right one

Step 2: Understand the data
• You find several columns that might be what you're looking for:
‒ booked, registered, and attendance
• But you still have many questions, such as:
‒ Does attendance include staff?
‒ What's the difference between booked and registered?
‒ How accurate are these figures?

Step 2: Understand the data
• Look for further documentation on these columns
‒ Where does this documentation live?
• Ask an expert who knows this table
‒ Who is an expert?
• Run some queries to try to figure it out, at the risk of being wrong:
SELECT * FROM schema.hosted_events
LIMIT 100;
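
The "run some queries" step might look like the sketch below. It uses Python's built-in sqlite3 module with an in-memory table. The sample rows and the helper name profile_columns are assumptions made for illustration; the talk does not show the real contents of hosted_events.

```python
import sqlite3

# Hypothetical sample rows; the real hosted_events table is not shown in the talk.
ROWS = [
    ("IDEAS 2017", 120, 100, 90),
    ("IDEAS 2018", 150, 140, 118),
    ("IDEAS 2019", 200, 180, None),  # attendance not yet recorded
]

def profile_columns(conn, table, columns):
    """Return {column: (non_null_count, min, max)} for each numeric column."""
    stats = {}
    for col in columns:
        # Identifiers are interpolated directly, so they must come from a trusted list.
        row = conn.execute(
            f"SELECT COUNT({col}), MIN({col}), MAX({col}) FROM {table}"
        ).fetchone()
        stats[col] = row
    return stats

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hosted_events "
    "(event TEXT, booked INTEGER, registered INTEGER, attendance INTEGER)"
)
conn.executemany("INSERT INTO hosted_events VALUES (?, ?, ?, ?)", ROWS)

stats = profile_columns(conn, "hosted_events", ["booked", "registered", "attendance"])
for col, (count, low, high) in stats.items():
    print(f"{col}: non-null={count}, min={low}, max={high}")
```

Counts and ranges like these show which columns are populated, but they cannot say whether attendance includes staff. That gap is the reason the slide flags the risk of being wrong.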

Nearly 1/3 of Data Scientist time is spent in Data Discovery
• Data discovery is a problem because of the lack of understanding of what data exists, where, who owns it, & how to use it.
• Data Discovery provides little to no intrinsic value
• Impactful work happens in Analysis

What is Amundsen?
• Built at Lyft, official launch in late 2018
• Inspired by Google Search, Airbnb Data Portal, and Apache Gobblin
• Named after Norwegian explorer Roald Amundsen
‒ Led the first expedition to the South Pole
‒ Led the first expedition through the Northwest Passage

Computed Column Statistics
Disclaimer: these stats are arbitrary.

Neo4j is the source of truth for editable metadata

Why not propagate the edited metadata back to the source?

Amundsen’s Impact at Lyft
• Deployed at Lyft for over 1 year
• Over 700 Weekly Active Users
• 90% penetration among Data Scientists
• Reduced mean time to discovery by 75%
• Also used by Data Eng, Software Eng, PMs, Ops, Marketing Managers, and more

Amundsen is Open Source!
• github.com/lyft/amundsen
• 200+ GitHub stars, 10+ companies contributing back
• Slack channel with 250+ people from 30+ companies
• Presented at conferences in San Francisco, Barcelona, Vilnius, Moscow, LA, and NYC by Lyft employees and community members

Community Overview
Contributors, Active community

Alagappan Sethuraman | /in/alagappanut
Daniel Won | /in/danwon
Project Code @ github.com/lyft/amundsen
Icons under Creative Commons License from https://thenounproject.com/

IDEAS Amundsen Presentation

1. Disrupting Data Discovery
2. Agenda
3. What is Data Discovery?
4. Data is used to make informed decisions
5. What is Data Discovery?
6. Challenges in Data Discovery
7. Hi! I’m a new Analyst!
8. Step 1: Search & find data
9. Step 2: Understand the data
10. Step 2: Understand the data
11. Nearly 1/3 of Data Scientist time is spent in Data Discovery
12. Introducing Amundsen
13. What is Amundsen?
14. Home Page
15. Search
16. Resource Metadata
17. Resource Ownership
18. Data Preview
19. Computed Column Statistics
20. Requesting Descriptions
21. User Profile
22. In-Application User Feedback
23. Amundsen Architecture
24. Amundsen Architecture
25. Why choose a graph database?
26. Why Graph database? (1/2)
27. View Resource Metadata
28. Why Graph database? (2/2)
29. Neo4j is the source of truth for editable metadata
30-33. Why not propagate the edited metadata back to the source?
34. Impact at Lyft
35. Amundsen’s Impact at Lyft
36. Future Work
37. Search Preview
38. Advanced Search
39. More Metadata
40. We're Open Source
41. Amundsen is Open Source!
42. Community Overview
43. Thank You
44. Alagappan Sethuraman | /in/alagappanut; Daniel Won | /in/danwon; project code at github.com/lyft/amundsen; icons under Creative Commons License from https://thenounproject.com/

Editor's Notes

Name & role: working on an open-source data discovery tool at Lyft called “Amundsen” (more on that name later). It leverages Neo4j, and we're glad to share how we’ve been using Neo4j at Lyft to achieve the goals of our product.
On the agenda for this talk
Effective data discovery is important because data is at the heart of every decision we make. It is the only way to make informed, objective decisions. This applies to many roles. Data-driven decision making process: search & find data; understand the data; perform an analysis; share insights or make a decision.
Now onto challenges with data discovery
To highlight some data discovery pain points that occur without the proper tools, let’s walk through a hypothetical example.
Your experience searching and finding data may involve doing all of the following 3 things.
Your experience understanding the data doesn’t get any easier. Each question leads to further questions - How was this data collected?
About ⅓ of time is spent on data discovery. It is difficult to find what exists, understand whether or not it’s what you are looking for, or trust that it is the source of truth for that information. We can significantly increase productivity and impact if we can reduce this time.
We’ve talked about some pain points of data discovery and why it’s important; now let’s talk about our solution: Amundsen.
Disclaimer: representative data; Amundsen circa March 2019. Our landing page is optimized for search, the most common method of data discovery: users are presented with a search bar and help text for some advanced search features. We also want the landing page to help users who don’t know what to search for, so we created the concept of popular tables.
Users are presented with ranked search results. The ranking is not like PageRank; it is based on relevance and popularity.
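One way to picture ranking by relevance and popularity is the sketch below. The scoring formula, the field names (such as query_count), and the sample tables are all assumptions for illustration; these notes do not describe the actual ranking function.

```python
import math
from dataclasses import dataclass

@dataclass
class TableDoc:
    name: str
    description: str
    query_count: int  # hypothetical popularity signal: how often the table is queried

def relevance(doc, query):
    """Fraction of query terms found in the table name or description."""
    terms = query.lower().split()
    text = f"{doc.name} {doc.description}".lower()
    if not terms:
        return 0.0
    return sum(term in text for term in terms) / len(terms)

def score(doc, query):
    # Assumed formula: relevance scaled by log-damped popularity.
    return relevance(doc, query) * (1.0 + math.log1p(doc.query_count))

def search(docs, query):
    """Return matching docs, best score first."""
    hits = [d for d in docs if relevance(d, query) > 0]
    return sorted(hits, key=lambda d: score(d, query), reverse=True)

docs = [
    TableDoc("hosted_events", "events hosted at the office", 500),
    TableDoc("events_tmp", "scratch copy of events", 3),
    TableDoc("rides", "completed rides", 9000),
]
results = [d.name for d in search(docs, "events")]
print(results)
```

In this sketch the heavily used hosted_events outranks the rarely used events_tmp even though both match the query equally, and the very popular rides table is excluded because it is not relevant at all.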
However, graph databases are not common in web applications, so one might ask: why choose a graph database?
Well, if you remember the diagram of the data ecosystem at Lyft from the beginning of the talk, that can be modeled as a graph. This is a very powerful feature, because the alternative for creating these kinds of relationships in an RDBMS is joins, and a NoSQL database isn’t set up for this.
As you may remember from the application walkthrough, Amundsen surfaces resource metadata, and that is what we are storing in Neo4j.
Let’s take note of some of the features from the table detail page again and see how they are represented in Neo4j. (Walk through features.) What’s very beneficial about this is that when we have a new use case and a new piece of metadata to represent, we just have to create the new node and relationship.
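The "new node and relationship" point can be illustrated with a tiny in-memory graph. This is a sketch in plain Python, not the Neo4j API; the node labels (Table, Column, User, Tag) and relationship names (COLUMN, OWNER, TAGGED_BY) are hypothetical and are not taken from Amundsen's actual data model.

```python
from collections import defaultdict

class MetadataGraph:
    """Tiny in-memory stand-in for a property graph (not the Neo4j API)."""

    def __init__(self):
        self.nodes = {}                 # node key -> {"label": ..., plus properties}
        self.edges = defaultdict(list)  # source key -> [(relationship, target key)]

    def add_node(self, key, label, **props):
        self.nodes[key] = {"label": label, **props}

    def relate(self, source, relationship, target):
        self.edges[source].append((relationship, target))

    def neighbors(self, source, relationship):
        return [t for rel, t in self.edges[source] if rel == relationship]

graph = MetadataGraph()
graph.add_node("schema.hosted_events", "Table")
graph.add_node("schema.hosted_events/attendance", "Column", type="int")
graph.add_node("analyst@example.com", "User")
graph.relate("schema.hosted_events", "COLUMN", "schema.hosted_events/attendance")
graph.relate("schema.hosted_events", "OWNER", "analyst@example.com")

# A new kind of metadata (here, a tag) needs only a new node and a new
# relationship; the existing nodes and edges are left untouched.
graph.add_node("tag:events", "Tag")
graph.relate("schema.hosted_events", "TAGGED_BY", "tag:events")

print(graph.neighbors("schema.hosted_events", "OWNER"))
print(graph.neighbors("schema.hosted_events", "TAGGED_BY"))
```

In a relational schema the same change would typically mean a new table plus a join; here nothing existing has to be altered.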
Another key characteristic of our system is that Neo4j is the source of truth for our editable metadata.
This was actually not our original intent; we ran into a roadblock when we were first implementing the description editing feature. We originally had a setup like this.
Then we realized we forgot to account for something. Tables can get rebuilt using the source code that generated the table, and descriptions will be overwritten.
Then we thought about whether or not we could do this: update them both!
The answer was no. ...And that’s how Neo4j became the source of truth for editable metadata.
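The rebuild problem can be simulated in a few lines. The classes below are hypothetical stand-ins written for this sketch (SourceTable for a table generated from source code, MetadataStore for the store that owns user edits); they are not Amundsen or Neo4j code.

```python
class SourceTable:
    """Hypothetical source table whose description is set by the code that builds it."""

    def __init__(self, name, code_description):
        self.name = name
        self.code_description = code_description
        self.description = code_description

    def rebuild(self):
        # Rebuilding regenerates the table from source code, so any
        # description written directly onto the table is lost.
        self.description = self.code_description

class MetadataStore:
    """Stand-in for the metadata store that owns user-edited descriptions."""

    def __init__(self):
        self._edited = {}

    def edit_description(self, table_name, text):
        self._edited[table_name] = text

    def description_for(self, table):
        # Prefer the user's edit; fall back to whatever the source provides.
        return self._edited.get(table.name, table.description)

table = SourceTable("hosted_events", "")
store = MetadataStore()

# Approach 1: write the edit back to the source. It does not survive a rebuild.
table.description = "One row per hosted event"
table.rebuild()
lost = table.description

# Approach 2: keep the edit in the metadata store. It survives the rebuild.
store.edit_description(table.name, "One row per hosted event")
table.rebuild()
kept = store.description_for(table)

print(repr(lost), repr(kept))
```

Keeping edits in one store also avoids having to keep two copies of a description in sync, which is the "update them both" option the notes rule out.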