Taming the ETL beast
How LinkedIn uses metadata to run complex ETL flows reliably
Rajappa Iyer
Strata Conference, London, November 12, 2013
`whoami`

• Data Infrastructure @ LinkedIn since 2011
• Prior to that:
– Director of Engineering at Digg
– Enterprise Data Architect at eBay

• www.linkedin.com/in/rajappaiyer/
Outline of talk
• Background and Context – The Why
• Challenges with Data Delivery – The What
• Metadata to the Rescue – The How
• Q&A
LinkedIn: The World’s Largest Professional Network
Connecting Talent with Opportunity. At scale…

259M+ Members Worldwide
2 new Members Per Second
100M+ Monthly Unique Visitors
3M+ Company Pages
Data Driven Products and Insights
• Products for Members (Professionals)
• Products for Enterprises (Companies)
• Insights (Analysts and Data Scientists)
• Data, Platforms, Analytics
Products for Members
Products for Enterprises

Hire - Talent Solutions

Sell - Sales Navigator

Market - Marketing Solutions
Examples of Insights
Example of Deeper Insight

Job Migration After Financial Collapse
Data is critical to LinkedIn’s products
It needs to be delivered in a reliable and timely manner

LinkedIn Confidential ©2013 All Rights Reserved

A Simplified Overview of Data Flow
[Diagram. Labeled components: Site (Member Facing Products); Activity Data; Kafka; Camus; Member Data; Espresso / Voldemort / Oracle; Changes; Databus; Lumos; External Partner Data; Ingest Utilities; Hadoop; DWH ETL; Teradata; Core Data Set; Derived Data Set; Product, Sciences, Enterprise Analytics; Enterprise Products; Computed Results for Member Facing Products.]
Components of typical ETL jobs
• Ingress / Egress of message-oriented data
– Logs and clickstream data

• Ingress / Egress of record-oriented data
– Database data

• Transformations
– Select, project, join
– Aggregations
– Partitioning
– Cleansing and data normalization
– Schema conversions – e.g., Nested JSON to Relational
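
As an illustration of the last transformation, here is a minimal sketch of converting a nested JSON record into a flat, relational-style row. The function name `flatten`, the underscore naming rule for columns, and the sample event are assumptions made for this example; they are not taken from the talk.

```python
import json
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into one level, joining key names with underscores."""
    row: dict[str, Any] = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent name as a prefix.
            row.update(flatten(value, prefix=f"{column}_"))
        else:
            # Scalars (and lists, which this sketch does not expand) become columns.
            row[column] = value
    return row


# Hypothetical clickstream event, used only to exercise the function.
event = json.loads('{"member_id": 42, "page": {"key": "profile", "locale": "en_GB"}}')
row = flatten(event)
print(row)
```

Arrays are left untouched here; a real conversion would typically have to decide whether to explode them into child tables or serialize them.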

An Example ETL Flow

Challenges
• Complex process dependencies
– Some flows are over 30 levels deep
– Flows may span multiple platforms (Hadoop, RDBMS, etc.)

• Complex data dependencies
– Multiple flows may consume a data element
– Multiple data elements feed into a single flow
– Can be viewed as “data sync barriers”

• Recovery
– Restartable flows that pick up from last checkpoint
– Catch up mode to compensate for downtime

• Monitoring and Alerting
– Prioritization of “important” flows for ops attention
– Who do you call when things fail?
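
To make the recovery requirement concrete, the following is a minimal sketch of a restartable flow that skips steps a previous run already completed. The helper `run_with_checkpoint`, the JSON state file, and the step names are all hypothetical; the talk does not describe how LinkedIn stores checkpoints.

```python
import json
from pathlib import Path
from typing import Callable

Step = tuple[str, Callable[[], None]]


def run_with_checkpoint(steps: list[Step], state_file: Path) -> list[str]:
    """Run steps in order, skipping those recorded as done in state_file."""
    done: list[str] = json.loads(state_file.read_text()) if state_file.exists() else []
    executed: list[str] = []
    for name, action in steps:
        if name in done:
            continue  # completed by an earlier run, so do not repeat it
        action()  # if this raises, the checkpoint still lists the earlier steps
        done.append(name)
        state_file.write_text(json.dumps(done))  # persist progress after each step
        executed.append(name)
    return executed
```

If a step raises, calling the function again with the same state file resumes at the failed step instead of starting over.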

Metadata to the rescue
• What metadata is collected?
– Process dependencies
– Data dependencies
– Execution history and data processing statistics

• How is it used?
– Drives the ETL framework with lots of functionality
  • Check for data availability
  • Retries and restarts
  • Standardized error reporting / alerting
  • Prioritized view of business critical flows
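
One way a framework could combine retries with standardized error reporting is sketched below. The function `run_with_retries` and the shape of the report dictionary are assumptions for illustration, not the framework described in the talk.

```python
import time
from typing import Any, Callable


def run_with_retries(task: Callable[[], Any], max_attempts: int = 3,
                     delay_seconds: float = 0.0) -> dict[str, Any]:
    """Run task up to max_attempts times and return a uniform report."""
    errors: list[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = task()
            return {"status": "SUCCESS", "attempts": attempt,
                    "result": result, "errors": errors}
        except Exception as exc:
            # Record every failure in one format so alerting can parse it.
            errors.append(f"attempt {attempt}: {type(exc).__name__}: {exc}")
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    return {"status": "FAILED", "attempts": max_attempts,
            "result": None, "errors": errors}
```

Because every task returns the same report shape, a dashboard or alerting component can consume results without knowing anything about the individual job.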

Metadata: Process Dependencies
• Capture process dependency graph
– Also capture metadata such as process owners, importance, SLA etc.

• Capture stats for each execution of a workflow
– Time of execution
– Execution status
– Pointer to error logs

• Alert on delayed processes
– Based on execution history

[Diagram: Workflow F, running from Start to Stop through workunits W1, W2, W3, W4 and W5, connected by "on success" and "on failure" transitions.]
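
Alerting on delayed processes from execution history could look like the sketch below. The rule used here (mean plus three standard deviations of past run times) and the name `is_delayed` are assumptions; the talk only says the alert is based on execution history.

```python
from statistics import mean, stdev


def is_delayed(history_minutes: list[float], elapsed_minutes: float,
               sigmas: float = 3.0) -> bool:
    """Flag a run whose elapsed time exceeds mean + sigmas * stdev of past runs."""
    if len(history_minutes) < 2:
        return False  # too little history to estimate a spread
    threshold = mean(history_minutes) + sigmas * stdev(history_minutes)
    return elapsed_minutes > threshold


# Hypothetical durations of the last five runs of one workflow, in minutes.
history = [30.0, 32.0, 28.0, 31.0, 29.0]
print(is_delayed(history, 33.0), is_delayed(history, 45.0))
```

A threshold derived from history adapts per workflow, so a flow that normally takes hours is not held to the same limit as one that takes minutes.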
Metadata: Data Dependencies
[Diagram: Workflow F consumes Data Entity D1 and Data Entity D2 and produces Data Entity D3.]

• For each flow, capture input and output data elements
• For each flow execution, capture stats on data element
  • Number of records or messages processed
  • Error counts
  • Watermarks
    – Can be time based or sequence based
    – This can be per flow as more than one flow can consume a data element
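
The per-flow watermark idea can be sketched as follows, using sequence-based watermarks. The class `WatermarkStore`, its method names, and the record layout of `(sequence_number, payload)` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field

Record = tuple[int, str]  # (sequence number, payload); an assumed layout


@dataclass
class WatermarkStore:
    """Tracks, per (flow, data element), the highest sequence already processed."""
    marks: dict[tuple[str, str], int] = field(default_factory=dict)

    def pending(self, flow: str, element: str, records: list[Record]) -> list[Record]:
        """Return the records this flow has not yet processed."""
        low = self.marks.get((flow, element), -1)
        return [r for r in records if r[0] > low]

    def advance(self, flow: str, element: str, processed: list[Record]) -> None:
        """Move this flow's watermark past the records it just processed."""
        if processed:
            key = (flow, element)
            highest = max(seq for seq, _ in processed)
            self.marks[key] = max(self.marks.get(key, -1), highest)
```

Keying the watermark by flow as well as by data element lets two flows read the same element at different speeds without affecting each other.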
Metadata: Data Elements
• Simple catalog of data elements
– Name, physical location, owner etc.

• Data elements can have logical names
– Names resolve to one or more physical entities
– Logical names can represent useful collections
  • E.g., data as of a particular interval

• Data element availability can trigger processes
– E.g., kick off hourly process when hourly data is complete and available
– Enables data driven ETL scheduling
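
Data driven scheduling on top of logical names might be sketched like this: a flow becomes ready once every physical entity behind its logical inputs is available. The functions `resolve` and `ready_flows`, the catalog layout, and the example paths are assumptions, not the system described in the talk.

```python
def resolve(catalog: dict[str, list[str]], logical_name: str) -> list[str]:
    """Resolve a logical name to the physical entities it stands for."""
    return catalog[logical_name]


def ready_flows(triggers: dict[str, list[str]], catalog: dict[str, list[str]],
                available: set[str]) -> list[str]:
    """Return flows whose logical inputs are fully available."""
    ready: list[str] = []
    for flow, logical_inputs in triggers.items():
        physical = [p for name in logical_inputs for p in resolve(catalog, name)]
        if all(p in available for p in physical):
            ready.append(flow)
    return ready


# Hypothetical catalog: one logical hourly data set backed by two partitions.
catalog = {"pageviews_hourly": ["/data/pv/part-0", "/data/pv/part-1"]}
triggers = {"hourly_rollup": ["pageviews_hourly"]}
print(ready_flows(triggers, catalog, {"/data/pv/part-0", "/data/pv/part-1"}))
```

The scheduler in this sketch never looks at the clock; the hourly flow starts when its data is complete, which is the behaviour the slide calls data driven scheduling.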

Putting it all together
[Diagram. Labeled components: Dashboards, Reports; ETL applications; ETL Framework; Scheduler; Checkpoint Execution State; Retry / Resume; Name resolver; Data Check; Data Availability Status; Execution History; Statistics (process and data); Alerting / Monitoring; Log Parsers; Data Lineage; Metadata Management System.]
Questions?

More at data.linkedin.com
Come Work on Challenging Data Infrastructure problems - We’re Hiring

 

Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably

  • 1. Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably. Rajappa Iyer, Strata Conference, London, November 12, 2013
  • 2. `whoami`
      • Data Infrastructure @ LinkedIn since 2011
      • Prior to that:
          – Director of Engineering at Digg
          – Enterprise Data Architect at eBay
      • www.linkedin.com/in/rajappaiyer/
  • 3. Outline of talk
      • Background and Context – The Why
      • Challenges with Data Delivery – The What
      • Metadata to the Rescue – The How
      • Q&A
  • 4. LinkedIn: The World’s Largest Professional Network. Connecting Talent ↔ Opportunity. At scale…
      • 259M+ Members Worldwide
      • 2 new Members Per Second
      • 100M+ Monthly Unique Visitors
      • 3M+ Company Pages
  • 5. Data Driven Products and Insights [diagram; labels: Products for Members (Professionals); Products for Enterprises (Companies); Insights (Analysts and Data Scientists); Data, Platforms, Analytics]
  • 7. Products for Enterprises: Hire - Talent Solutions; Sell - Sales Navigator; Market - Marketing Solutions
  • 9. Example of Deeper Insight: Job Migration After Financial Collapse
  • 10. Data is critical to LinkedIn’s products. It needs to be delivered in a reliable and timely manner. LinkedIn Confidential ©2013 All Rights Reserved
  • 11. A Simplified Overview of Data Flow [diagram; labels: Site (Member Facing Products); Activity Data; Kafka; Camus; Hadoop; Member Data; Espresso / Voldemort / Oracle; Changes; Databus; External Partner Data; Lumos; Ingest Utilities; DWH ETL; Teradata; Core Data Set; Derived Data Set; Product, Sciences, Enterprise Analytics; Enterprise Products; Computed Results for Member Facing Products]
  • 12. Components of typical ETL jobs
      • Ingress / Egress of message-oriented data
          – Logs and clickstream data
      • Ingress / Egress of record-oriented data
          – Database data
      • Transformations
          – Select, project, join
          – Aggregations
          – Partitioning
          – Cleansing and data normalization
          – Schema conversions, e.g., Nested JSON to Relational
  • 13. An Example ETL Flow
  • 14. Challenges
      • Complex process dependencies
          – Some flows are over 30 levels deep
          – Flows may span multiple platforms (Hadoop, RDBMS etc.)
      • Complex data dependencies
          – Multiple flows may consume a data element
          – Multiple data elements feed into a single flow
          – Can be viewed as “data sync barriers”
      • Recovery
          – Restartable flows that pick up from last checkpoint
          – Catch up mode to compensate for downtime
      • Monitoring and Alerting
          – Prioritization of “important” flows for ops attention
          – Who do you call when things fail?
  • 15. Metadata to the rescue
      • What metadata is collected?
          – Process dependencies
          – Data dependencies
          – Execution history and data processing statistics
      • How is it used?
          – Drives the ETL framework with lots of functionality
              • Check for data availability
              • Retries and restarts
              • Standardized error reporting / alerting
              • Prioritized view of business critical flows
  • 16. Metadata: Process Dependencies
      • Capture process dependency graph
          – Also capture metadata such as process owners, importance, SLA etc.
      • Capture stats for each execution of a workflow
          – Time of execution
          – Execution status
          – Pointer to error logs
      • Alert on delayed processes
          – Based on execution history
      • [diagram: Workflow F runs from Start to Stop through Workunits W1 to W5, connected by "on success" and "on failure" edges]
  • 17. Metadata: Data Dependencies
      • [diagram: Workflow F consumes Data Entity D1 and Data Entity D2 and produces Data Entity D3]
      • For each flow, capture input and output data elements
      • For each flow execution, capture stats on data element
          – Number of records or messages processed
          – Error counts
          – Watermarks
              • Can be time based or sequence based
              • This can be per flow as more than one flow can consume a data element
  • 18. Metadata: Data Elements
      • Simple catalog of data elements
          – Name, physical location, owner etc.
      • Data elements can have logical names
          – Names resolve to one or more physical entities
          – Logical names can represent useful collections
              • E.g., data as of a particular interval
      • Data element availability can trigger processes
          – E.g., kick off hourly process when hourly data is complete and available
          – Enables data driven ETL scheduling
  • 19. Putting it all together [diagram; labels: Dashboards, Reports; ETL applications; Data Availability Status; ETL Framework; Scheduler; Checkpoint; Execution State; Retry / Resume; Name resolver; Execution History; Data Check; Statistics (process and data); Alerting / Monitoring; Log Parsers; Data Lineage; Metadata Management System]
  • 20. Questions? More at data.linkedin.com. Come Work on Challenging Data Infrastructure problems - We’re Hiring
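
The ideas on slides 17 and 18 (per-flow watermarks, data availability as a trigger, and catch-up after downtime) can be illustrated with a minimal sketch. This is not LinkedIn's implementation: the `MetadataStore` class, its method names, and the use of integer hour numbers as sequence-based watermarks are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataStore:
    """Toy in-memory stand-in for a metadata management system (hypothetical)."""

    # Latest hour (integer sequence number) fully landed for each data element.
    available: dict[str, int] = field(default_factory=dict)
    # Watermark per (flow, data element): the last hour that flow has processed.
    watermarks: dict[tuple[str, str], int] = field(default_factory=dict)
    # Data elements consumed by each flow.
    inputs: dict[str, list[str]] = field(default_factory=dict)

    def mark_available(self, element: str, hour: int) -> None:
        """Record that `element` is complete up to and including `hour`."""
        self.available[element] = max(hour, self.available.get(element, -1))

    def runnable_hours(self, flow: str) -> list[int]:
        """Hours the flow may process now; all inputs must be complete (a sync barrier)."""
        elements = self.inputs[flow]
        # Resume just after the flow's own watermark...
        start = min(self.watermarks.get((flow, e), -1) for e in elements) + 1
        # ...and go no further than the least-available input.
        end = min(self.available.get(e, -1) for e in elements)
        return list(range(start, end + 1))

    def commit(self, flow: str, hour: int) -> None:
        """Advance the flow's watermark on every input after a successful run."""
        for e in self.inputs[flow]:
            self.watermarks[(flow, e)] = hour


# Workflow F consumes D1 and D2, as in the slide 17 diagram.
store = MetadataStore(inputs={"F": ["D1", "D2"]})
store.mark_available("D1", 2)
store.mark_available("D2", 1)

hours = store.runnable_hours("F")  # limited by D2, the slower input
for h in hours:
    store.commit("F", h)

# Later, both inputs have landed through hour 4; the flow catches up.
store.mark_available("D1", 4)
store.mark_available("D2", 4)
catch_up = store.runnable_hours("F")
```

Because the watermark is keyed by flow as well as by data element, a second flow consuming D1 would keep its own position and would not be affected by how far F has advanced.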

Editor's Notes

  1. Filter, aggregation, partition, normalization, joins