Designing Data Pipelines Using Hadoop

•Transferir como PPTX, PDF•

10 gostaram•10,932 visualizações

This presentation will cover the design principles and techniques used to build data pipelines taking into consideration the following aspects: architecture evolution, capacity, data quality, performance, flexibility and alignment with business objectives. The discussions will be based on the context of managing a pipeline with multi-petabyte data sets; a code-base composed of Java map/reduce jobs with HBase integration; Hive scripts and Kafka/Storm inputs. We?ll talk about how to make sure that data pipelines have the following features: 1) Assurance that the input data is ready at each step. 2) Workflows are easy to maintain. 3) Data quality and validation comes included in the architecture. Part of presentation will be dedicated to show how to organize the warehouse using layers of data sets. A suggested starting point for these layers are: 1) Raw Input (Logs, Messages, etc.), 2) Logical Input (Scrubbed data), 3) Foundational Warehouse Data (Most relevant joins), 4) Departmental/Project Data Sets and 5) Report Data Sets. (Used by Traditional Report engines) The final part will discuss the design of a rule-based system to perform validation and trending reporting.

Tecnologia Negócios

Rocket Fuel
Big Data and Artificial Intelligence for Digital Advertising
Abhijit Pol
Marilson Campos
Designing Data Pipelines
July, 2013

What We Do?
Data
Partners*
Optimize
Bid
Request
Rocket Fuel
Winning Ad
Ad Request
Ad Served to
User
Page
Request
Bid & Ad
Web Browser
Rocket Fuel Platform
Real-time Bidder
Automated Decisions
Response
Prediction
Model
Publishers
User
Engagement
Recorded
User Engages with
Ad
Refresh
learning
Campaign &
User Data
Warehouse
Qualify
Audience
Some Exchange Partners
Ad
Excha
nge
Ads &
Budget

How Big Is This Problem Each Day?
Trades on NASDAQ
Facebook Page Views
Searches on Google
Bid Requests Considered by Rocket Fuel

How Big Is This Problem Each Day?
Trades on NASDAQ
Facebook Page Views
Searches on Google
Bid Requests Considered by Rocket Fuel
~5 billion
10 million
30 billion
~20 billion

Outline
•Architecture Evolution
•Hurdles and Challenges Faced
•Data Pipelines Best Practices

Architecture for Growth
•20 GB/month to 2 PB/month in 3 years
•New and complex requirements
•More consumers
•Rapid growth

Hurdles and Challenges Faced
•Exponential data growth and user
queries
•Network issues
•Bots
•Bad user queries

Data Pipeline Design Best Practices
Job Design
Consistency
Job Features
Avoid Re-work Golden Input
Shadow Cluster
Data Collection
Dashboard

Job Design / Consistency
•Idempotent
•Execution by different users
•Account for Execution Time

Job Features / Re-Work
•Smaller Jobs
•Record completion of steps

Recording completion times
Start
Is mark
already
there?
Step of workflow, job or script
Yes
No
Execute work
for the step.
Create the mark
End
Collect other
data (Optional)

Golden Input / Shadow Cluster
•Integration tests on realistic data sets.
•Safe environment to innovate.

Data Collection - Delivery time view
J
Data product
Workflow Workflow
Job
Job
Job Job
Job Job
Job
Job
JobJob
Job
Hive/Pig SSH Script
J J… J
J
Hive
J J J
Pig
…

Data collection : Data profiles view
Data product
Data set
Data set
= Data Set
= Transformation
Record Size & Type
Job
Counts
Join success ratios Data Set
Consistency

Data Collection Hierarchy
wk_external_events
wk_build_profile
user_profile
extract_fields
consolidate_metrics
load_into_data_centers
extract_features
compact_user_profile
Workflow/Job/Script StepData Product

Mais conteúdo relacionado

Mais procurados

Building a Just in Time Data Warehouse by Dan Morris and Jason PohlSpark Summit

Notebooks @ Netflix: From analytics to engineering with Jupyter notebooksMichelle Ufford

Data in Motion vs Data at RestInternap

The analytics journey at Viewbix - how they came to use Snowplow and the setu...yalisassoon

Introduction to Google Cloud Platform for Big Data - Trusted ConfIn Marketing We Trust

Snowplow the evolving data pipelineyalisassoon

Data driven video advertising campaigns - JustWatch & SnowplowGiuseppe Gaviani

Spark Summit Keynote by Shaun ConnollySpark Summit

How we use Hive at SnowPlow, and how the role of HIve is changingyalisassoon

Altis Webinar: Use Cases For The Modern Data PlatformAltis Consulting

Big Data and ML on Google CloudWlodek Bielski

Building the Ideal Stack for Machine LearningSingleStore

CTO View: Driving the On-Demand Economy with Predictive AnalyticsSingleStore

Revolutionizing the Legal Industry with Spark, NLP and Azure Databricks at Cl...Databricks

TripleLift: Preparing for a New Programmatic Ad-Tech WorldVoltDB

DATA @ NFLX (Tableau Conference 2014 Presentation)Blake Irvine

Driving the On-Demand Economy with Predictive AnalyticsSingleStore

The Real-Time CDO and the Cloud-Forward Path to Predictive AnalyticsSingleStore

Snowplow - Evolve your analytics stack with your businessGiuseppe Gaviani

Simply Business - Near Real Time Event Processingidan_by

Mais procurados (20)

Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl

Notebooks @ Netflix: From analytics to engineering with Jupyter notebooks

Data in Motion vs Data at Rest

The analytics journey at Viewbix - how they came to use Snowplow and the setu...

Introduction to Google Cloud Platform for Big Data - Trusted Conf

Snowplow the evolving data pipeline

Data driven video advertising campaigns - JustWatch & Snowplow

Spark Summit Keynote by Shaun Connolly

How we use Hive at SnowPlow, and how the role of HIve is changing

Altis Webinar: Use Cases For The Modern Data Platform

Big Data and ML on Google Cloud

Building the Ideal Stack for Machine Learning

CTO View: Driving the On-Demand Economy with Predictive Analytics

Revolutionizing the Legal Industry with Spark, NLP and Azure Databricks at Cl...

TripleLift: Preparing for a New Programmatic Ad-Tech World

DATA @ NFLX (Tableau Conference 2014 Presentation)

Driving the On-Demand Economy with Predictive Analytics

The Real-Time CDO and the Cloud-Forward Path to Predictive Analytics

Snowplow - Evolve your analytics stack with your business

Simply Business - Near Real Time Event Processing

Semelhante a Designing Data Pipelines Using Hadoop

Microsoft bing ads product overview Samia Kesseiri

Digital Strategy for future businessAshish Bhasin

Taming the Big Data Beast to Drive More Internet SalesVickie Gibbs

Tạp trí Internet Marketing Số 19 - FEB 2013Nguyễn Văn Mạnh

Socitm Supplier Briefing LondonSocitm

Socitm Supplier Briefing LondonSocitm Briefings

BrightEdge Share15 - S305: Data Learning & Decision Making – Crawl, Walk & Ru...BrightEdge Technologies

Trajectory Series i-Corps How Your Startup Makes $$ (Feb 2021)Dave Parker

Unifying Marketing Data & Multi-Touch Attribution AnalysisPrinciple America

Making display advertising work for dealers - pdf - sept 24,14Ian Cruickshank

Stop drowning in data 032013Vickie Gibbs

Invite media playbook reportAdCMO

Invite media playbookAdCMO

Digital marketing strategy playbookAdCMO

Digital Marketing Approach - FinoitVibes Communications Pvt Ltd

Digital Velocity 2014 Morning Keynote: "Building an Effective Digital Marketi...Tealium

David cutler projects and activitiesSales Strategy and Innovation Delivery

Computational Marketing at Groupon - JCSSE 2017Clovis Chapman

Gartner AADI 2010 Sponsor PresentationPascal Winckel

Nicholas Gorski: Real-time revenue science at TwitterDavid Garrison

Semelhante a Designing Data Pipelines Using Hadoop (20)

Microsoft bing ads product overview

Digital Strategy for future business

Taming the Big Data Beast to Drive More Internet Sales

Tạp trí Internet Marketing Số 19 - FEB 2013

Socitm Supplier Briefing London

BrightEdge Share15 - S305: Data Learning & Decision Making – Crawl, Walk & Ru...

Trajectory Series i-Corps How Your Startup Makes $$ (Feb 2021)

Unifying Marketing Data & Multi-Touch Attribution Analysis

Making display advertising work for dealers - pdf - sept 24,14

Stop drowning in data 032013

Invite media playbook report

Invite media playbook

Digital marketing strategy playbook

Digital Marketing Approach - Finoit

Digital Velocity 2014 Morning Keynote: "Building an Effective Digital Marketi...

David cutler projects and activities

Computational Marketing at Groupon - JCSSE 2017

Gartner AADI 2010 Sponsor Presentation

Nicholas Gorski: Real-time revenue science at Twitter

Mais de DataWorks Summit

Data Science Crash CourseDataWorks Summit

Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit

Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit

HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit

Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit

Managing the Dewey Decimal SystemDataWorks Summit

Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit

HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit

Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit

Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit

Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit

Security Framework for Multitenant ArchitectureDataWorks Summit

Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit

Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit

Extending Twitter's Data Platform to Google CloudDataWorks Summit

Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit

Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit

Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit

Computer Vision: Coming to a Store Near YouDataWorks Summit

Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit

Mais de DataWorks Summit (20)

Data Science Crash Course

Floating on a RAFT: HBase Durability with Apache Ratis

Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi

HBase Tales From the Trenches - Short stories about most common HBase operati...

Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...

Managing the Dewey Decimal System

Practical NoSQL: Accumulo's dirlist Example

HBase Global Indexing to support large-scale data ingestion at Uber

Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix

Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi

Supporting Apache HBase : Troubleshooting and Supportability Improvements

Security Framework for Multitenant Architecture

Presto: Optimizing Performance of SQL-on-Anything Engine

Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...

Extending Twitter's Data Platform to Google Cloud

Event-Driven Messaging and Actions using Apache Flink and Apache NiFi

Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger

Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...

Computer Vision: Coming to a Store Near You

Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

Último

DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy

E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxnull - The Open Security Community

SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays

Anypoint Exchange: It’s Not Just a Repo!Manik S Magar

Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation

CloudStudio User manual (basic edition):comworks

Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm

Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro

Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed

What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett

Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB

Story boards and shot lists for my a level piececharlottematthew16

Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren

How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe

From Family Reminiscence to Scholarly Archive .Alan Dix

TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc

Advanced Computer Architecture – An IntroductionDilum Bandara

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati

DMCC Future of Trade Web3 - Special EditionDubai Multi Commodity Centre

Designing Data Pipelines Using Hadoop

1. Rocket Fuel Big Data and Artificial Intelligence for Digital Advertising Abhijit Pol Marilson Campos Designing Data Pipelines July, 2013

2. What We Do? Data Partners* Optimize Bid Request Rocket Fuel Winning Ad Ad Request Ad Served to User Page Request Bid & Ad Web Browser Rocket Fuel Platform Real-time Bidder Automated Decisions Response Prediction Model Publishers User Engagement Recorded User Engages with Ad Refresh learning Campaign & User Data Warehouse Qualify Audience Some Exchange Partners Ad Excha nge Ads & Budget

3. How Big Is This Problem Each Day? Trades on NASDAQ Facebook Page Views Searches on Google Bid Requests Considered by Rocket Fuel

4. How Big Is This Problem Each Day? Trades on NASDAQ Facebook Page Views Searches on Google Bid Requests Considered by Rocket Fuel ~5 billion 10 million 30 billion ~20 billion

5. BIG DATA + AI

6. Advertising That Learns

7. Outline •Architecture Evolution •Hurdles and Challenges Faced •Data Pipelines Best Practices

8. Architecture for Growth •20 GB/month to 2 PB/month in 3 years •New and complex requirements •More consumers •Rapid growth

9. How We Started?

10. Architecture 2.0

11. Current Architecture

12. Outline •Architecture Evolution •Hurdles and Challenges Faced •Data Pipelines Best Practices

13. Hurdles and Challenges Faced •Exponential data growth and user queries •Network issues •Bots •Bad user queries

14. Outline •Architecture Evolution •Hurdles and Challenges Faced •Data Pipelines Best Practices

15. Data Pipeline Design Best Practices Job Design Consistency Job Features Avoid Re-work Golden Input Shadow Cluster Data Collection Dashboard

16. Job Design / Consistency •Idempotent •Execution by different users •Account for Execution Time

17. Job Execution Timeline

18. Job Features / Re-Work •Smaller Jobs •Record completion of steps

19. Recording completion times Start Is mark already there? Step of workflow, job or script Yes No Execute work for the step. Create the mark End Collect other data (Optional)

20. Golden Input / Shadow Cluster •Integration tests on realistic data sets. •Safe environment to innovate.

21. Data Collection - Delivery time view J Data product Workflow Workflow Job Job Job Job Job Job Job Job JobJob Job Hive/Pig SSH Script J J… J J Hive J J J Pig …

22. Data collection : Data profiles view Data product Data set Data set = Data Set = Transformation Record Size & Type Job Counts Join success ratios Data Set Consistency

23. Data Collection Hierarchy wk_external_events wk_build_profile user_profile extract_fields consolidate_metrics load_into_data_centers extract_features compact_user_profile Workflow/Job/Script StepData Product

24. Golden Input / Shadow Cluster •Integration tests on realistic data sets. •Safe environment to innovate.

25. Dashboard • Delivery Time • Data Profile Ratios • Counters • Alarms

26. Thank you www.rocketfuel.com

Designing Data Pipelines Using Hadoop

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a Designing Data Pipelines Using Hadoop

Semelhante a Designing Data Pipelines Using Hadoop (20)

Mais de DataWorks Summit

Mais de DataWorks Summit (20)

Último

Último (20)

Designing Data Pipelines Using Hadoop