Lessons Learned
Designing Data Ingest Systems
Abraham Elmahrek (abe@apache.org)
Agenda
1. Overview of Big Data Ingest
2. Real-world examples with lessons interleaved
3. A summary of lessons learned and extra ideas
Big Data Ingest
• Data sources: Ingesting from different data sources is the goal.
• Schema: Data sources differ in structure, but it is mostly their schemas that vary.
• Speed: Batch and real-time ingest both have their places.
Data sources
• Structured: relational databases, spreadsheets, object databases
• Semi-structured: XML, JSON, EDI, etc.
• Unstructured: audio, video, email, etc.
Schema
• Number of schemas: One schema with a relatively flat structure, or many schemas with nested structures.
• Mutability: Immutable schemas can’t be changed. Mutable schemas can evolve. Nested schemas can also have mutability properties.
• Inference: Schema inference upon writing, reading, or offline.
Real Time vs Batch
• Batch: Push data from A -> B on demand.
• Push model: Push data from A -> B consistently. Poll data sources or act upon reception.
• Pull model: Clients pull data from A to write to B. Often an intermediate storage system like Kafka is used to achieve this.
Real world scenario: Form generator
• GOAL: Generate different forms for websites
• Store user information
• Forms cannot change over time
Lesson #1: Structure endpoint wisely
Layout 1
• Form Definition: id, form name, form metadata
• Form 1: id, <field 1>, <field 2>, <field 3>
• Form 2: id, <field 1>, <field 2>, <field 3>
Layout 2
• Form Definition: id, form name
• Field Definition: id, form id, field name, type
• Field Values: id, field id, value
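The layout built from Form Definition, Field Definition, and Field Values can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the author's implementation: the table and column names follow the slide, while the `add_form` helper, the sample forms, and the choice of SQLite are assumptions made for the example.

```python
import sqlite3

# Generic layout from the slide: a new form becomes rows, not a new table.
SCHEMA = """
CREATE TABLE form_definition (
    id INTEGER PRIMARY KEY,
    form_name TEXT NOT NULL
);
CREATE TABLE field_definition (
    id INTEGER PRIMARY KEY,
    form_id INTEGER NOT NULL REFERENCES form_definition(id),
    field_name TEXT NOT NULL,
    type TEXT NOT NULL
);
CREATE TABLE field_values (
    id INTEGER PRIMARY KEY,
    field_id INTEGER NOT NULL REFERENCES field_definition(id),
    value TEXT
);
"""


def add_form(conn, form_name, fields):
    """Hypothetical helper: register a form and its (name, type) fields."""
    form_id = conn.execute(
        "INSERT INTO form_definition (form_name) VALUES (?)", (form_name,)
    ).lastrowid
    conn.executemany(
        "INSERT INTO field_definition (form_id, field_name, type) VALUES (?, ?, ?)",
        [(form_id, name, field_type) for name, field_type in fields],
    )
    return form_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Two made-up forms; neither requires a schema change.
signup_id = add_form(conn, "signup", [("email", "string"), ("age", "int")])
survey_id = add_form(conn, "survey", [("rating", "int")])

# Store one submitted value for the "email" field of the signup form.
email_field_id = conn.execute(
    "SELECT id FROM field_definition WHERE form_id = ? AND field_name = ?",
    (signup_id, "email"),
).fetchone()[0]
conn.execute(
    "INSERT INTO field_values (field_id, value) VALUES (?, ?)",
    (email_field_id, "user@example.com"),
)

table_count = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table'"
).fetchone()[0]
field_count = conn.execute("SELECT COUNT(*) FROM field_definition").fetchone()[0]
```

The number of tables stays fixed however many forms are added, which keeps the ingest endpoint stable. The trade-off is that every value is stored as text and has to be interpreted through the `type` column.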
Real world scenario: Scrape GitHub
• GOAL: Generate a list of active contributors on a repository and general stats about a repository relative to all other repositories.
• Scheduled batch Change Data Capture (CDC).
My implementation (naive)
Lesson #2: At least once is acceptable
• Ingesting data twice doesn’t matter in a lot of cases.
• The cost of re-processing or re-ingesting a few records is normally pretty low.
• It’s easy to manage and implement.
• Exactly once semantics, in contrast, is not feasible
– Usually requires some de-duping
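One way to see why duplicates are often harmless is to pair at-least-once delivery with a sink that upserts by record key. The sketch below is a simulation under stated assumptions: the `IdempotentSink` class, the `ingest_at_least_once` function, and the idea that each record carries a unique `id` are all hypothetical and not taken from the talk.

```python
class IdempotentSink:
    """In-memory stand-in for a store that upserts by record id."""

    def __init__(self):
        self.rows = {}

    def write(self, record):
        # Writing the same id twice overwrites the row instead of duplicating it.
        self.rows[record["id"]] = record


def ingest_at_least_once(records, sink, lose_first_ack_for=()):
    """Send each record until it is acknowledged; return the delivery count."""
    pending_ack_loss = set(lose_first_ack_for)
    deliveries = 0
    for record in records:
        while True:
            sink.write(record)
            deliveries += 1
            if record["id"] in pending_ack_loss:
                # Simulated lost acknowledgement: the write succeeded, but the
                # sender cannot know that, so it sends the record again.
                pending_ack_loss.discard(record["id"])
                continue
            break
    return deliveries


records = [{"id": "a", "v": 1}, {"id": "b", "v": 2}, {"id": "c", "v": 3}]
sink = IdempotentSink()
deliveries = ingest_at_least_once(records, sink, lose_first_ack_for={"b"})
```

Record "b" is delivered twice, yet the sink ends up with three rows. The sender only needs a retry loop, and the de-duplication is a side effect of keying writes by id.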
A better implementation
My favorite implementation
Lesson #3: CDC is hard
• Change Data Capture (CDC) without a change log or an easy way to calculate differences is hard.
• Almost always requires some customized effort.
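When a source offers no change log, one common custom approach is to diff two full snapshots. The following is a minimal sketch of that idea, not the approach used in the talk: the `fingerprint` and `diff_snapshots` functions and the contributor data are made up, and it assumes every row has a stable key and fits in memory.

```python
import hashlib
import json


def fingerprint(row):
    """Stable hash of a row, so changed rows can be detected by comparison."""
    payload = json.dumps(row, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def diff_snapshots(previous, current):
    """Compare two {key: row} snapshots; return (inserted, updated, deleted) keys."""
    prev_hashes = {key: fingerprint(row) for key, row in previous.items()}
    curr_hashes = {key: fingerprint(row) for key, row in current.items()}
    inserted = sorted(curr_hashes.keys() - prev_hashes.keys())
    deleted = sorted(prev_hashes.keys() - curr_hashes.keys())
    updated = sorted(
        key
        for key in curr_hashes.keys() & prev_hashes.keys()
        if curr_hashes[key] != prev_hashes[key]
    )
    return inserted, updated, deleted


# Hypothetical contributor stats from two scheduled scrapes.
previous = {"alice": {"commits": 10}, "bob": {"commits": 3}}
current = {"alice": {"commits": 12}, "carol": {"commits": 1}}
inserted, updated, deleted = diff_snapshots(previous, current)
```

The cost is a full read of the source on every run, which illustrates why CDC without a change log tends to need custom effort.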
Real world scenario: Ad attribution system
• GOAL: Gather impression and click information. Attribute to different vendors based on impressions and clicks.
• Expose a view for customers to understand their usage.
• Near real time (NRT) with batch error checking.
Lesson #4: Know thy SLA
• What is the incidence of errors?
• How frequently should errors be checked?
• Is data loss acceptable?
• Is duplication acceptable?
Push version
[Diagram components: Click Logs, Impression Logs, VIP, three Scribe Masters, HBase, MySQL]
Push version analysis
• Negatives
– Scribe would lose data in some edge cases. That’s not good for attribution systems (money involved).
– The number of messages being written to HBase would cause major compactions on a weekly basis, halting the pipeline.
• Positives
– Latency was super low
– Relatively easy to maintain given Scribe configuration
* Flume would have been a better choice!
Pull version
[Diagram components: Click Logs, Impression Logs, VIP, two Producers, RabbitMQ, two Consumers, HBase, MySQL]
Pull version analysis
• Negatives
– Requires more management and configuration.
• Positives
– Lets you choose between at-most-once semantics (possible data loss) and at-least-once semantics (possible duplicates).
– Intermediate storage relieves HBase.
* Kafka would have been a cool choice!
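In a pull model, the choice between at-most-once and at-least-once usually comes down to when the consumer acknowledges a message relative to processing it. The sketch below simulates that with an in-memory queue. It does not use RabbitMQ or Kafka client APIs, and the `consume` function and its parameters are hypothetical.

```python
from collections import deque


def consume(messages, ack_before_processing, crash_on):
    """Drain a queue into a store, simulating one consumer crash on `crash_on`."""
    queue = deque(messages)
    store = []
    crashed = False
    while queue:
        message = queue[0]  # peek: the message stays queued until acknowledged
        if ack_before_processing:
            queue.popleft()  # acknowledge first (at most once)
        if message == crash_on and not crashed:
            crashed = True
            if not ack_before_processing:
                # Crash after the write but before the ack: the message is
                # still queued and will be delivered again.
                store.append(message)
            # With ack-first, the crash happens before the write: message lost.
            continue
        store.append(message)
        if not ack_before_processing:
            queue.popleft()  # acknowledge after processing (at least once)
    return store


at_most_once = consume(["m1", "m2", "m3"], ack_before_processing=True, crash_on="m2")
at_least_once = consume(["m1", "m2", "m3"], ack_before_processing=False, crash_on="m2")
```

With acknowledgement first, "m2" never reaches the store. With acknowledgement last, it reaches the store twice. Which outcome is tolerable is a question for the SLA.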
Lesson #5: Record format and schema
1. Structureless (or simple structure) and schemaless
a. Log file (e.g. uuid|val1|val2|val3|...)
2. Structured without schema
a. JSON (e.g. {"key1": "val1", ...})
3. Structured with schema
a. Avro (e.g. {"key1": "val1", ...}, but with schema)
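The three categories can be compared on one record. This sketch uses only the Python standard library. The `SCHEMA` dictionary and the `conforms` function are a hypothetical stand-in for a real schema system such as Avro, not Avro itself, and the field names and values are made up.

```python
import json

record = {"uuid": "123e4567", "key1": "val1", "key2": "val2"}

# 1. Structureless and schemaless: meaning depends only on field position.
log_line = "|".join([record["uuid"], record["key1"], record["key2"]])

# 2. Structured without schema: keys are named, but nothing enforces them.
json_line = json.dumps(record)

# 3. Structured with schema: records are checked against a declared schema.
# Hypothetical stand-in for an Avro schema: field name -> expected Python type.
SCHEMA = {"uuid": str, "key1": str, "key2": str}


def conforms(candidate, schema):
    """Return True if the candidate has exactly the declared fields and types."""
    return candidate.keys() == schema.keys() and all(
        isinstance(candidate[name], expected) for name, expected in schema.items()
    )


valid = conforms(record, SCHEMA)
missing_fields = conforms({"uuid": "123e4567"}, SCHEMA)
```

Only the third form can reject a malformed record at ingest time. The first two accept anything and defer the problem to whoever reads the data.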
Issues with structure
• Verbosity is directly related to human readability
• Verbosity impacts the performance of systems
• A verbose and readable RPC format: XML, YAML, JSON, etc.
• A not-so-verbose and not-so-readable RPC format: MessagePack, Protobuf, Avro binary, etc.
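The readability versus size trade-off can be illustrated by encoding one record both ways. In this sketch Python's `struct` module stands in for a compact binary encoding. It is not MessagePack, Protobuf, or Avro, and the record and its field layout are assumptions made for the example.

```python
import json
import struct

event = {"user_id": 42, "clicks": 7, "price": 1.5}  # hypothetical record

# Verbose and readable: field names travel with every record.
verbose = json.dumps(event).encode("utf-8")

# Compact and unreadable: "<IHd" means little-endian unsigned int, unsigned
# short, double. Field order and types must be agreed on out of band.
LAYOUT = "<IHd"
compact = struct.pack(LAYOUT, event["user_id"], event["clicks"], event["price"])

# Decoding requires knowing the same layout.
decoded = struct.unpack(LAYOUT, compact)
size_ratio = len(verbose) / len(compact)
```

The binary form is a fraction of the JSON size, but a reader who does not have the layout cannot interpret it, which is the trade-off the slide describes.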
Issues with schema
• Flexibility and structure are inversely proportional.
• A flexible schema
– Doesn’t require an upfront definition
– Easy to extend, but difficult to track changes
– May have nested structures
– e.g. uuid|val1|val2|val3|...
• A structured schema
– Fully describes all data in detail
– Is more logically ordered
Lesson #6: Record lineage
• Where is the data coming from?
• How has it changed as it enters the system?
• Snapshots?
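One lightweight way to answer the first two questions is to carry lineage metadata in an envelope around each record. The sketch below is only an illustration: the envelope layout, the `with_lineage` and `apply_step` functions, and the "click-logs" source name are hypothetical.

```python
from datetime import datetime, timezone


def with_lineage(payload, source):
    """Wrap a payload in an envelope that records where it came from."""
    return {
        "payload": payload,
        "lineage": {
            "source": source,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "steps": [],
        },
    }


def apply_step(record, name, transform):
    """Apply `transform` to the payload and append the step name to the lineage."""
    lineage = record["lineage"]
    return {
        "payload": transform(record["payload"]),
        "lineage": {**lineage, "steps": lineage["steps"] + [name]},
    }


raw = with_lineage({"clicks": "7"}, source="click-logs")
typed = apply_step(
    raw,
    "cast_clicks_to_int",
    lambda payload: {**payload, "clicks": int(payload["clicks"])},
)
```

Each step returns a new envelope rather than mutating the old one, so the raw record remains available as a snapshot of what entered the system.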
Summary of lessons
1. Structure endpoints wisely
2. At least once semantics is easy and acceptable
3. CDC is hard
4. Know thy SLA
5. Record format and schema should be thought through
6. Record lineage (provenance)
Extra ideas
1. Keep track of erroneous records
a. Anomalies lead to more knowledge about data source
b. Improves debugging
2. Keep transformations to a minimum
a. Schema inference makes sense
b. Massive computations can slow down the ingest process and cause back pressure in the pipeline
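Keeping track of erroneous records can be as simple as routing them to a separate list with the reason for rejection, instead of dropping them. This is a minimal sketch assuming the pipe-delimited log format used earlier in the deck. The `ingest` function, the expected field count, and the sample lines are hypothetical.

```python
def ingest(lines, expected_fields=4):
    """Split pipe-delimited lines into parsed records and rejected records."""
    good, rejected = [], []
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("|")
        if len(fields) != expected_fields:
            reason = f"expected {expected_fields} fields, got {len(fields)}"
        elif not fields[0]:
            reason = "missing uuid"
        else:
            # Only a cheap structural check here: heavy transformations are
            # left to later stages so ingest does not cause back pressure.
            good.append(fields)
            continue
        # Keep the raw line and the reason so anomalies can be studied later.
        rejected.append({"line": line_number, "raw": line, "reason": reason})
    return good, rejected


lines = ["u1|a|b|c", "u2|a|b", "|a|b|c", "u3|x|y|z"]
good, rejected = ingest(lines)
```

The rejected list preserves the original text and line number, which helps with debugging and often reveals something about the data source.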
Check out http://ingest.tips for general ingest
Thank you
Licensing
Public Domain
1. https://commons.wikimedia.org/wiki/File:West_Texas_Pumpjack.JPG
2. https://commons.wikimedia.org/wiki/File%3ABulls_Ishikawa%2C_Okinawa_2007.jpg
3. https://commons.wikimedia.org/wiki/File:Hammer_Ace_SATCOM_Antenna.jpg
4. https://commons.wikimedia.org/wiki/File:Shanghai_Shimao_Plaza_Construction.jpg
5. https://pixabay.com/p-111058/?no_redirect
6. https://pixabay.com/p-70908/?no_redirect
7. https://commons.wikimedia.org/wiki/File:The_Sun_by_the_Atmospheric_Imaging_Assembly_of_NASA's_Solar_Dynamics_Observatory_-_20100819.jpg
8. http://www.freestockphotos.biz/stockphoto/16694
9. https://pixabay.com/en/github-logo-favicon-mascot-button-154769/
Creative Commons V3
10. https://commons.wikimedia.org/wiki/File:Star-schema.png
Creative Commons V2
11. https://www.flickr.com/photos/the_pink_princess/370896536/
12. https://www.flickr.com/photos/digitaljourney/5424241457

Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by Abraham Elmahrek of Cloudera

  • 1. Lessons Learned Designing Data Ingest Systems Abraham Elmahrek (abe@apache.org)
  • 2. 1. Overview of Big Data Ingest 2. Real world examples with lessons interleaved 3. A summary of lessons learned and extra ideas Agenda
  • 3. Big Data Ingest Ingesting from different data sources is the goal Several data sources have different structures, but schemas vary mostly Batch and Real Time ingest both have their places Data sources Schema Speed
  • 4. Data sources Relational databases, spreadsheets, object databases XML, JSON, EDI, etc. Audio, video, email, etc. Structured Semi-structured Unstructured
  • 5. Schema One schema with a relatively flat structure or many schemas with nested structures. Immutable schemas can’t be changed. Mutable schemas can evolve. Nested schemas can also have mutability properties. Number of schemas Mutability Inference Schema inference upon writing, reading, or offline.
  • 6. Real Time vs Batch Push data from A -> B on demand. Push data from A -> B consistently. Poll on data sources or act upon reception. Batch Push model Pull model Clients pull data from A to write to B. Often times an intermediate storage system like Kafka is used to achieve this.
  • 7. • GOAL: Generate different forms for websites • Store user information • Forms cannot change over time Real world scenario: Form generator
  • 8. Lesson #1: Structure endpoint wisely Form Definition id form name form metadata Form 1 id <field 1> <field 2> <field 3> Form 2 id <field 1> <field 2> <field 3> Form Definition id form name Field Definition id form id field name type Field Values id field id value
  • 9. • GOAL: Generate list of active contributors on a repository and general stats about a repository relative to all other repositories. • Scheduled batch Change Data Capture (CDC). Real world scenario: Scrape github
  • 11. • Ingesting data twice doesn’t matter in a lot of cases. • The cost of re-processing or re-ingesting a few records is normally pretty low. • It’s easy to manage and implement. • Exactly once semantics, in contrast, is not feasible – Usually requires some de-duping Lesson #2: At least once is acceptable
  • 14. Lesson #3: CDC is hard • Change Data Capture (CDC) without a change log or an easy way to calculate differences is hard. • Almost always requires some customized effort.
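When a source offers no change log, one customized approach is to compare two full snapshots. The sketch below assumes each snapshot fits in memory as a dict from a stable key to a row, which is rarely true at scale; `fingerprint` and `diff_snapshots` are hypothetical helper names.

```python
import hashlib
import json


def fingerprint(row):
    """Hash a row's canonical JSON form so any field change alters the digest."""
    canonical = json.dumps(row, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def diff_snapshots(old, new):
    """Return (inserted, updated, deleted) keys between two key -> row snapshots."""
    inserted = sorted(k for k in new if k not in old)
    deleted = sorted(k for k in old if k not in new)
    updated = sorted(
        k for k in new if k in old and fingerprint(old[k]) != fingerprint(new[k])
    )
    return inserted, updated, deleted


yesterday = {"repo1": {"stars": 10}, "repo2": {"stars": 5}}
today = {"repo1": {"stars": 11}, "repo3": {"stars": 1}}
inserted, updated, deleted = diff_snapshots(yesterday, today)
```

Even this small version has to read both snapshots in full on every run, which hints at why CDC without a change log tends to need custom work.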
  • 15. Real world scenario: Ad attribution system • GOAL: Gather impression and click information. Attribute to different vendors based on impressions and clicks. • Expose a view for customers to understand their usage. • Near real time (NRT) with batch error checking.
  • 16. Lesson #4: Know thy SLA • What is the incidence of errors? • How frequently should errors be checked? • Is data loss acceptable? • Is duplication acceptable?
  • 18. Push version analysis • Negatives – Scribe would lose data in some edge cases. That’s not good for attribution systems (money involved). – The volume of messages written to HBase caused major compactions on a weekly basis, halting the pipeline. • Positives – Latency was very low – Relatively easy to maintain given the Scribe configuration * Flume would have been a better choice!
  • 19. Pull version [diagram; components shown: Click Logs, Impression Logs, VIP, two Producers, RabbitMQ, two Consumers, HBase, MySQL]
  • 20. Pull version analysis • Negatives – Requires more management and configuration. • Positives – Can choose between at-most-once semantics (possible data loss) and at-least-once semantics (possible duplication). – Intermediate storage relieves HBase. * Kafka would have been a cool choice!
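With a queue in the middle, the choice between at-most-once and at-least-once often comes down to when a consumer acknowledges a message. The sketch below simulates that with an in-memory `deque` instead of a real broker; `consume`, `make_flaky_handler`, and the `ack_before_processing` flag are hypothetical names, not RabbitMQ or Kafka APIs.

```python
from collections import deque


def make_flaky_handler(fail_once_on):
    """Build a handler that raises the first time it sees one given message."""
    already_failed = set()

    def handler(message):
        if message == fail_once_on and message not in already_failed:
            already_failed.add(message)
            raise RuntimeError("simulated consumer crash")

    return handler


def consume(queue, handler, ack_before_processing):
    """Drain the queue once and return the messages handled successfully.

    Acknowledging before processing drops a message whose handler fails
    (at most once). Acknowledging after processing puts the failed message
    back on the queue for a later pass (at least once).
    """
    handled = []
    redeliver = deque()
    while queue:
        message = queue.popleft()
        try:
            handler(message)
            handled.append(message)
        except RuntimeError:
            if not ack_before_processing:
                redeliver.append(message)
    queue.extend(redeliver)
    return handled


# At most once: "b" fails and is gone.
lossy_queue = deque(["a", "b", "c"])
lossy_result = consume(lossy_queue, make_flaky_handler("b"), ack_before_processing=True)

# At least once: "b" fails, is redelivered, and succeeds on the second pass.
safe_queue = deque(["a", "b", "c"])
safe_handler = make_flaky_handler("b")
first_pass = consume(safe_queue, safe_handler, ack_before_processing=False)
second_pass = consume(safe_queue, safe_handler, ack_before_processing=False)
```

In a real at-least-once setup a handler can also fail after its side effect has been written, which is where the duplicates come from; this simulation only shows the loss-versus-redelivery half of the trade-off.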
  • 21. Lesson #5: Record format and schema 1. Structureless (or simple structure) and schemaless a. Log file (e.g. uuid|val1|val2|val3|...) 2. Structured without schema a. JSON (e.g. {"key1": "val1", ...}) 3. Structured with schema a. Avro (e.g. {"key1": "val1", ...}, but with schema)
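The three categories can be contrasted in a few lines. This is a sketch under assumptions: the sample values are made up, and the `SCHEMA` dict with the `conforms` helper is only a hand-written stand-in for what an Avro schema provides, not the Avro API.

```python
import json

# 1. Structureless and schemaless: meaning depends on position alone.
line = "123e4567|val1|val2"
record_id, *values = line.split("|")

# 2. Structured without schema: keys are named, but nothing constrains them.
doc = json.loads('{"key1": "val1", "key2": 2}')

# 3. Structured with schema: a declared contract the record must satisfy.
SCHEMA = {"key1": str, "key2": int}  # hypothetical stand-in for an Avro schema


def conforms(record, schema):
    """Check that the record has exactly the declared keys with the declared types."""
    return set(record) == set(schema) and all(
        isinstance(record[key], expected) for key, expected in schema.items()
    )


valid = conforms(doc, SCHEMA)
missing_key = conforms({"key1": "val1"}, SCHEMA)
```

Only the third form can reject a malformed record at ingest time; the first two accept anything that parses.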
  • 22. Issues with structure • Verbosity is directly related to human readability • Verbosity impacts the performance of systems • A verbose and readable RPC format: XML, YAML, JSON, etc. • A not-so-verbose and not-so-readable RPC format: MessagePack, Protobuf, Avro binary, etc.
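The size cost of a readable encoding is easy to see by encoding one record two ways. The sketch below uses Python's `struct` module as a stand-in for a compact binary format; the record fields and the fixed layout (two unsigned 32-bit integers and one 64-bit float, little-endian) are assumptions for illustration, not how MessagePack, Protobuf, or Avro lay out bytes.

```python
import json
import struct

record = {"user_id": 42, "clicks": 7, "ratio": 0.25}

# Readable: field names travel with every record.
as_json = json.dumps(record).encode("utf-8")

# Compact: the layout "<IId" is agreed out of band, so only values are sent.
as_binary = struct.pack("<IId", record["user_id"], record["clicks"], record["ratio"])

# Decoding the binary form requires knowing the same layout.
decoded = struct.unpack("<IId", as_binary)
```

The binary form is smaller but cannot be read, or even parsed, without the layout, which is the readability trade-off the slide describes.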
  • 23. Issues with schema • Flexibility and structure are inversely proportional. • A flexible schema – Doesn’t require an upfront definition – Is easy to extend, but makes changes difficult to track – May have nested structures – e.g. uuid|val1|val2|val3|... • A structured schema – Fully describes all data in detail – Is more logically ordered
  • 24. Lesson #6: Record lineage • Where is the data coming from? • How has it changed as it enters the system? • Snapshots?
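One lightweight way to answer the first two questions is to carry provenance alongside each record. The sketch below is an assumption-laden illustration: the `Tracked` class, the `apply_step` helper, and the sample source name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Tracked:
    """A payload plus where it came from and which steps have touched it."""
    payload: dict
    source: str
    steps: list = field(default_factory=list)


def apply_step(tracked, name, transform):
    """Return a new Tracked with the transform applied and its name recorded."""
    return Tracked(transform(tracked.payload), tracked.source, tracked.steps + [name])


raw = Tracked({"Email": " A@Example.com "}, source="click_logs")
cleaned = apply_step(
    raw,
    "normalize_email",
    lambda payload: {"email": payload["Email"].strip().lower()},
)
```

Because `apply_step` returns a new object, the untouched `raw` record doubles as a snapshot of the data before the transformation.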
  • 25. Summary of lessons 1. Structure endpoints wisely 2. At least once semantics is easy and acceptable 3. CDC is hard 4. Know thy SLA 5. Record format and schema should be thought through 6. Record lineage (provenance)
  • 26. Extra ideas 1. Keep track of erroneous records a. Anomalies lead to more knowledge about the data source b. Improves debugging 2. Keep transformations to a minimum a. Schema inference makes sense b. Massive computations can slow down the ingest process and cause back pressure in the pipeline
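Keeping track of erroneous records can be as simple as routing them to a side list instead of dropping them. The sketch below assumes pipe-delimited input with a fixed field count; `ingest_lines` and the shape of the rejection entries are hypothetical.

```python
def ingest_lines(lines, expected_fields):
    """Split lines into parsed records and rejected entries kept for inspection."""
    good, rejected = [], []
    for line_number, line in enumerate(lines, start=1):
        parts = line.rstrip("\n").split("|")
        if len(parts) == expected_fields and all(parts):
            good.append(parts)
        else:
            rejected.append(
                {
                    "line": line_number,
                    "raw": line,
                    "reason": f"expected {expected_fields} non-empty fields",
                }
            )
    return good, rejected


sample = ["u1|click|vendorA", "u2|impression", "u3||vendorB"]
good, rejected = ingest_lines(sample, expected_fields=3)
```

The rejected entries keep the raw line and its position, so anomalies can be traced back to the source later, and the only transformation on the ingest path is a split.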