2. Agenda
1. Overview of Big Data Ingest
2. Real-world examples with lessons interleaved
3. A summary of lessons learned and extra ideas
3. Big Data Ingest
Data sources: Ingesting from many different data sources is the goal.
Schema: Data sources have different structures, and their schemas vary widely.
Speed: Batch and real-time ingest both have their places.
5. Schema
Number of schemas: One schema with a relatively flat structure, or many schemas with nested structures.
Mutability: Immutable schemas can't be changed; mutable schemas can evolve. Nested schemas can also have mutability properties.
Inference: Schema inference can happen upon writing, upon reading, or offline.
6. Real Time vs Batch
Batch: Push data from A -> B on demand.
Push model: Push data from A -> B continuously; either poll data sources or act upon reception.
Pull model: Clients pull data from A to write to B. Often an intermediate storage system like Kafka is used to achieve this.
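The pull model above can be sketched with an in-process queue standing in for the intermediate store (a hypothetical simplification in Python; a real deployment would use a durable broker such as Kafka, and the function names here are illustrative):

```python
import queue

# In-process stand-in for an intermediate store like Kafka.
buffer = queue.Queue()

def produce(records):
    """Source A pushes records into the intermediate store."""
    for r in records:
        buffer.put(r)

def consume(batch_size):
    """A client pulls up to batch_size records to write to sink B."""
    batch = []
    while len(batch) < batch_size and not buffer.empty():
        batch.append(buffer.get())
    return batch

produce(["r1", "r2", "r3"])
print(consume(2))  # -> ['r1', 'r2']
```

The intermediate store decouples the source's write rate from the sink's read rate, which is the property the pull model buys.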
7. • GOAL: Generate different forms for
websites
• Store user information
• Forms cannot change over time
Real world scenario: Form generator
8. Lesson #1: Structure endpoints wisely
Per-form tables:
  Form Definition: id, form name, form metadata
  Form 1: id, <field 1>, <field 2>, <field 3>
  Form 2: id, <field 1>, <field 2>, <field 3>
Normalized tables:
  Form Definition: id, form name
  Field Definition: id, form id, field name, type
  Field Values: id, field id, value
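The normalized layout above (Form Definition / Field Definition / Field Values) can be sketched as relational tables; a minimal Python sketch using sqlite3, with illustrative table and column names:

```python
import sqlite3

# Normalized form storage: fields and values are rows, not columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE form_definition (id INTEGER PRIMARY KEY, form_name TEXT);
CREATE TABLE field_definition (id INTEGER PRIMARY KEY, form_id INTEGER,
                               field_name TEXT, type TEXT);
CREATE TABLE field_values (id INTEGER PRIMARY KEY, field_id INTEGER, value TEXT);
""")

# Adding a new field to a form is just an INSERT, not an ALTER TABLE
# on a per-form table.
conn.execute("INSERT INTO form_definition (id, form_name) VALUES (1, 'signup')")
conn.execute("INSERT INTO field_definition (id, form_id, field_name, type) "
             "VALUES (1, 1, 'email', 'string')")
conn.execute("INSERT INTO field_values (field_id, value) VALUES (1, 'a@b.com')")

row = conn.execute("""
    SELECT f.form_name, d.field_name, v.value
    FROM field_values v
    JOIN field_definition d ON v.field_id = d.id
    JOIN form_definition f ON d.form_id = f.id
""").fetchone()
print(row)  # -> ('signup', 'email', 'a@b.com')
```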
9. • GOAL: Generate a list of active contributors on a repository and
general stats about a repository relative to all other repositories.
• Scheduled batch Change Data Capture (CDC).
Real world scenario: Scrape GitHub
11. • Ingesting data twice doesn’t matter in a lot of cases.
• The cost of re-processing or re-ingesting a few records is
normally pretty low.
• It’s easy to manage and implement.
• Exactly-once semantics, in contrast, is not feasible
– it usually requires some de-duplication anyway
Lesson #2: At least once is acceptable
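A minimal sketch of why at-least-once is manageable: make the sink idempotent, so a redelivered record is harmless. Python sketch with an in-memory seen-ID set standing in for a persistent store or idempotent sink keys (names illustrative):

```python
# At-least-once delivery: the consumer may see a record twice,
# so downstream writes are de-duplicated by record ID.
seen_ids = set()
sink = []

def ingest(record):
    """Process a record at least once; duplicates are dropped by ID."""
    if record["id"] in seen_ids:
        return False  # redelivery, already ingested
    seen_ids.add(record["id"])
    sink.append(record)
    return True

# The same record delivered twice is only written once.
ingest({"id": 1, "value": "a"})
ingest({"id": 1, "value": "a"})  # redelivery
print(len(sink))  # -> 1
```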
14. • Change Data Capture (CDC) without a change log or an easy
way to calculate differences is hard.
• Almost always requires some customized effort.
Lesson #3: CDC is hard
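When there is no change log, CDC often degrades to diffing periodic snapshots, which is part of the "customized effort" above. A minimal Python sketch (function name and snapshot shape are illustrative):

```python
# Compare two {key: row} snapshots and emit change events.
def snapshot_diff(old, new):
    changes = []
    for key, row in new.items():
        if key not in old:
            changes.append(("insert", key, row))
        elif old[key] != row:
            changes.append(("update", key, row))
    for key in old:
        if key not in new:
            changes.append(("delete", key, old[key]))
    return changes

old = {1: "alice", 2: "bob"}
new = {1: "alice", 2: "bobby", 3: "carol"}
print(snapshot_diff(old, new))
# -> [('update', 2, 'bobby'), ('insert', 3, 'carol')]
```

Note the cost: every diff reads both snapshots in full, which is why CDC without a change log scales poorly.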
15. • GOAL: Gather impressions and click information. Attribute to
different vendors based on impressions and clicks.
• Expose a view for customers to understand their usage.
• NRT (near real time) ingest with batch error checking.
Real world scenario: Ad attribution system
16. • What is the incidence of errors?
• How frequently should errors be checked?
• Is data loss acceptable?
• Is duplication acceptable?
Lesson #4: Know thy SLA
18. Push version analysis
• Negatives
– Scribe would lose data in some edge cases. That's not good for
attribution systems (money is involved).
– The volume of messages being written to HBase would cause major
compactions on a weekly basis, halting the pipeline.
• Positives
– Latency was super low
– Relatively easy to maintain given scribe configuration
* Flume would have been a better choice!
20. Pull version analysis
• Negatives
– Requires more management and configuration.
• Positives
– Can choose between at-most-once (possible data loss) and at-least-once (possible duplicates) semantics.
– Intermediate storage relieves HBase.
* Kafka would have been a cool choice!
21. 1. Structureless (or simple structure) and schemaless
a. Log file (e.g. uuid|val1|val2|val3|...)
2. Structured without schema
a. JSON (e.g. {"key1": "val1", ...})
3. Structured with schema
a. Avro (e.g. {"key1": "val1", ...}, but with a schema)
Lesson #5: Record format and schema
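The three categories can be illustrated with one record in Python. The Avro schema is shown here as a plain dict for readability; a real encoder such as fastavro would pair it with a compact binary payload:

```python
import json

record = {"uuid": "abc", "val1": 1, "val2": 2}

# 1. Structureless: positional fields, no keys, no schema -- cheap but brittle.
log_line = "|".join(str(v) for v in record.values())

# 2. Structured without schema: self-describing keys, no enforced types.
json_line = json.dumps(record)

# 3. Structured with schema: the same payload plus an explicit contract.
avro_schema = {
    "type": "record", "name": "Event",
    "fields": [{"name": "uuid", "type": "string"},
               {"name": "val1", "type": "int"},
               {"name": "val2", "type": "int"}],
}

print(log_line)   # -> abc|1|2
print(json_line)
```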
22. • Verbosity is directly related to human readability
• Verbosity impacts performance of systems
• A verbose and readable RPC: XML, YAML, JSON, etc.
• A not-so-verbose and not-so-readable RPC: MessagePack,
Protobuf, Avro binary, etc.
Issues with structure
23. • Flexibility and structure are inversely proportional.
• A flexible schema
– Doesn’t require an upfront definition
– Easy to extend, but difficult to track changes
– May have nested structures
– e.g. uuid|val1|val2|val3|...
• A structured schema
– Fully describes all data in detail
– Is more logically ordered
Issues with schema
24. • Where is the data coming from?
• How has it changed as it enters the system?
• Snapshots?
Lesson #6: Record lineage
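The questions above can be answered by carrying provenance metadata alongside each payload as it moves through the pipeline. A Python sketch (the field names are illustrative, not a standard):

```python
import time

def with_lineage(payload, source):
    """Wrap a payload with where it came from and when it arrived."""
    return {
        "payload": payload,
        "lineage": {"source": source, "ingested_at": time.time(), "transforms": []},
    }

def apply_transform(record, name, fn):
    """Apply a transformation and record it in the lineage trail."""
    record["payload"] = fn(record["payload"])
    record["lineage"]["transforms"].append(name)
    return record

rec = with_lineage({"name": "Ada "}, source="forms-api")
rec = apply_transform(rec, "strip_whitespace",
                      lambda p: {k: v.strip() for k, v in p.items()})
print(rec["payload"], rec["lineage"]["transforms"])
# -> {'name': 'Ada'} ['strip_whitespace']
```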
25. 1. Structure endpoints wisely
2. At least once semantics is easy and acceptable
3. CDC is hard
4. Know thy SLA
5. Record format and schema should be thought through
6. Record lineage (provenance)
Summary of lessons
26. 1. Keep track of erroneous records
a. Anomalies lead to more knowledge about the data source
b. Improves debugging
2. Keep transformations to a minimum
a. Schema inference makes sense
b. Massive computations can slow down the ingest process and cause
back pressure in the pipeline
Extra ideas
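Idea #1 can be sketched as a dead-letter pattern: records that fail parsing are routed to an error store instead of being silently dropped, so anomalies can be inspected later. Python sketch (names illustrative):

```python
good, errors = [], []

def ingest_line(line):
    """Parse a pipe-delimited line; shunt malformed input to the error list."""
    parts = line.split("|")
    if len(parts) != 3:
        errors.append({"raw": line, "reason": "expected 3 fields"})
        return
    good.append(parts)

for line in ["a|1|2", "corrupt", "b|3|4"]:
    ingest_line(line)

print(len(good), len(errors))  # -> 2 1
```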