The presentation shows how we started doing Big Data at Ocado, what obstacles we hit, and how we tried to fix them later. You'll see how to deal with data sources, or, most importantly, how not to deal with them.
1. The Big Bad Data
or how to deal with data sources
Przemysław Pastuszka, Kraków, 17.10.2014
2. What do I want to talk about?
● Quick introduction to Ocado
● First approach to Big Data and why we ended up with Bad Data
● Making things better - Unified data architecture
● Live demo (if there is time left)
3. Ocado intro
Ocado is the world's largest online-only grocery retailer, reaching over 70% of British households, shipping over 150,000 orders a week or 1.1M items a day.
6. How did we start?
[Architecture diagram: Ocado services (Oracle, Greenplum, JMS) export raw data to Google Cloud Storage; a compute cluster, run by a Cluster Manager, transforms it into ORC files; users query the results through Google BigQuery.]
7. Looks good. So what’s the problem?
● Various data formats
○ JSON, CSV, uncompressed, gzip, blob, nested…
○ incremental data, “snapshot” data, deltas
○ lots of code to handle all corner cases (see the sketch after this list)
● Corrupted data
○ corrupted gzips, empty files, invalid content, unexpected schema changes…
○ failures of overnight jobs
● Data exports delayed
○ DB downtime, network issues, human error, …
○ data available in the Big Data platform even later
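To give a flavour of what “lots of code to handle all corner cases” means in practice, here is a minimal defensive-ingest sketch in Python. It is an illustration, not Ocado's actual code; the quarantine helper is hypothetical, but the failure modes (corrupted gzips, empty files, invalid content) are exactly the ones listed on this slide.

import gzip
import json
import zlib

def quarantine(path, reason):
    # Hypothetical helper: move the file aside and raise a monitoring alert.
    print(f"ALERT: {path} quarantined ({reason})")

def read_events(path):
    """Best-effort read of one gzipped JSON-lines export; returns parsed events."""
    events = []
    try:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:                      # tolerate blank lines in valid files
                    continue
                events.append(json.loads(line))
    except (OSError, EOFError, zlib.error):       # corrupted or truncated gzip
        quarantine(path, reason="corrupted archive")
        return []
    except json.JSONDecodeError:                  # invalid content
        quarantine(path, reason="invalid JSON")
        return []
    if not events:                                # empty file: alert, don't fail silently
        quarantine(path, reason="empty export")
    return events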
9. That’s not all...
● Real-time analytics?
○ data comes in batches every night
○ so forget it
● ORC is not a dream solution
○ not a very friendly format
○ overnight transform is another point of failure
○ data duplication (raw + ORC)
● People think you “own” the data
○ “please, tell me what this data means?”
10. People get frustrated
● Big Data team is frustrated
○ we spend lots of time on monitoring and fixing bugs
○ code becomes complex to handle corner cases
○ confidence in platform stability rapidly goes down
● Analysts are frustrated
○ long latency before data is available for querying
○ data is unreliable
11. It can’t go on like this anymore
Let’s go back to the board!
12. What do we need?
● Unified input data format
○ JSON to the rescue
● Data goes directly from applications to BD platform
○ let’s make all external services write to a distributed queue
● Data validation and monitoring
○ all data coming into the system should be well-described
○ validation should happen as early as possible (see the sketch after this list)
○ alerts on corrupted / missing data must be raised early
● Data must be ready for querying early
○ let’s push data to BigQuery first
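A minimal sketch of what these requirements could look like on the producer side, assuming Python with boto3 (Kinesis appears as the queue on the following slides) and the jsonschema library; the event schema, stream name and field names are hypothetical.

import json
import boto3                      # AWS SDK; Kinesis is the distributed queue used later
from jsonschema import validate   # schema check before the event ever leaves the service

ORDER_EVENT_SCHEMA = {            # hypothetical event type: "order_shipped"
    "type": "object",
    "required": ["event_type", "timestamp", "order_id"],
    "properties": {
        "event_type": {"const": "order_shipped"},
        "timestamp": {"type": "string"},
        "order_id": {"type": "integer"},
    },
}

kinesis = boto3.client("kinesis")

def publish(event: dict) -> None:
    """Validate as early as possible: reject bad events at the source service."""
    validate(instance=event, schema=ORDER_EVENT_SCHEMA)  # raises ValidationError
    kinesis.put_record(
        StreamName="events",                  # hypothetical stream name
        Data=json.dumps(event).encode(),      # unified input format: JSON
        PartitionKey=str(event["order_id"]),
    )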
13. New architecture overview
[Architecture diagram: an input stream feeds several Event Processors, which consult the Event Registry; results land in data storage and are served through endpoints; the processors run on a compute cloud coordinated by a Cluster Manager.]
14. Loading data
[Diagram: the raw events stream arrives on the input stream (Kinesis). The Event Processor validates each event against its event type descriptor (schema + processing instructions) from the Event Registry. The validated events stream lands in BQ, with ad-hoc / scheduled exports to Google Cloud Storage; invalid events go to a store location in BQ, from which they can be replayed ad hoc. Consumers query via the BQ Excel Connector, BQ Tableau Connector and BQ REST API, or read files via the GS REST API and gsutil.]
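To make the event type descriptor concrete, here is a hedged sketch of how an Event Processor might apply one; the descriptor fields and the three sink helpers are guesses based on the diagram, not the real registry API.

from jsonschema import ValidationError, validate

# Hypothetical shape of one Event Registry entry: a schema plus processing instructions.
DESCRIPTOR = {
    "event_type": "order_shipped",
    "schema": {"type": "object", "required": ["order_id", "timestamp"]},
    "processing": {"bq_table": "events.order_shipped", "gs_prefix": "gs://events/raw/"},
}

def process(raw_event: dict, descriptor: dict) -> None:
    """Route each event: valid ones to BQ/GCS, invalid ones to the replayable store."""
    try:
        validate(instance=raw_event, schema=descriptor["schema"])
    except ValidationError as err:
        store_invalid(raw_event, reason=str(err))   # kept in BQ for ad-hoc replay
        return
    load_to_bigquery(raw_event, descriptor["processing"]["bq_table"])
    archive_to_gcs(raw_event, descriptor["processing"]["gs_prefix"])

def store_invalid(event, reason): ...   # hypothetical sinks, stubbed for brevity
def load_to_bigquery(event, table): ...
def archive_to_gcs(event, prefix): ...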
15. Batch processing
[Diagram: BQ query results are exported to Google Cloud Storage, where Compute Clusters A, B and C in the compute cloud read them via the GS REST API and gsutil; analysts keep querying BQ through the BQ Excel Connector, BQ Tableau Connector and BQ REST API.]
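The query-then-export step in the diagram, sketched with today's google-cloud-bigquery client (the 2014-era tooling differed); project, dataset, table and bucket names are placeholders.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")             # placeholder project

# 1. Materialise a query result into a table.
job_config = bigquery.QueryJobConfig(
    destination="my-project.analytics.daily_orders"        # placeholder table
)
client.query("SELECT * FROM `my-project.events.orders`",   # placeholder query
             job_config=job_config).result()

# 2. Export that table to Google Cloud Storage for the compute clusters to read.
client.extract_table(
    "my-project.analytics.daily_orders",
    "gs://my-bucket/daily_orders/*.json",                  # sharded export
    job_config=bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
    ),
).result()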
16. Real-time processing
[Diagram: the input stream (Kinesis) feeds Event Processors on Cluster A and Cluster B, which validate events against event type descriptors (schema + processing instructions) from the Event Registry. Validated and raw events streams flow through an event queue into a BQ Sink and a GS Sink, leaving processed data ready for consumption by other …; access is as before, via the BQ Excel Connector, BQ Tableau Connector, BQ REST API, GS REST API and gsutil.]
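Finally, a rough sketch of the consumer loop one such cluster could run. A production consumer would use the Kinesis Client Library with checkpointing across all shards; this single-shard loop only illustrates the read-and-fan-out cycle (stream name and sink stubs are placeholders).

import json
import time
import boto3

kinesis = boto3.client("kinesis")

def consume(stream: str = "events") -> None:
    """Poll one shard and fan events out to the BQ and GS sinks."""
    shard = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]
    it = kinesis.get_shard_iterator(
        StreamName=stream, ShardId=shard["ShardId"], ShardIteratorType="LATEST"
    )["ShardIterator"]
    while True:
        out = kinesis.get_records(ShardIterator=it, Limit=100)
        for record in out["Records"]:
            event = json.loads(record["Data"])
            # Validation against the registry descriptor would happen here
            # (see the Event Processor sketch after slide 14).
            bq_sink(event)   # near-real-time queries in BigQuery
            gs_sink(event)   # durable archive in Google Cloud Storage
        it = out["NextShardIterator"]
        time.sleep(1)        # stay under the per-shard read limits

def bq_sink(event): ...      # hypothetical sinks, stubbed for brevity
def gs_sink(event): ...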