A recap on the Unified Log "manifesto" for new ULPers, with my regular presentation on "Why your company needs a Unified Log". Given at Unified Log London in May 2015
Unified Log London (May 2015) - Why your company needs a unified log
1. Why your company needs a
Unified Log
Unified Log London, 20th May 2015
2. Introducing myself
• Alex Dean
• Co-founder and technical lead at Snowplow,
the open-source event analytics platform
based here in London [1]
• Weekend writer of Unified Log Processing,
available on the Manning Early Access Program
[2]
[1] https://github.com/snowplow/snowplow
[2] http://manning.com/dean
4. A quick history lesson: the three eras of business data
processing [1]
1. The classic era, 1996+
2. The hybrid era, 2005+
3. The unified era, 2013+
[1] http://snowplowanalytics.com/blog/
2014/01/20/the-three-eras-of-business-data-processing/
5. The classic era of business data processing, 1996+
OWN DATA CENTER
Data warehouse
HIGH LATENCY
Point-to-point
connections
WIDE DATA
COVERAGE
CMS
Silo
CRM
Local loop Local loop
NARROW DATA SILOES LOW LATENCY LOCAL LOOPS
E-comm
Silo
Local loop
Management
reporting
ERP
Silo
Local loop
Silo
Nightly batch ETL process
FULL DATA
HISTORY
6. The hybrid era, 2005+
CLOUD VENDOR / OWN DATA CENTER
Search
Silo
Local loop
LOW LATENCY LOCAL LOOPS
E-comm
Silo
Local loop
CRM
Local loop
SAAS VENDOR #2
Email
marketing
Local loop
ERP
Silo
Local loop
CMS
Silo
Local loop
SAAS VENDOR #1
NARROW DATA SILOES
Stream
processing
Product
rec’s
Micro-batch
processing
Systems
monitoring
Batch
processing
Data
warehouse
Management
reporting
Batch
processing
Ad hoc
analytics
Hadoop
SAAS VENDOR #3
Web
analytics
Local loop
Local loop Local loop
LOW LATENCY LOW LATENCY
HIGH LATENCY HIGH LATENCY
APIs
Bulk exports
7. The hybrid era: a surfeit of software vendors
CLOUD VENDOR / OWN DATA CENTER
Search
Silo
Local loop
LOW LATENCY LOCAL LOOPS
E-comm
Silo
Local loop
CRM
Local loop
SAAS VENDOR #2
Email
marketing
Local loop
ERP
Silo
Local loop
CMS
Silo
Local loop
SAAS VENDOR #1
NARROW DATA SILOES
Stream
processing
Product
rec’s
Micro-batch
processing
Systems
monitoring
Batch
processing
Data
warehouse
Management
reporting
Batch
processing
Ad hoc
analytics
Hadoop
SAAS VENDOR #3
Web
analytics
Local loop
Local loop Local loop
LOW LATENCY LOW LATENCY
HIGH LATENCY HIGH LATENCY
APIs
Bulk exports
8. The hybrid era: company-wide reporting and
analytics ends up like Rashomon
The bandit’s story
vs.
The wife’s story
vs.
The samurai’s story
vs.
The woodcutter’s story
9. The hybrid era: the number of data integrations
is unsustainable
11. The unified era, 2013+
CLOUD VENDOR / OWN DATA CENTER
Search
Silo
SOME LOW LATENCY LOCAL LOOPS
E-comm
Silo
CRM
SAAS VENDOR #2
Email
marketing
ERP
Silo
CMS
Silo
SAAS VENDOR #1
NARROW DATA SILOES
Streaming APIs /
web hooks
Unified log
LOW LATENCY WIDE DATA
COVERAGE
Archiving
Hadoop
< WIDE DATA
COVERAGE >
< FULL DATA
HISTORY >
FEW DAYS’
DATA HISTORY
Systems
monitoring
Eventstream
HIGH LATENCY LOW LATENCY
Product rec’s
Ad hoc
analytics
Management
reporting
Fraud
detection
Churn
prevention
APIs
12. CLOUD VENDOR / OWN DATA CENTER
Search
Silo
SOME LOW LATENCY LOCAL LOOPS
E-comm
Silo
CRM
SAAS VENDOR #2
Email
marketing
ERP
Silo
CMS
Silo
SAAS VENDOR #1
NARROW DATA SILOES
Streaming APIs /
web hooks
Unified log
Archiving
Hadoop
< WIDE DATA
COVERAGE >
< FULL DATA
HISTORY >
Systems
monitoring
Eventstream
HIGH LATENCY LOW LATENCY
Product rec’s
Ad hoc
analytics
Management
reporting
Fraud
detection
Churn
prevention
APIs
The unified log is Amazon Kinesis, or Apache Kafka
• Amazon Kinesis, a
hosted AWS service
• Extremely similar
semantics to Kafka
• Apache Kafka, an append-
only, distributed, ordered
commit log
• Developed at LinkedIn to
serve as their
organization’s unified log
13. “Kafka is designed to allow a
single cluster to serve as the
central data backbone for a
large organization” [1]
[1] http://kafka.apache.org/
14. So what does a unified log give us?
A single version of the truth
Our truth is now upstream from the data warehouse
The hairball of point-to-point connections has been
unravelled
Local loops have been unbundled
1
2
3
4
15. What does a unified log let us do that we couldn’t do before?
Populating a unified log with
your company’s event streams
Real-time
management
reporting
To enable…
Holistic
systems
monitoring
Re-running
models from
Day 0
A/B testing
end-to-end
pipelines
Shipping
offline
models to RT
… anything requiring low
latency response /
holistic view of our
company’s data!
16. But garbage in, garbage out: it’s crucial to properly model the
event streams feeding into the unified log
Subject
Direct
Object
Indirect
Object
Verb
Event Context
Prep.
Object~
• We are working on a semantic model for events – an “event
grammar” at Snowplow [1]
• The event grammar borrows concepts from human language:
• A semantic model prevents business and technology assumptions
leaking in to the event stream – making it less brittle over time
[1] http://snowplowanalytics.com/blog/2013/08/12/
towards-universal-event-analytics-building-an-event-grammar/
17. We also need to store and version the schemas used to describe
our events, as these will change over time
Unified
log
We have a single version of the truth – together, the unified log plus Hadoop archive represent our single version of the truth. They contain exactly the same data - our event stream - they just have different time windows of data
The single version of the truth is upstream from the data warehouse – in the classic era, the data warehouse provided the single version of the truth, making all reports generated from it consistent. In the unified era, the log provides the single version of the truth: as a result, operational systems (e.g. recommendation and ad targeting systems) compute on the same truth as analysts producing management reports
Point-to-point connections have largely been unravelled - in their place, applications can append to the unified log and other applications can read their writes
Local loops have been unbundled - in place of local silos, applications can collaborate on near-real-time decision-making via the unified log