Apache Kafka and Amazon Kinesis are more than just message queues: they can serve as a unified log that you can put at the heart of your business, effectively creating a "digital nervous system" around which your company's applications and processes can be restructured.
In this talk, Alex will introduce unified log technology, highlight some killer use cases and show how Kinesis is being used "in anger" at Snowplow. The talk draws on his experience working with event streams over the last two and a half years at Snowplow; it is also heavily influenced by Jay Kreps’ unified log monograph, and by Alex's recent work writing Unified Log Processing, a Manning book. It will show how event streams inside a unified log are an incredibly powerful primitive for building rich event-centric applications, unbundling local transactional silos and creating a single version of truth for a company.
The talk will conclude with a live demo of Amazon Kinesis in action, processing Snowplow events.
Span Conference: Why your company needs a unified log
1. Why your company needs a Unified Log
Span Conference, London, 28th October 2014
2. Introducing myself
• Alex Dean
• Co-founder and technical lead at Snowplow, the open-source event analytics platform based here in London [1]
• Weekend writer of Unified Log Processing, available on the Manning Early Access Program [2]
[1] https://github.com/snowplow/snowplow
[2] http://manning.com/dean
4. A quick history lesson: the three eras of business data processing [1]
  1. The classic era, 1996+
  2. The hybrid era, 2005+
  3. The unified era, 2013+
[1] http://snowplowanalytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing/
5. The classic era of business data processing, 1996+
[Diagram: inside the company's OWN DATA CENTER sit NARROW DATA SILOS (CMS, CRM, E-comm, ERP), each with its own LOW LATENCY LOCAL LOOP and joined by point-to-point connections. A nightly batch ETL process feeds a data warehouse used for management reporting: HIGH LATENCY, WIDE DATA COVERAGE, FULL DATA HISTORY.]
6. The hybrid era, 2005+
[Diagram: NARROW DATA SILOS (Search, E-comm, CRM, ERP, CMS, email marketing, web analytics), each with its own LOW LATENCY LOCAL LOOP, are spread across CLOUD VENDOR / OWN DATA CENTER and SAAS VENDORS #1, #2 and #3. APIs and bulk exports connect the silos to several separate processing systems:
• Stream processing: product rec's (LOW LATENCY)
• Micro-batch processing: systems monitoring (LOW LATENCY)
• Batch processing: data warehouse, management reporting (HIGH LATENCY)
• Batch processing: Hadoop, ad hoc analytics (HIGH LATENCY)]
7. The hybrid era: a surfeit of software vendors
[Diagram: the hybrid-era architecture from slide 6, repeated.]
8. The hybrid era: company-wide reporting and analytics ends up like Rashomon
The bandit’s story vs. the wife’s story vs. the samurai’s story vs. the woodcutter’s story
9. The hybrid era: the number of data integrations is unsustainable
11. The unified era, 2013+
[Diagram: NARROW DATA SILOS (Search, E-comm, CRM, ERP, CMS, email marketing) across CLOUD VENDOR / OWN DATA CENTER and SAAS VENDORS #1 and #2 keep SOME LOW LATENCY LOCAL LOOPS, and are connected through streaming APIs / web hooks to a central unified log.
• Unified log: LOW LATENCY, WIDE DATA COVERAGE, FEW DAYS’ DATA HISTORY
• Archiving to Hadoop: HIGH LATENCY, WIDE DATA COVERAGE, FULL DATA HISTORY
• Uses of the event stream: systems monitoring, product rec's, ad hoc analytics, management reporting, fraud detection, churn prevention, APIs]
12. The unified log is Amazon Kinesis, or Apache Kafka
[Diagram: the unified-era architecture from slide 11, repeated.]
• Amazon Kinesis, a hosted AWS service
  • Extremely similar semantics to Kafka
• Apache Kafka, an append-only, distributed, ordered commit log
  • Developed at LinkedIn to serve as their organization’s unified log
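To make "append-only, ordered commit log" concrete, here is a minimal single-process sketch in Python. It is not how Kafka or Kinesis are implemented or accessed; the class names `AppendOnlyLog` and `Consumer` and the sample event names are hypothetical and exist only for illustration. The sketch shows the two properties the slide relies on: records are only ever appended and keep their order, and each consumer tracks its own position, so several applications can read the same stream independently.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class AppendOnlyLog:
    """Toy stand-in for a single ordered log (hypothetical, in-memory only)."""
    _records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Append a record and return its offset (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> List[Any]:
        """Return up to max_records starting at offset; the log is not modified."""
        return self._records[offset:offset + max_records]


@dataclass
class Consumer:
    """Each consumer keeps its own offset, so readers do not affect each other."""
    log: AppendOnlyLog
    offset: int = 0

    def poll(self, max_records: int = 10) -> List[Any]:
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = AppendOnlyLog()
for event in ["page_view", "add_to_basket", "checkout"]:
    log.append(event)

monitoring = Consumer(log)
reporting = Consumer(log)

first = monitoring.poll(max_records=2)  # monitoring has read two events so far
everything = reporting.poll()           # reporting independently reads all three
```

Because reading never removes records, a new consumer can start at offset 0 and process the whole retained history, which is what separates a log from a traditional queue.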
13. “Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization” [1]
[1] http://kafka.apache.org/
14. So what does a unified log give us?
  1. A single version of the truth
  2. Our truth is now upstream from the data warehouse
  3. The hairball of point-to-point connections has been unravelled
  4. Local loops have been unbundled
15. What does a unified log let us do that we couldn’t do before?
Populating a unified log with your company’s event streams, to enable…
• Real-time management reporting
• Holistic systems monitoring
• Re-running models from Day 0
• A/B testing end-to-end pipelines
• Shipping offline models to real time
… anything requiring low latency response / holistic view of our company’s data!
16. But garbage in, garbage out: it’s crucial to properly model the event streams feeding into the unified log
• We are working on a semantic model for events, an “event grammar”, at Snowplow [1]
• The event grammar borrows concepts from human language:
  [Diagram: an Event made up of Subject, Verb, Direct Object, Indirect Object, Prepositional Object and Context]
• A semantic model prevents business and technology assumptions leaking into the event stream, making it less brittle over time
[1] http://snowplowanalytics.com/blog/2013/08/12/towards-universal-event-analytics-building-an-event-grammar/
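As a rough illustration of the grammar slots named on this slide, the sketch below models an event as an immutable Python record. The field names are taken from the slide labels; the class itself, the `roles` helper and the example values are assumptions made for this sketch, not Snowplow's actual event format.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Event:
    """One event expressed with the grammar slots from the slide (hypothetical model)."""
    subject: str                                 # who or what carried out the action
    verb: str                                    # the action itself
    direct_object: str                           # the thing acted upon
    indirect_object: Optional[str] = None        # optional recipient of the action
    prepositional_object: Optional[str] = None   # optional "with/in/from ..." entity
    context: Dict[str, Any] = field(default_factory=dict)  # when, where, how

    def roles(self) -> Dict[str, str]:
        """Return only the grammar slots that are populated, in grammar order."""
        slots = {
            "subject": self.subject,
            "verb": self.verb,
            "direct_object": self.direct_object,
            "indirect_object": self.indirect_object,
            "prepositional_object": self.prepositional_object,
        }
        return {name: value for name, value in slots.items() if value is not None}


# Illustrative values only: "a shopper adds an item to a basket".
event = Event(
    subject="shopper-123",
    verb="add",
    direct_object="item-456",
    indirect_object="basket-789",
    context={"timestamp": "2014-10-28T10:00:00Z"},
)
populated = event.roles()
```

Keeping the slots generic (subject, verb, objects, context) rather than naming them after one business process is what lets the same shape describe page views, purchases or emails without reworking the stream.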
17. We also need to store and version the schemas used to describe our events, as these will change over time
[Diagram: Unified log]
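One way to picture "store and version the schemas" is a small registry keyed by schema name and version number, against which each event in the log can be checked. The sketch below is an assumption-laden toy: the `SchemaRegistry` class, the integer versions, the event envelope (`schema`, `version`, `data`) and the "required fields" notion of a schema are all hypothetical and are not taken from the talk.

```python
from typing import Dict, Iterable, Set, Tuple

SchemaKey = Tuple[str, int]  # (schema name, version), a hypothetical keying scheme


class SchemaRegistry:
    """Toy in-memory schema store; a schema here is just a set of required fields."""

    def __init__(self) -> None:
        self._schemas: Dict[SchemaKey, Set[str]] = {}

    def register(self, name: str, version: int, required_fields: Iterable[str]) -> None:
        key = (name, version)
        if key in self._schemas:
            # A published version never changes; a change means a new version.
            raise ValueError(f"{name} v{version} is already registered")
        self._schemas[key] = set(required_fields)

    def validate(self, event: dict) -> bool:
        """Check an event against the exact schema version it declares."""
        required = self._schemas.get((event["schema"], event["version"]))
        if required is None:
            return False  # unknown schema or version
        return required.issubset(event["data"].keys())


registry = SchemaRegistry()
registry.register("page_view", 1, ["url"])
registry.register("page_view", 2, ["url", "referrer"])  # schema evolved over time

old_event = {"schema": "page_view", "version": 1, "data": {"url": "/home"}}
new_event = {"schema": "page_view", "version": 2, "data": {"url": "/home"}}

old_ok = registry.validate(old_event)  # valid against version 1
new_ok = registry.validate(new_event)  # missing "referrer" required by version 2
```

Because every event names the version it was written against, old events in the log stay readable after the schema changes.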
18. How are we embracing the unified log at Snowplow?
19. Some background: early on, we decided that Snowplow should be composed of a set of loosely coupled subsystems
  1. Trackers: generate event data from any environment
  2. Collectors: log raw events from trackers
  3. Enrich: validate and enrich raw events
  4. Storage: store enriched events ready for analysis
  5. Analytics: analyze enriched events
[Diagram legend: = Standardised data protocols]
These turned out to be critical to allowing us to evolve the above stack
20. Today almost all users/customers are running a batch-based Snowplow configuration
[Diagram components: Snowplow event tracking SDK, HTTP-based event collector, Amazon S3, Hadoop-based enrichment, Amazon Redshift]
• Batch-based
• Normally run overnight; sometimes every 4-6 hours
The Snowplow batch-based flow uses Amazon S3 as a “poor man’s” unified log
21. Can we implement Snowplow on top of Kinesis/Kafka?
[Diagram: the unified-era architecture from slide 11, repeated.]
22. We are working on Amazon Kinesis support first; Apache Kafka will come later (using Apache Samza for stream processing)
[Diagram: Snowplow Trackers send events to the Scala Stream Collector, which writes the raw event stream. The Enrich Kinesis app reads it and writes the enriched event stream and the bad raw events stream. Kinesis apps downstream: S3 sink (to S3), Redshift sink (to Redshift), Elasticsearch sink (to Elasticsearch), Event aggregator (to DynamoDB). A legend marks some components as "not yet released".]
• Analytics on Read (for agile exploration of event stream, ML, auditing, applying alternate models, reprocessing etc)
• Analytics on Write (for dashboarding, audience segmentation, RTB, etc)
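The enrich step in the diagram reads one stream and writes two: events that pass validation go to the enriched stream, the rest go to a bad-events stream instead of being dropped. The sketch below shows that routing pattern with plain Python lists standing in for Kinesis streams. It is not Snowplow's Enrich code: the JSON event shape, the `event_id` and `page_url` fields and the `page_host` enrichment are assumptions chosen for illustration.

```python
import json
from typing import List, Tuple
from urllib.parse import urlparse


def enrich(raw_events: List[str]) -> Tuple[List[dict], List[dict]]:
    """Split raw lines into (enriched events, bad rows); lists stand in for streams."""
    enriched: List[dict] = []
    bad: List[dict] = []
    for line in raw_events:
        try:
            event = json.loads(line)
            # Hypothetical validation rule: an event is a JSON object with an event_id.
            if not isinstance(event, dict) or "event_id" not in event:
                raise ValueError("event must be a JSON object with an event_id")
        except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
            # Keep the original payload and the reason, so it can be inspected later.
            bad.append({"line": line, "error": str(err)})
            continue
        # Hypothetical enrichment: derive the host from the page URL.
        event["page_host"] = urlparse(event.get("page_url", "")).netloc
        enriched.append(event)
    return enriched, bad


raw_stream = [
    '{"event_id": "e1", "page_url": "http://snowplowanalytics.com/blog/"}',
    "this is not JSON",
    '{"page_url": "http://snowplowanalytics.com/"}',
]
enriched_stream, bad_stream = enrich(raw_stream)
```

Writing failures to their own stream keeps the enriched stream clean for the downstream sinks while preserving the rejected payloads for auditing or reprocessing.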
24. Questions?
http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata
To meet up or chat, @alexcrdean on Twitter or alex@snowplowanalytics.com
Discount code: spancNw, 43% off all Manning eBooks for Span :-)