On large-scale web sites, users leave thousands of traces every second. Businesses need to process and interpret these traces in real time so they can react to the behavior of their users.
In this talk, Andreas will show a real-world example of the power of a modern open-source stack.
He will walk you through the design of a real-time clickstream analysis PaaS based on Apache Spark, Kafka, Parquet and HDFS, explain the decision making behind it and present lessons learned.
3. ONE POT TO RULE THEM ALL
Web Tracking
▪ Clicks & Views
▪ Conversions
Ad Tracking
▪ Ad Impressions
▪ Ad Costs
ERP
▪ Products
▪ Inventory
▪ Margins
CRM
▪ Customer
▪ Orders
▪ Creditworthiness
4. ONE POT TO RULE THEM ALL
Retention, Reach, Monetization
Steer …
▪ Campaigns
▪ Offers
▪ Contents
5. REACT TO WEB SITE TRAFFIC IN REAL TIME
Image: https://www.flickr.com/photos/nick-m/3663923048
10. CALCULATING USER JOURNEYS
[Diagram: the raw event stream from web / ad tracking is split into one time-ordered journey per user]
Legend: C = Click, V = View, VT = View Time, X = Conversion
KPIs:
▪ Unique users
▪ Conversions
▪ Ad costs / conversion value
▪ …
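The sessionization sketched above can be illustrated in plain Java. The `Event` record, the type codes and the KPI helpers here are illustrative stand-ins, not the speaker's actual classes:

```java
import java.util.*;

public class Sessionize {
    // Event types from the legend: C = click, V = view, VT = view time, X = conversion
    record Event(long userId, long timestamp, String type) {}

    // Group the event stream into one time-ordered journey per user
    static Map<Long, List<Event>> userJourneys(List<Event> stream) {
        Map<Long, List<Event>> journeys = new HashMap<>();
        for (Event e : stream) {
            journeys.computeIfAbsent(e.userId(), k -> new ArrayList<>()).add(e);
        }
        journeys.values().forEach(j -> j.sort(Comparator.comparingLong(Event::timestamp)));
        return journeys;
    }

    // KPI examples from the slide: unique users and conversions
    static long uniqueUsers(Map<Long, List<Event>> journeys) {
        return journeys.size();
    }

    static long conversions(Map<Long, List<Event>> journeys) {
        return journeys.values().stream()
                .flatMap(List::stream)
                .filter(e -> e.type().equals("X"))
                .count();
    }
}
```

Grouping by user id and sorting by timestamp is the core of building user journeys; the distributed Spark version follows on a later slide.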
14. “HADOOP & FRIENDS” ARCHITECTURE
▪ Aggregation takes too long
▪ Cumbersome programming model (can be solved with Pig, Cascading et al.)
▪ Not interactive enough
20. FUNCTIONAL ARCHITECTURE
[Diagram: Collection → (Raw Event Stream) → Ingestion → (Events) → Processing → (Fact Entries) → Analytics Warehouse; Atomic Event Frames and Strange Events land in the Data Lake; Master Data Integration feeds Processing]

Ingestion
▪ Buffers load peaks
▪ Ensures message delivery (fire & forget for the client)

Processing
▪ Creates user journeys and unique user sets
▪ Enriches dimensions
▪ Aggregates events to KPIs
▪ Ability to replay for schema evolution
▪ Fault-tolerant message handling
▪ Event handling: apply schema, time partitioning, de-duplication, sanity checks, pre-aggregation, filtering, fraud detection
▪ Tolerates delayed events
▪ High throughput, moderate latency (~1 min)

Analytics Warehouse
▪ The representation of truth
▪ Multidimensional data model
▪ Interactive queries for real-time actions and data exploration

Data Lake
▪ Eternal memory for all events (even strange ones)
▪ One schema per event type, time partitioned
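A few of the event-handling steps named above (de-duplication, sanity checks, time partitioning) can be sketched in plain Java. Class and field names are assumptions, and a `HashSet` stands in for the distributed de-dup state:

```java
import java.util.*;

public class EventHandling {
    record RawEvent(String id, long timestamp, long userId) {}

    // Drop exact duplicates (same event id) and events failing basic sanity checks.
    // In the real pipeline this runs distributed; a HashSet stands in for state here.
    static List<RawEvent> cleanse(List<RawEvent> batch, long now) {
        Set<String> seen = new HashSet<>();
        List<RawEvent> out = new ArrayList<>();
        for (RawEvent e : batch) {
            boolean sane = e.timestamp() > 0 && e.timestamp() <= now && e.userId() > 0;
            if (sane && seen.add(e.id())) {
                out.add(e);
            }
        }
        return out;
    }

    // Time partitioning: map an event to an hourly partition key (epoch millis -> hour bucket)
    static long partitionOf(RawEvent e) {
        return e.timestamp() / 3_600_000L;
    }
}
```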
21. SERIAL CONNECTION OF STREAMING AND BATCHING
[Diagram: Collection → (Raw Event Stream) → Ingestion → Event Data Lake → (Atomic Event Frames) → Processing → (Fact Entries) → Analytics Warehouse, with a SQL Interface on top]

Spark Streaming
▪ Cool programming model
▪ Uniform dev & ops
▪ Simple solution

Parquet
▪ High compression ratio due to column-oriented storage
▪ High scan speed

Spark
▪ Cool programming model
▪ Uniform dev & ops
▪ High performance
▪ Interface to R out of the box
▪ Useful libs: MLlib, GraphX, NLP, …

SQL Interface
▪ Good connectivity (JDBC, ODBC, …)
▪ Interactive queries
▪ Uniform ops
▪ Can easily be replaced thanks to the Hive Metastore

Kafka
▪ Obvious choice for cloud-scale messaging
▪ By far the best throughput and scalability of all evaluated alternatives
22.
public Map<Long, UserJourney> sessionize(JavaRDD<AtomicEvent> events) {
    return events
        // Convert to a pair RDD with the userId as key
        .mapToPair(e -> new Tuple2<>(e.getUserId(), e))
        // Fold each user's events into a journey accumulator
        .<UserJourneyAcc>combineByKey(
            UserJourneyAcc::create,
            UserJourneyAcc::add,
            UserJourneyAcc::combine)
        // Unwrap the finished journeys (accessor name assumed)
        .mapValues(UserJourneyAcc::getUserJourney)
        // Collect the result to the driver as a Java map
        .collectAsMap();
}
24. APACHE FLINK
■ Also has a nice, Spark-like API
■ Promises similar or better performance than Spark
■ Looks like the best solution for a Kappa (κ) architecture
■ But it's also the newest kid on the block
25. EVENT VERSUS PROCESSING TIME
■ There's a difference between event time (t_e) and processing time (t_p).
■ Events arrive out of order even during normal operation.
■ Events may arrive arbitrarily late.
Apply a grace period before processing events.
Allow arbitrary update windows for metrics.
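A minimal sketch of the grace-period idea, assuming millisecond timestamps and one-minute tumbling windows (both choices are illustrative):

```java
import java.util.*;

public class EventTimeWindows {
    // Count events per 1-minute event-time window. Windows stay open for a
    // grace period after their end so late arrivals are still counted.
    static final long WINDOW_MS = 60_000L;

    final long graceMs;
    final Map<Long, Long> counts = new HashMap<>(); // window start -> count

    EventTimeWindows(long graceMs) { this.graceMs = graceMs; }

    // eventTime: t_e from the slide; processingTime: t_p, when we actually see the event
    boolean add(long eventTime, long processingTime) {
        long windowStart = (eventTime / WINDOW_MS) * WINDOW_MS;
        long windowEnd = windowStart + WINDOW_MS;
        if (processingTime > windowEnd + graceMs) {
            return false; // too late: the window has already closed
        }
        counts.merge(windowStart, 1L, Long::sum);
        return true;
    }
}
```

A late event is still counted as long as its window's end plus the grace period has not passed in processing time; anything later is dropped or, in a more tolerant design, routed to a metric-update path.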
30. NO RETENTION PARANOIA
Data Lake (stores Events and Strange Events)
▪ Eternal memory
▪ Close to raw events
▪ Allows replays and refills into the warehouse

Analytics Warehouse
▪ Aggressive forgetting, with a clearly defined retention policy per aggregation level, e.g.:
▪ 15min:30d
▪ 1h:4m
▪ …
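One possible reading of policies like 15min:30d (keep 15-minute aggregates for 30 days) can be sketched as follows; this interpretation and all names are assumptions, not the speaker's definitions:

```java
import java.time.Duration;
import java.util.*;

public class RetentionPolicy {
    // aggregation level -> how long rows at that level are retained
    final Map<Duration, Duration> rules = new LinkedHashMap<>();

    void addRule(Duration aggregationLevel, Duration retention) {
        rules.put(aggregationLevel, retention);
    }

    // Should an aggregate row at this level, written at writeTimeMs, be deleted at nowMs?
    boolean expired(Duration aggregationLevel, long writeTimeMs, long nowMs) {
        Duration retention = rules.get(aggregationLevel);
        if (retention == null) return false; // no rule configured: keep the row
        return nowMs - writeTimeMs > retention.toMillis();
    }
}
```

A periodic cleanup job over the warehouse can then delete everything `expired`, while the data lake keeps the raw events for replays.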
32. ENSURE YOUR SOFTWARE RUNS LOCALLY
The entire architecture must be able to run locally. Keep round trips short for development and testing.
Throughput and reaction times need to be monitored
continuously. Tune your software and the underlying
frameworks as needed.
33. TUNE CONTINUOUSLY
[Diagram: the pipeline from slide 21 (Collection → Ingestion → Event Data Lake → Processing → Analytics Warehouse with SQL Interface), instrumented with a load generator, throughput & latency probes, and system, container and process monitoring]
34. IN NUMBERS
Overall dev effort until the first release: 250 person days
Dimensions: 10
KPIs: 26
Integrated 3rd-party systems: 7
Inbound data volume per day: 80 GB
New data in the DWH per day: 2 GB
Total price of the cheapest cluster able to handle the production load:
38. CALCULATING UNIQUE USERS
■ We need an exact unique user count.
■ If you can, you should use an approximation such as HyperLogLog instead.
[Diagram: one time window contains users U1, U2, U3 (3 UU), the next contains U1 and U4 (2 UU); together the two windows contain 4 unique users]
Flajolet, P.; Fusy, E.; Gandouet, O.; Meunier, F. (2007). "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm". AOFA '07: Proceedings of the 2007 International Conference on the Analysis of Algorithms.
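The windows above also show why exact unique-user counts cannot simply be summed across windows: the same user may appear in several of them. A plain-Java illustration using the user ids from the slide:

```java
import java.util.*;

public class UniqueUsers {
    // Exact unique-user counting must keep the full user-id set per window:
    // the counts themselves are not additive because users recur across windows.
    static int uniqueUsers(Set<String> window) {
        return window.size();
    }

    // Combining windows means unioning their user-id sets, not adding counts
    static Set<String> merge(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union;
    }
}
```

This is why exact counting forces you to carry full user-id sets per window, while sketches such as HyperLogLog trade exactness for a small, mergeable state.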
40. CHOOSING WHERE TO AGGREGATE
[Diagram: Ingestion → Event Data Lake → (Atomic Event Frames) → Processing → (Fact Entries) → Analytics Warehouse → Analytics, with three numbered candidate aggregation points (1, 2, 3) along the pipeline]
In processing, the heavy lifting:
- Enrichment
- Preprocessing
- Validation
At query time:
- Processing steps that can be done at query time
- Interactive queries