3. Evolving from our Auction Roots
New Items: 80%
Fixed Price Items: 86%
Ships for Free: 65%
*Q2 2016 data
4. EBAY AT A GLANCE
$8.6B Revenue in 2015
$82B GMV in 2015
1B Live Listings *
164M Global Active Buyers *
326M App downloads *
190 Countries eBay apps are available in #
*Q2 2016 data
#Q4 2015 data
6. ENTERPRISE DATA PLATFORM
Data Stores
Data Streams & Processing
Machine Learning
Enterprise Data Ecosystem
Data Ingestion
Personalization / Optimization
Insights / Reporting
Kylin
8. Pulsar Stream
• Focus on user behavioral data processing
• Complex event processing
– Streaming SQL with extensible annotations
– Java
• SQL for common stream operations (filtering, mutation, aggregation) with time windows
• Declarative topology construction
• Each stage can adopt its own release and deployment cycles
• Dynamic partitioning and flow control
10. Event Filtering and Routing Example
// create filtered stream
insert into FilteredStream select guid, evt_type, C1, C2, C3
from RawStream where evt_type = 'bid';
// publish and route filtered stream
@PublishOn(topics="Topic1")
@Output("OutboundChannel")
@ClusterAffinityTag(column = guid)
select * from FilteredStream;
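For readers unfamiliar with streaming SQL, the same two steps (filter, then route with affinity on guid) can be sketched in plain Python. This is an illustration only, not how Pulsar implements routing: the function names, the `num_consumers` parameter, and the hash-modulo scheme are assumptions chosen to show that events sharing a guid land on the same consumer.

```python
import hashlib

COLUMNS = ("guid", "evt_type", "C1", "C2", "C3")


def filter_events(events, evt_type):
    """Keep events of one type and project the selected columns."""
    return [
        {column: event.get(column) for column in COLUMNS}
        for event in events
        if event.get("evt_type") == evt_type
    ]


def route(event, num_consumers):
    """Pick a consumer index from a stable hash of the event's guid.

    Hypothetical affinity scheme: the same guid always maps to the same index.
    """
    digest = hashlib.md5(event["guid"].encode("utf-8")).hexdigest()
    return int(digest, 16) % num_consumers


raw_stream = [
    {"guid": "g1", "evt_type": "bid", "C1": 1, "C2": 2, "C3": 3, "extra": "x"},
    {"guid": "g2", "evt_type": "view", "C1": 4, "C2": 5, "C3": 6},
    {"guid": "g1", "evt_type": "bid", "C1": 7, "C2": 8, "C3": 9},
]
filtered_stream = filter_events(raw_stream, "bid")
targets = [route(event, num_consumers=4) for event in filtered_stream]
```

Both filtered events carry guid g1, so both entries in `targets` are the same consumer index.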
11. Aggregate Computation Example
// create 10-second time window context
create context MCContext start @now end pattern
[timer:interval(10)];
// create aggregated stream within specified time window
context MCContext insert into AggStream
select count(*) as M1, guid, evt_type from RawStream
group by guid, evt_type output snapshot when terminated;
// publish aggregated stream
select * from AggStream;
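The windowed count can also be read as the following Python sketch. It is an approximation under stated assumptions: events arrive as `(timestamp_seconds, guid, evt_type)` tuples (a hypothetical shape), and windows are aligned to multiples of the window length rather than to the moment the statement starts, as `start @now` would do.

```python
from collections import Counter, defaultdict


def tumbling_counts(events, window_seconds=10):
    """Count events per (guid, evt_type) in fixed, non-overlapping windows.

    Returns {window_start: Counter({(guid, evt_type): count})}.
    """
    windows = defaultdict(Counter)
    for timestamp, guid, evt_type in events:
        # Assumption: windows are aligned to multiples of window_seconds.
        window_start = int(timestamp // window_seconds) * window_seconds
        windows[window_start][(guid, evt_type)] += 1
    return dict(windows)


events = [
    (0, "g1", "bid"),
    (3, "g1", "bid"),
    (9.9, "g2", "view"),
    (10, "g1", "bid"),
]
agg_stream = tumbling_counts(events)
```

Each key of `agg_stream` corresponds to one terminated window, and its counter holds the M1 value for every (guid, evt_type) group seen in that window.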
13. TopN Computation Example
// create 60-second time window context
create context MCContext start @now end pattern
[timer:interval(60)];
// create topN stream via sorting
context MCContext insert into TopNStream
select count(*) as M1, guid, evt_type from RawStream
group by guid, evt_type order by M1 desc limit 10;
// publish topN stream
select * from TopNStream;
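As a rough Python equivalent for the events of a single window, a counter with a top-N cut gives the same shape of result. The input shape, `(guid, evt_type)` pairs already restricted to one 60-second window, is an assumption made for the sketch.

```python
from collections import Counter


def top_n(window_events, n=10):
    """Return the n most frequent (guid, evt_type) groups, highest count first."""
    counts = Counter(window_events)
    return counts.most_common(n)


window_events = [
    ("g1", "bid"), ("g1", "bid"), ("g1", "bid"),
    ("g2", "view"), ("g2", "view"),
    ("g3", "bid"),
]
top_two = top_n(window_events, n=2)
```

`Counter.most_common` sorts by count in descending order, which is the ordering a top-N query needs.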
15. Pulsar Behavioral Data Pipeline
Diagram: Real-time Pipeline
• Producing Applications
• Collector, BOT Detection, Sessionizer, Event Distributor, Metrics Calculator
• Enriched Sessionized Events
• Real Time Consumers, Real Time Dashboard and Services
• Metrics Store, Kafka, Druid, Batch Loader, Hadoop / Kylin
16. Sessionization: Group together events of a single user visit
Event stream: e1 e2 e3 e4 e5 e6 e7 e8 . . .
User 1: >30 min of inactivity between e4 and e7
Session A (User 1): e1, e2, e4
Session B (User 2): e3, e5, e6
Session C (User 1): e7, e8
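The grouping rule can be sketched in a few lines of Python. The timestamps below are invented for illustration; only the session memberships and the 30-minute inactivity threshold come from the slide. The sketch processes a finite list in memory, which a real streaming sessionizer would not do.

```python
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (timestamp, user, event_id) tuples into per-user sessions.

    A new session starts when a user has been inactive for more than `gap`.
    Returns a list of (user, [event_ids]) in the order sessions begin.
    """
    sessions = []
    open_sessions = {}  # user -> (index into sessions, last event timestamp)
    for timestamp, user, event_id in sorted(events):
        current = open_sessions.get(user)
        if current is None or timestamp - current[1] > gap:
            sessions.append((user, [event_id]))
            index = len(sessions) - 1
        else:
            index = current[0]
            sessions[index][1].append(event_id)
        open_sessions[user] = (index, timestamp)
    return sessions


# Hypothetical timestamps (seconds); user 1 is idle for more than 30 minutes after e4.
events = [
    (0, "user1", "e1"), (60, "user1", "e2"), (120, "user2", "e3"),
    (180, "user1", "e4"), (240, "user2", "e5"), (300, "user2", "e6"),
    (2100, "user1", "e7"), (2160, "user1", "e8"),
]
sessions = sessionize(events)
```

With these inputs, `sessions` holds three entries matching Sessions A, B and C.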
17. Sessionization Challenges
• Session state management
– High read/write throughput
– State recovery when a node crashes or fails
• Session Expiration
– Full table scan is not acceptable
18. Sessionization Solution
• Long-lived state management (at least 30 minutes)
– Local Off-Heap Cache
• Instantaneous session expiration (<= 1 sec delay)
– Doubly linked off-heap map (local access)
– Ordered by expiration time (O(1))
• Pluggable Sessionization logic
– SQL with customized annotation
– Counter
– State
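The idea of keeping sessions ordered by expiration time, so that expiry never needs a full scan, can be illustrated with Python's `OrderedDict`. This is an on-heap toy, not the off-heap structure the slide describes; the class name, method names, and the assumption that activity timestamps arrive in non-decreasing order are all hypothetical. With a fixed TTL, moving a session to the tail on every event keeps the head as the next session to expire.

```python
from collections import OrderedDict


class SessionStore:
    """Sessions kept in expiration order; the oldest entry is always at the head."""

    def __init__(self, ttl_seconds=30 * 60):
        self.ttl = ttl_seconds
        self._sessions = OrderedDict()  # key -> (expires_at, state)

    def touch(self, key, now, state=None):
        """Record activity: refresh the expiration and move the entry to the tail."""
        _, old_state = self._sessions.pop(key, (None, None))
        new_state = state if state is not None else old_state
        self._sessions[key] = (now + self.ttl, new_state)

    def expire(self, now):
        """Remove expired sessions from the head; stop at the first live one."""
        expired = []
        while self._sessions:
            key, (expires_at, state) = next(iter(self._sessions.items()))
            if expires_at > now:
                break
            self._sessions.popitem(last=False)
            expired.append((key, state))
        return expired


store = SessionStore(ttl_seconds=1800)
store.touch("A", now=0, state={"events": 1})
store.touch("B", now=100)
store.touch("A", now=200)  # A moves behind B and now expires at 2000
```

Calling `expire` on a timer (for example once per second) only ever inspects the head of the map, so its cost depends on the number of sessions that actually expired rather than on the total number of live sessions.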
20. Bot Detection Overview
• Detect non-human activities in near real time
• May treat bot traffic differently during analysis
• High level bot rules
– Self-declared bots by user agent
– Behavior within a session or time window
• Tag events with bot flag
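A minimal sketch of the two rule families, assuming a session is a list of event dictionaries with a `user_agent` field. The marker strings and the per-session event threshold are made-up placeholders, not eBay's actual rules.

```python
BOT_UA_MARKERS = ("bot", "crawler", "spider")  # hypothetical markers
MAX_EVENTS_PER_SESSION = 1000                  # hypothetical threshold


def is_self_declared_bot(user_agent):
    """Rule 1: the user agent string announces itself as a bot."""
    lowered = user_agent.lower()
    return any(marker in lowered for marker in BOT_UA_MARKERS)


def tag_bot_events(session_events, max_events=MAX_EVENTS_PER_SESSION):
    """Return copies of a session's events with a boolean 'bot' flag added."""
    declared = any(
        is_self_declared_bot(event.get("user_agent", "")) for event in session_events
    )
    # Rule 2: behavior within the session, here simply an unusually high event count.
    too_busy = len(session_events) > max_events
    flag = declared or too_busy
    return [dict(event, bot=flag) for event in session_events]


human = tag_bot_events([{"guid": "g1", "user_agent": "Mozilla/5.0"}])
crawler = tag_bot_events([{"guid": "g2", "user_agent": "ExampleBot/1.0"}])
busy = tag_bot_events([{"guid": "g3", "user_agent": "Mozilla/5.0"}] * 5, max_events=3)
```

Tagging rather than dropping events leaves the decision about how to treat bot traffic to downstream analysis.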
23. Pulsar Integration with Kafka
• Kafka
– Persistent messaging queue
– High availability, scalability and throughput
• Pulsar leveraging Kafka
– Supports pull and hybrid messaging model
– Loading of data from real-time pipeline into Hadoop and other metric stores
– Use schema to validate event payload
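Schema validation of an event payload might look like the sketch below. The slide does not say which schema format is used, so the field names, the dictionary-based schema, and the `validate_event` helper are assumptions for illustration only.

```python
EVENT_SCHEMA = {"guid": str, "evt_type": str, "timestamp": int}  # hypothetical schema


def validate_event(payload, schema=EVENT_SCHEMA):
    """Return a list of problems; an empty list means the payload conforms."""
    problems = []
    for field, expected_type in schema.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    return problems


ok = validate_event({"guid": "g1", "evt_type": "bid", "timestamp": 1467331200})
bad = validate_event({"guid": "g1", "timestamp": "yesterday"})
```

Rejecting or quarantining payloads that return a non-empty problem list keeps malformed events out of Hadoop and the metric stores.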
25. Pulsar Integration with Kylin
• Apache Kylin
– Distributed analytics engine
– SQL interface and multi-dimensional analysis (OLAP) on Hadoop
– Interactive Query on Billions of Rows
• Pulsar leveraging Kylin
– Build multi-dimensional OLAP cube over long time period
– Aggregate/drill-down on dimensions such as browser, OS, device, geo location
– Capture metrics such as session length, page views, event counts
26. Pulsar Integration with Druid
• Druid
– Real-time ROLAP engine for aggregation, drill-down and slice-n-dice
• Pulsar leveraging Druid
– Real-time analytics dashboard
– Near real-time metrics like number of visitors in the last 5 minutes, refreshing every 10 seconds
– Aggregate/drill-down on dimensions such as browser, OS, device, geo location
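The "visitors in the last 5 minutes" metric is a sliding-window distinct count. The sketch below shows the computation in plain Python and makes no claim about how Druid stores or queries data; the class name and the assumption that events are added in timestamp order are hypothetical.

```python
from collections import deque


class RecentVisitors:
    """Distinct visitors seen within a sliding time window."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self._events = deque()  # (timestamp, guid), oldest first
        self._counts = {}       # guid -> events still inside the window

    def add(self, timestamp, guid):
        """Record one event; assumes timestamps arrive in non-decreasing order."""
        self._events.append((timestamp, guid))
        self._counts[guid] = self._counts.get(guid, 0) + 1

    def count(self, now):
        """Number of distinct guids with an event in (now - window, now]."""
        cutoff = now - self.window
        while self._events and self._events[0][0] <= cutoff:
            _, guid = self._events.popleft()
            self._counts[guid] -= 1
            if self._counts[guid] == 0:
                del self._counts[guid]
        return len(self._counts)


visitors = RecentVisitors(window_seconds=300)
visitors.add(0, "a")
visitors.add(100, "b")
visitors.add(200, "a")
readings = [visitors.count(now) for now in (250, 300, 400, 500)]
```

A dashboard refreshing every 10 seconds would call `count` on that interval; here `readings` samples the metric at four points in time.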
29. TRENDING: ALGORITHMS NARROW THE FOCUS
Algorithms and machine learning
identify significant trends
Humans provide the context
and an interesting story
31. EXAMPLE PERSONALIZED CONTENT
Personalized digest for a
consumer interested in
jewelry and accessories
Personalized digest for a
consumer interested in
auto and electronics