3. Building today’s most powerful, open, and customizable advertising technology platform.
4. Ad is served in <100 milliseconds
[Diagram: real-time auction flow. A bid request for a 300x250 ad slot goes out; Advertiser 1, 2, and 3 respond with bids of $2.50, $3.25, and $4.10; AppNexus optimization selects the winning bid ($4.10) and the ad response is served, all in under 100 milliseconds.]
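To make the flow concrete, here is a minimal, hypothetical sketch of the winner-selection step in the diagram (not AppNexus's production logic; names and values are illustrative):

# Hypothetical winner selection: highest bid wins the auction.
bids = {"advertiser_1": 2.50, "advertiser_2": 3.25, "advertiser_3": 4.10}
winner = max(bids, key=bids.get)  # advertiser_3 wins at $4.10
print(winner, bids[winner])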
5. Evolution of AppNexus
• People: from 20 to 350 to 430
• Ad requests: from 100M to 39B to 45B
• 5000+ servers
• Data stores: MySQL, Hadoop/HBase, Aerospike, Netezza, Vertica
• 38+ TB of data every day
• 99.99% uptime
8. Python at AppNexus
Python enables us to scale our team and to rapidly iterate and prototype new technologies.
9. Hadoop at AppNexus
Hadoop enables us to do aggregations for reporting and other data pipeline jobs.
• 1 PB cluster
• 862 nodes across several clusters
• 40 billion log records daily
• 5.6 billion log records/hour at peak
10. Data modeling today
[Diagram: two data paths. Big data (TBs/hour): logs flow into Hadoop, where they are aggregated (Σ) for data services. Medium data (GBs/hour): logs flow through a cache into Vertica, where tasks feed data-driven decisioning.]
11. To enable the next
generation of data modeling,
we need to leverage our
Hadoop cluster
12. What are we trying to do?
• Access the data on Hadoop
• Continue to use Python to model
→ No consensus on the best solution, so we conducted our own research to evaluate integration options.
13. The budget problem
We have thousands of bidders buying billions
of ads per hour in real-time auctions.
We need to create a model that can manipulate
how our bidders spend their budgets and
purchase ads.
14. Data modeling today
[Diagram: the data flow from slide 10, repeated; data-driven decisioning currently hangs off the medium-data (cache/Vertica) path rather than the Hadoop path.]
15. Test problem: Budget aggregation
SCENARIO:
Each auction creates a row in a log:
timestamp, auction_id, object_type, object_id, method, value
We need to aggregate and model these records to update bidders.
16. Method: Budget aggregation
STEP 1: De-duplicate records where
KEY: object_type, object_id, method, auction_id
STEP 2: Aggregate value where
KEY: object_type, object_id, method
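As a reference point before the MapReduce variants, here is a minimal pure-Python sketch of the two steps, assuming tab-delimited log lines in the field order above:

from collections import defaultdict

def budget_agg(lines):
    seen = set()                 # step 1: de-duplication keys
    totals = defaultdict(float)  # step 2: aggregation keys
    for line in lines:
        ts, auction_id, obj_type, obj_id, method, value = line.rstrip('\n').split('\t')
        dedup_key = (obj_type, obj_id, method, auction_id)
        if dedup_key in seen:
            continue             # drop duplicate records for the same auction
        seen.add(dedup_key)
        totals[(obj_type, obj_id, method)] += float(value)
    return totals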
17. HARDWARE
• 300 GB of log data
• 5 nodes running Scientific Linux 6.3 (Carbon)
• Intel Xeon CPU @ 2.13 GHz, 4 cores
• 2 TB Disk
• CDH4
• 45 map, 35 reduce tasks at a time
18. Research: Potential solutions
1. Native Java
2. Streaming (no framework)
3. mrjob
4. Happy / Jython / PyCascading
5. Pig + Jython UDF
6. Pydoop (prohibitive installation)
7. Disco (separate framework; we are evaluating Hadoop)
8. Hadoopy / dumbo (similar to mrjob)
9. Hipy (effectively an ORM for Hive)
20. Research: Native Java
Benchmark for comparison, using the new Hadoop Java API.
[Code: BudgetAgg.java Mapper and Reducer classes]
21. Research: Native Java
USABILITY:
› Not straightforward for analysts to implement, launch, or tweak
PERFORMANCE:
› Fastest implementation
› Can be further enhanced by overriding comparators for grouping and sorting
22. Research: Native Java
VERSATILITY / FLEXIBILITY:
› Ability to customize pretty much everything
› Custom Partitioner, Comparator, and Grouping Comparator in our implementation
› Can use complex objects as keys or values
24. Research: Streaming
USABILITY:
› Key/value detection has to be done by the user
› Still straightforward for relatively simple jobs
hadoop jar /usr/lib/hadoop-0.23.0-mr1-cdh4b1/contrib/streaming/hadoop-*streaming*.jar \
  -D stream.num.map.output.key.fields=4 \
  -D num.key.fields.for.partition=3 \
  -D mapred.reduce.tasks=35 \
  -file mapper.py \
  -mapper mapper.py \
  -file reducer_nongroup.py \
  -reducer reducer_nongroup.py \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -input /logs/log_budget/v002/2013/03/06/19/ \
  -output bidder_logs/streaming_output
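The mapper and reducer named in the command could look like the following minimal sketch (a hypothetical field layout; not the exact scripts from the benchmark). The mapper emits a 4-field key so the KeyFieldBasedPartitioner can partition on the first 3 fields:

#!/usr/bin/env python
# mapper.py: emit (object_type, object_id, method, auction_id) as the key
# and value as the payload, tab-delimited for Hadoop Streaming.
import sys

for line in sys.stdin:
    ts, auction_id, obj_type, obj_id, method, value = line.rstrip('\n').split('\t')
    print('\t'.join([obj_type, obj_id, method, auction_id, value]))

#!/usr/bin/env python
# reducer_nongroup.py: input arrives sorted by the full 4-field key, so
# de-duplication and aggregation happen in a single line-by-line pass.
import sys

prev_group, prev_key, total = None, None, 0.0

def flush(group, total):
    if group is not None:
        print('\t'.join(list(group) + [str(total)]))

for line in sys.stdin:
    obj_type, obj_id, method, auction_id, value = line.rstrip('\n').split('\t')
    group = (obj_type, obj_id, method)
    key = group + (auction_id,)
    if key == prev_key:
        continue                  # duplicate record for the same auction
    if group != prev_group:
        flush(prev_group, total)  # emit the completed aggregation group
        prev_group, total = group, 0.0
    prev_key = key
    total += float(value)

flush(prev_group, total)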
25. Research: Streaming
PERFORMANCE:
› ~50% slower than Java
VERSATILITY / FLEXIBILITY:
› Inputs in the reducer are iterated line-by-line
› Straightforward to get de-duplication and aggregation to work in a single step
28. Research: mrjob
PERFORMANCE:
› Involving objects or multiple steps slows it down a lot
VERSATILITY / FLEXIBILITY:
› Can define Input / Internal / Output protocols
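For reference, a minimal mrjob sketch of the budget aggregation (our assumption of the field layout, using mrjob's MRStep API; protocols left at their defaults):

from mrjob.job import MRJob
from mrjob.step import MRStep

class BudgetAgg(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_dedup, reducer=self.reducer_dedup),
            MRStep(reducer=self.reducer_sum),
        ]

    def mapper_dedup(self, _, line):
        # Step 1 key: duplicate records for one auction collapse together.
        ts, auction_id, obj_type, obj_id, method, value = line.split('\t')
        yield (obj_type, obj_id, method, auction_id), float(value)

    def reducer_dedup(self, key, values):
        obj_type, obj_id, method, _auction_id = key
        # Duplicates carry the same value, so keep a single one.
        yield (obj_type, obj_id, method), next(iter(values))

    def reducer_sum(self, key, values):
        # Step 2: aggregate value per (object_type, object_id, method).
        yield key, sum(values)

if __name__ == '__main__':
    BudgetAgg.run()

Defining raw or pickle protocols, as the slide notes, is the usual way to claw back some of the default JSON serialization overhead between steps.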
29. Research: Happy / Jython
HAPPY:
› Full access to Java MapReduce API
› Happy project is deprecated
› Depends on Hadoop 0.17
JYTHON:
› Doesn’t work easily out of the box
› Relies on deprecated Jython compiler in Jython 2.2
› Limited to Jython implementation of Python
› Numpy/SciPy and Pandas unavailable
30. Research: PyCascading
A Python wrapper around the Cascading framework for data-processing workflows.
Uses Jython as the high-level language for defining workflows.
31. Research: PyCascading
USABILITY:
› Relatively new project
› Cascading API is simple and intuitive
› Job Planner abstracts details of MapReduce
PERFORMANCE:
› Abstraction makes performance tuning challenging
› Does not support Combiner operation
› Dev time was fast, runtime was slow
33. Research: Pig
Provides a high-level language for data analysis
which is compiled into a sequence of MapReduce
operations.
34. Research: Pig
USABILITY:
› Powerful debugging and optimization tools (e.g. explain, illustrate)
› Automatically optimizes MapReduce operations:
› Applies Combiner operations where applicable
› Reorders and conflates data flow for efficiency
35. Research: Pig
PERFORMANCE:
› Pig compiler produces performant code
› Complex operations might require manual optimization
› Budget aggregation required the implementation of a User Defined Function (UDF) in Jython to eliminate an unnecessary MapReduce step (sketched below)
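As an illustration, such a Jython UDF might look like the sketch below (hypothetical name and schema; a script like this is registered in Pig Latin via REGISTER 'udfs.py' USING jython AS udfs):

# udfs.py: pick a single value per auction from a grouped bag so that
# de-duplication happens inside one GROUP BY instead of a separate pass.
@outputSchema('value:double')
def first_value(bag):
    # Duplicate records for the same auction carry the same value,
    # so returning any one element of the bag is enough.
    for t in bag:
        return t.get(0)
    return None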
36. Research: Pig
VERSATILITY / FLEXIBILITY:
USING PIG + JYTHON UDF
› Pig Latin is expressive and can capture most use cases
› Define custom data operations in Jython, called UDFs
› UDFs can implement custom loaders, partitioners, and other advanced features
37. Research: Summary
Running Time / Lines of Code for Implementations
[Chart: running time in minutes and lines of code (0-300 scale) for each implementation: Java, Streaming, MRJob, PyCascading, and Pig.]
38. Research: Recommendations
• Pig and PyCascading enable complex
pipelines to be expressed simply
• Pig is more mature and the most viable
option for ad-hoc analysis