3. Building today’s most powerful, open, and customizable advertising technology platform.
4. Ad is served in <100 milliseconds
[Diagram: real-time auction flow. A bid request for a 300x250 ad slot goes out; Advertiser 1, 2, and 3 respond with bids of $2.50, $3.25, and $4.10; AppNexus optimization selects the winning bid ($4.10) and the ad response is served, all in under 100 milliseconds.]
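To make the flow concrete, here is a minimal, hypothetical sketch of the winner-selection step in the diagram (not AppNexus's production logic; names and values are illustrative):

# Hypothetical winner selection: highest bid wins the auction.
bids = {"advertiser_1": 2.50, "advertiser_2": 3.25, "advertiser_3": 4.10}
winner = max(bids, key=bids.get)  # advertiser_3 wins at $4.10
print(winner, bids[winner])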
5. Evolution of AppNexus
• People: from 20 to 350 to 430
• Ad requests: from 100M to 39B to 45B
• 5000+ servers
• Data stores: MySQL, Hadoop/HBase, Aerospike, Netezza, Vertica
• 38+ TB of data every day
• 99.99% uptime
8. Python at AppNexus
Python enables us to scale our team and to rapidly iterate and prototype new technologies.
9. Hadoop at AppNexus
Hadoop enables us to do aggregations for reporting and other data pipeline jobs.
• 1 PB cluster
• 862 nodes across several clusters
• 40 billion log records daily
• 5.6 billion log records/hour at peak
10. Data modeling today
[Diagram: two data paths. Big data (TBs/hour): logs flow into Hadoop, where they are aggregated (Σ) for data services. Medium data (GBs/hour): logs flow through a cache into Vertica, where tasks feed data-driven decisioning.]
11. To enable the next
generation of data modeling,
we need to leverage our
Hadoop cluster
12. What are we trying to do?
• Access the data on Hadoop
• Continue to use Python to model
→ No consensus on the best solution, so we conducted our own research to evaluate integration options.
13. The budget problem
We have thousands of bidders buying billions
of ads per hour in real-time auctions.
We need to create a model that can manipulate
how our bidders spend their budgets and
purchase ads.
14. Data modeling today
[Diagram: the data flow from slide 10, repeated; data-driven decisioning currently hangs off the medium-data (cache/Vertica) path rather than the Hadoop path.]
15. Test problem: Budget aggregation
SCENARIO:
Each auction creates a row in a log:
timestamp, auction_id, object_type, object_id, method, value
We need to aggregate and model these records to update bidders.
16. Method: Budget aggregation
STEP 1: De-duplicate records where
KEY: object_type, object_id, method, auction_id
STEP 2: Aggregate value where
KEY: object_type, object_id, method
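As a reference point before the MapReduce variants, here is a minimal pure-Python sketch of the two steps, assuming tab-delimited log lines in the field order above:

from collections import defaultdict

def budget_agg(lines):
    seen = set()                 # step 1: de-duplication keys
    totals = defaultdict(float)  # step 2: aggregation keys
    for line in lines:
        ts, auction_id, obj_type, obj_id, method, value = line.rstrip('\n').split('\t')
        dedup_key = (obj_type, obj_id, method, auction_id)
        if dedup_key in seen:
            continue             # drop duplicate records for the same auction
        seen.add(dedup_key)
        totals[(obj_type, obj_id, method)] += float(value)
    return totals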
17. HARDWARE
• 300 GB of log data
• 5 nodes running Scientific Linux 6.3 (Carbon)
• Intel Xeon CPU @ 2.13 GHz, 4 cores
• 2 TB Disk
• CDH4
• 45 map, 35 reduce tasks at a time
18. Research: Potential solutions
1. Native Java
2. Streaming (no framework)
3. mrjob
4. Happy / Jython / PyCascading
5. Pig + Jython UDF
6. Pydoop (prohibitive installation)
7. Disco (separate framework; we are evaluating Hadoop)
8. Hadoopy / dumbo (similar to mrjob)
9. Hipy (effectively an ORM for Hive)
20. Research: Native Java
Benchmark for comparison, using the new Hadoop Java API.
[Code: BudgetAgg.java Mapper and Reducer classes]
21. Research: Native Java
USABILITY:
› Not straightforward for analysts to implement, launch, or tweak
PERFORMANCE:
› Fastest implementation
› Can be further enhanced by overriding comparators for grouping and sorting
22. Research: Native Java
VERSATILITY / FLEXIBILITY:
› Ability to customize pretty much everything
› Custom Partitioner, Comparator, and Grouping Comparator in our implementation
› Can use complex objects as keys or values
24. Research: Streaming
USABILITY:
› Key/value detection has to be done by the user
› Still straightforward for relatively simple jobs
hadoop jar /usr/lib/hadoop-0.23.0-mr1-cdh4b1/contrib/streaming/hadoop-*streaming*.jar \
  -D stream.num.map.output.key.fields=4 \
  -D num.key.fields.for.partition=3 \
  -D mapred.reduce.tasks=35 \
  -file mapper.py \
  -mapper mapper.py \
  -file reducer_nongroup.py \
  -reducer reducer_nongroup.py \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -input /logs/log_budget/v002/2013/03/06/19/ \
  -output bidder_logs/streaming_output
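The mapper and reducer named in the command could look like the following minimal sketch (a hypothetical field layout; not the exact scripts from the benchmark). The mapper emits a 4-field key so the KeyFieldBasedPartitioner can partition on the first 3 fields:

#!/usr/bin/env python
# mapper.py: emit (object_type, object_id, method, auction_id) as the key
# and value as the payload, tab-delimited for Hadoop Streaming.
import sys

for line in sys.stdin:
    ts, auction_id, obj_type, obj_id, method, value = line.rstrip('\n').split('\t')
    print('\t'.join([obj_type, obj_id, method, auction_id, value]))

#!/usr/bin/env python
# reducer_nongroup.py: input arrives sorted by the full 4-field key, so
# de-duplication and aggregation happen in a single line-by-line pass.
import sys

prev_group, prev_key, total = None, None, 0.0

def flush(group, total):
    if group is not None:
        print('\t'.join(list(group) + [str(total)]))

for line in sys.stdin:
    obj_type, obj_id, method, auction_id, value = line.rstrip('\n').split('\t')
    group = (obj_type, obj_id, method)
    key = group + (auction_id,)
    if key == prev_key:
        continue                  # duplicate record for the same auction
    if group != prev_group:
        flush(prev_group, total)  # emit the completed aggregation group
        prev_group, total = group, 0.0
    prev_key = key
    total += float(value)

flush(prev_group, total)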
25. Research: Streaming
PERFORMANCE:
› ~50% slower than Java
VERSATILITY / FLEXIBILITY:
› Inputs in the reducer are iterated line-by-line
› Straightforward to get de-duplication and aggregation to work in a single step
28. Research: mrjob
PERFORMANCE:
› Involving objects or multiple steps slows it down a lot
VERSATILITY / FLEXIBILITY:
› Can define Input / Internal / Output protocols
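For reference, a minimal mrjob sketch of the budget aggregation (our assumption of the field layout, using mrjob's MRStep API; protocols left at their defaults):

from mrjob.job import MRJob
from mrjob.step import MRStep

class BudgetAgg(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_dedup, reducer=self.reducer_dedup),
            MRStep(reducer=self.reducer_sum),
        ]

    def mapper_dedup(self, _, line):
        # Step 1 key: duplicate records for one auction collapse together.
        ts, auction_id, obj_type, obj_id, method, value = line.split('\t')
        yield (obj_type, obj_id, method, auction_id), float(value)

    def reducer_dedup(self, key, values):
        obj_type, obj_id, method, _auction_id = key
        # Duplicates carry the same value, so keep a single one.
        yield (obj_type, obj_id, method), next(iter(values))

    def reducer_sum(self, key, values):
        # Step 2: aggregate value per (object_type, object_id, method).
        yield key, sum(values)

if __name__ == '__main__':
    BudgetAgg.run()

Defining raw or pickle protocols, as the slide notes, is the usual way to claw back some of the default JSON serialization overhead between steps.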
29. Research: Happy / Jython
HAPPY:
› Full access to Java MapReduce API
› Happy project is deprecated
› Depends on Hadoop 0.17
JYTHON:
› Doesn’t work easily out of the box
› Relies on deprecated Jython compiler in Jython 2.2
› Limited to Jython implementation of Python
› Numpy/SciPy and Pandas unavailable
30. Research: PyCascading
A Python wrapper around the Cascading framework for data-processing workflows.
Uses Jython as the high-level language for defining workflows.
31. Research: PyCascading
USABILITY:
› Relatively new project
› Cascading API is simple and intuitive
› Job Planner abstracts details of MapReduce
PERFORMANCE:
› Abstraction makes performance tuning challenging
› Does not support Combiner operation
› Dev time was fast, runtime was slow
33. Research: Pig
Provides a high-level language for data analysis
which is compiled into a sequence of MapReduce
operations.
34. Research: Pig
USABILITY:
› Powerful debugging and optimization tools (e.g. explain, illustrate)
› Automatically optimizes MapReduce operations:
› Applies Combiner operations where applicable
› Reorders and conflates data flow for efficiency
35. Research: Pig
PERFORMANCE:
› Pig compiler produces performant code
› Complex operations might require manual optimization
› Budget aggregation required the implementation of a User Defined Function (UDF) in Jython to eliminate an unnecessary MapReduce step (sketched below)
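As an illustration, such a Jython UDF might look like the sketch below (hypothetical name and schema; a script like this is registered in Pig Latin via REGISTER 'udfs.py' USING jython AS udfs):

# udfs.py: pick a single value per auction from a grouped bag so that
# de-duplication happens inside one GROUP BY instead of a separate pass.
@outputSchema('value:double')
def first_value(bag):
    # Duplicate records for the same auction carry the same value,
    # so returning any one element of the bag is enough.
    for t in bag:
        return t.get(0)
    return None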
36. Research: Pig
VERSATILITY / FLEXIBILITY:
USING PIG + JYTHON UDF
› Pig Latin is expressive and can capture most use cases
› Define custom data operations in Jython, called UDFs
› UDFs can implement custom loaders, partitioners, and other advanced features
37. Research: Summary
Running Time / Lines of Code for Implementations
[Chart: running time in minutes and lines of code (0-300 scale) for each implementation: Java, Streaming, MRJob, PyCascading, and Pig.]
38. Research: Recommendations
• Pig and PyCascading enable complex
pipelines to be expressed simply
• Pig is more mature and the most viable
option for ad-hoc analysis