This talk was prepared for the November 2013 DataPhilly Meetup: Data in Practice ( http://www.meetup.com/DataPhilly/events/149515412/ )
Map Reduce: Beyond Word Count by Jeff Patti
Have you ever wondered what map reduce can be used for beyond the word count example you see in all the introductory articles about map reduce? Using Python and mrjob, this talk will cover a few simple map reduce algorithms that, in part, power Monetate's information pipeline.
Bio: Jeff Patti is a backend engineer at Monetate with a passion for algorithms, big data, and long walks on the beach. Prior to working at Monetate he performed software R&D for Lockheed Martin, where he worked on projects ranging from social network analysis to robotics.
2. What is MapReduce?
“MapReduce is a programming model for processing large
data sets with a parallel, distributed algorithm on a cluster.”
- Wikipedia
Map - given a line of a file, yield key: value pairs
Reduce - given a key and all values with that key from the
prior map phase, yield key: value pairs
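To make the two phases concrete, here is a minimal single-process sketch of the model. It is only a toy stand-in for what a cluster does, and the names run_mapreduce, length_mapper, and count_reducer are made up for illustration; they are not part of Hadoop or mrjob.

```python
from collections import defaultdict


def run_mapreduce(lines, mapper, reducer):
    """Toy, in-memory illustration of map -> group by key -> reduce."""
    grouped = defaultdict(list)
    # Map phase: each input line yields zero or more (key, value) pairs.
    for line in lines:
        for key, value in mapper(line):
            # Grouping step: collect every value under its key.
            grouped[key].append(value)
    # Reduce phase: each key, with all of its values, yields output pairs.
    output = []
    for key in sorted(grouped):
        output.extend(reducer(key, grouped[key]))
    return output


def length_mapper(line):
    # Hypothetical example: key each line by its length.
    yield len(line), 1


def count_reducer(key, values):
    yield key, sum(values)


result = run_mapreduce(["ab", "cd", "xyz"], length_mapper, count_reducer)
print(result)
```

Two lines have length 2 and one has length 3, so the reducer sees the key 2 with values [1, 1] and the key 3 with values [1].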
4. Word Count Using mrjob
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, key, line):
        for word in line.split():
            yield word, 1
    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCount.run()
6. Monetate Background
● Core products are merchandising,
personalization, testing, etc.
● A/B & Multivariate testing to determine
impact of experiments
● Involved with >20% of ecommerce spend
each holiday season for the past 2 years
running
7. Monetate Stack
● Distributed across multiple availability zones
and regions for redundancy, scaling, and
lower round trip times
● Real time decision engine using MySQL
● Nightly processing of each day's data via
Hadoop using mrjob, a Python library for
writing MapReduce jobs
8. Beyond Word Count
● Activity stream sessionization
● Product recommendations
● User behavior statistics
9. Activity Stream Sessionization
Goal: collate user activity, splitting it into separate
sessions whenever the user is inactive for more than 5
minutes
Input format: timestamp, user_id
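One way this goal could be approached is sketched below in plain Python, with the mapper and reducer written as ordinary generator functions. This is an illustration under assumptions, not the code from the talk: it assumes each input line is "timestamp,user_id" with an integer Unix timestamp, and the sessionize helper is a hypothetical local stand-in for the grouping a MapReduce framework performs between the two phases.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 5 * 60  # inactivity threshold from the stated goal


def mapper(line):
    # Assumed input line: "timestamp,user_id" with an integer Unix timestamp.
    timestamp, user_id = line.split(",")
    yield user_id.strip(), int(timestamp)


def reducer(user_id, timestamps):
    # Sort the user's events, then start a new session at each long gap.
    session = []
    for ts in sorted(timestamps):
        if session and ts - session[-1] > SESSION_GAP_SECONDS:
            yield user_id, session
            session = []
        session.append(ts)
    if session:
        yield user_id, session


def sessionize(lines):
    # Hypothetical local driver: groups mapper output by key, then reduces.
    by_user = defaultdict(list)
    for line in lines:
        for user_id, ts in mapper(line):
            by_user[user_id].append(ts)
    results = []
    for user_id in sorted(by_user):
        results.extend(reducer(user_id, by_user[user_id]))
    return results


events = ["1000,alice", "1100,alice", "1500,alice", "1050,bob"]
sessions = sessionize(events)
print(sessions)
```

Keying on user_id brings all of one user's events to a single reducer call. Sorting inside the reducer keeps the sketch short, but it holds every timestamp for a user in memory at once.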
14. Product Recommendations
Goal: For each product a client sells, generate
a ‘people who bought this also bought this’
recommendation
Input: product_id_1, product_id_2, ...
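A simple co-occurrence count is one plausible way to produce such recommendations, sketched below as two chained map reduce steps in plain Python. It is not necessarily the algorithm presented in the talk. The sketch assumes each input line lists the comma-separated product ids from one purchase; the function names, the limit parameter, and the group and recommend helpers are all hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def pair_mapper(line):
    # Assumed input line: comma-separated product ids from one purchase.
    products = sorted({p.strip() for p in line.split(",") if p.strip()})
    # Emit every ordered pair so each product sees all of its partners.
    for bought, also_bought in permutations(products, 2):
        yield (bought, also_bought), 1


def pair_reducer(pair, counts):
    bought, also_bought = pair
    # Re-key by the first product so a second step can rank its partners.
    yield bought, (sum(counts), also_bought)


def top_reducer(bought, scored, limit=2):
    # Highest co-purchase count first; ties broken by product id.
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    yield bought, [also for _, also in ranked[:limit]]


def group(pairs):
    # Local stand-in for the framework's group-by-key step.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def recommend(lines):
    step1 = [out for line in lines for out in pair_mapper(line)]
    step2 = [out for k, vs in group(step1) for out in pair_reducer(k, vs)]
    return dict(out for k, vs in group(step2) for out in top_reducer(k, vs))


orders = ["a,b,c", "a,b", "b,c"]
recs = recommend(orders)
print(recs)
```

In this sample, a and b were bought together twice and a and c once, so b ranks ahead of c in the recommendations for a. The number of emitted pairs grows quadratically with the size of a purchase, which is worth keeping in mind for very large baskets.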
21. User Behavior Statistics
Goal: compute statistics about user behavior
(conversion rate & time on site) by account and
experiment in an efficient manner
Input:
account_id, campaigns_viewed, user_id, purchased?,
session_start_time, session_end_time
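As a sketch of how a map step might key this input by account and experiment, the function below fans one session record out to every campaign it viewed. The field layout is an assumption, since the slide does not specify it: tab-separated fields, campaigns_viewed as a comma-separated list of ids, purchased as "1" or "0", and start and end times as integer Unix seconds. Time on site is assumed to be the end time minus the start time.

```python
def mapper(line):
    # Assumed layout: tab-separated fields, campaigns_viewed as "id,id,...",
    # purchased as "1"/"0", and times as integer Unix seconds.
    fields = line.rstrip("\n").split("\t")
    account_id, campaigns, user_id, purchased, start, end = fields
    converted = 1 if purchased == "1" else 0
    time_on_site = int(end) - int(start)
    # One record per campaign, so each experiment counts this session once.
    for campaign_id in campaigns.split(","):
        if campaign_id:
            yield (account_id, campaign_id), (converted, time_on_site)


line = "acct1\tc1,c2\tu9\t1\t1000\t1180"
records = list(mapper(line))
print(records)
```

Using the (account_id, campaign_id) pair as the key means a reducer receives all sessions for one experiment of one account together.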
22. Statistics Primer
With sample count, mean, and variance for
each side of an experiment we can compute all
the statistics our analytics package displays
23. Statistics Primer (cont.)
y = a session's metric value, e.g. time on site
● Sample count: count the number of sessions
that viewed the experiment
○ sum(y^0)
● Mean: sum the metric / sample count
○ sum(y^1)/sum(y^0)
24. Statistics Primer (cont.)
● Variance:
○ Variance = mean of square minus square of mean
○ Variance = sum(y^2)/sum(y^0) - (sum(y^1)/sum(y^0))^2
For each side of an experiment we only need to
generate: sum(y^0), sum(y^1), sum(y^2)
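The formulas above can be sketched in a few lines of Python. The helper names to_power_sums, combine, and summarize are illustrative, and the metric values are made up. Note that the formula as written gives the population variance, dividing by the sample count rather than by the count minus one.

```python
def to_power_sums(y):
    # One session's contribution: (y^0, y^1, y^2).
    return (1, y, y * y)


def combine(a, b):
    # Power sums add componentwise, so partial results merge in any order.
    return tuple(x + z for x, z in zip(a, b))


def summarize(power_sums):
    count, total, total_sq = power_sums
    mean = total / count
    # Mean of square minus square of mean.
    variance = total_sq / count - mean ** 2
    return count, mean, variance


values = [2, 4, 4, 6]  # made-up metric values for one side of an experiment
sums = (0, 0, 0)
for y in values:
    sums = combine(sums, to_power_sums(y))
stats = summarize(sums)
print(sums, stats)
```

Because combine is associative and commutative, the three sums can be accumulated piecewise, for example within each mapper's output before the reduce step, and only three numbers per experiment side need to be passed along.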