2. Who am I?
My name is James Grant (james@queeg.org).
I'm a developer here at Brandwatch.
For the last three years I've been a Data
Engineer at Last.fm and the maintainer of their
Hadoop Cluster.
3. Coming up…
● What happens during MapReduce?
● Plays and Reach from music listening data
● The Mapper pseudo code
● The Reducer pseudo code
● The result
● What if…?
4. What happens during MapReduce?
[Diagram: the input data is split into fragments, and each fragment is processed by a mapper (Map). The map output is sorted and passed in fragments to the reducers (Reduce), which write the output.]
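The flow in the diagram can be sketched in a single process. This is a minimal illustration, not how Hadoop is implemented: `run_mapreduce`, `count_mapper` and `sum_reducer` are hypothetical names, the sample rows are made up, and a real cluster runs the map, sort and reduce steps across many machines.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer):
    """Simulate map -> sort -> reduce in one process (illustration only)."""
    # Map: each input record may yield any number of (key, value) pairs.
    mapped = [pair for record in records for pair in mapper(record)]
    # Sort: order by key so that equal keys end up next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce: call the reducer once per key with an iterator over its values.
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        values = (value for _, value in group)
        output.extend(reducer(key, values))
    return output


def count_mapper(record):
    # Hypothetical mapper: emit the song id with a count of one.
    user, song = record
    yield song, 1


def sum_reducer(song, ones):
    # Hypothetical reducer: add up the counts for one song.
    yield song, sum(ones)


# Made-up (user id, song id) rows.
rows = [(1, 10), (2, 10), (1, 20), (1, 10)]
play_counts = run_mapreduce(rows, count_mapper, sum_reducer)
```

The sort step is what guarantees that a reducer sees every value for a key in one call.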
5. Plays and Reach from music listening data
● Plays - The number of times a song has
been played
● Reach - The number of unique listeners to
a song
● Similar to hits and uniques for web
properties
● Input data has columns for user id and song
id (amongst others)
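To make the two definitions concrete, here is a small sketch that computes both figures directly, without MapReduce. The function name `plays_and_reach` and the sample rows are assumptions for illustration; the real input has more columns than user id and song id.

```python
from collections import defaultdict


def plays_and_reach(rows):
    """Return {song_id: (plays, reach)} for an iterable of (user_id, song_id)."""
    plays = defaultdict(int)      # song id -> number of plays
    listeners = defaultdict(set)  # song id -> set of distinct user ids
    for user_id, song_id in rows:
        plays[song_id] += 1
        listeners[song_id].add(user_id)
    return {song: (plays[song], len(listeners[song])) for song in plays}


# Made-up data: user 1 plays song 10 twice, user 2 plays it once.
sample_rows = [(1, 10), (1, 10), (2, 10), (3, 20)]
stats = plays_and_reach(sample_rows)
```

For song 10 this gives three plays but a reach of two, because the repeat listen by user 1 adds a play without adding a listener.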
7. The Reducer
function reduce(Integer song, Iterator users):
    Integer plays = 0;
    Set uniqueUsers = [];
    foreach user in users:
        increment plays;
        if user not within uniqueUsers:
            uniqueUsers.add(user);
    result.plays = plays;
    result.reach = uniqueUsers.cardinality();
    emit(song, result);
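The reducer pseudo code translates almost line for line into runnable Python. This is a sketch rather than Hadoop code: `reduce_song` is a hypothetical name, it returns its result instead of calling `emit`, and it assumes the mapper has emitted one (song, user) pair per play so that `users` holds one entry per listen.

```python
def reduce_song(song, users):
    """Count plays and unique listeners for one song."""
    plays = 0
    unique_users = set()
    for user in users:
        plays += 1
        # Adding to a set is a no-op for a user already seen,
        # so no explicit membership check is needed.
        unique_users.add(user)
    return song, {"plays": plays, "reach": len(unique_users)}


# Made-up input: user 1 listened twice, user 2 once.
result = reduce_song(10, iter([1, 2, 1]))
```

Note that the set holds every distinct listener of a song in memory at once, which matters for very popular songs.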
8. What if…?
You often hear that in nearly all cases you
should use a higher-level tool like Pig or Hive to
solve problems.
So what does the Pig script look like for this
problem?
9. Using Pig
subs = LOAD 'submissions.tsv' USING PigStorage()
    AS (user:int, song:int);
songs = GROUP subs BY song;
songs = FOREACH songs GENERATE group AS song, subs.user;
songs = FOREACH songs GENERATE
    song, COUNT($1.user), COUNT(Distinct($1.user));
STORE songs INTO 'playsreach.tsv';