2. Overview
1. Intro to parallel computing
• Algorithms
• Programming model
• Applications
2. Intro to MapReduce
• History
• (in)applicability
• Examples
• Execution overview
3. Writing MapReduce jobs with Disco
• Disco & DDFS
• Python
• Your first disco job
• Disco @ SpilGames
4. CDN log processing
• Architecture
• Availability & Performance monitoring
• Steps to get to our Disco landscape
4. Serial computations
Traditionally (von Neumann model), software has been written for serial computation:
• To be run on a single computer having a single CPU
• A problem is broken into a discrete series of instructions
• Instructions are executed one after another
• Only one instruction may execute at any moment in time
5. Design of efficient algorithms
A parallel computer is of little use unless efficient parallel algorithms are available
• The issues in designing parallel algorithms are very different from those in designing their sequential counterparts
• A significant amount of work is being done to develop efficient parallel algorithms for a variety of parallel architectures
7. Parallel computations
Parallel computing is the simultaneous use of multiple computing resources to solve a computational problem:
• To be run using multiple CPUs
• A problem is broken down into discrete parts that can be solved concurrently
• Each part is further broken down into a series of instructions
• Instructions from each part execute simultaneously on different CPUs
9. Programming Model
• Description
• The mental model the programmer has about the detailed execution of their applications
• Purpose
• Improve programmer productivity
• Evaluation
• Expression
• Simplicity
• Performance
10. Parallel Programming Models
• Message passing
• Independent tasks encapsulating local data
• Tasks interact by exchanging messages
• Shared memory
• Tasks share a common address space
• Tasks interact by reading and writing this space asynchronously
• Data parallelization
• Tasks execute a sequence of independent operations
• Data usually evenly partitioned across tasks
• Also referred to as “Embarrassingly parallel”
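The "embarrassingly parallel" model above is the easiest to picture in code. A minimal Python sketch using the standard multiprocessing module (an added illustration, not part of the original slides):

from multiprocessing import Pool

def work(chunk):
    # independent operation on one partition of the data
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = range(100000)
    chunks = [data[i::4] for i in range(4)]   # evenly partition the data across tasks
    pool = Pool(processes=4)
    print(sum(pool.map(work, chunks)))        # tasks never need to talk to each other
    pool.close()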
11. Applications (Scientific)
• Historically used for large scale problems in science and engineering
• Physics – applied, nuclear, particle, fusion, photonics
• Bioscience, Biotechnology, Genetics, Sequencing
• Chemistry, Molecular sciences
• Mechanical Engineering – from prosthetics to spacecraft
• Electrical Engineering, Circuit Design, Microelectronics
• Computer Science, Mathematics
12. Applications (Commercial)
• Commercial applications also provide a driving force in parallel computing. These applications require the processing of large amounts of data
• Databases, data mining
• Oil exploration
• Web search engines, web-based business services
• Medical imaging and diagnosis
• Pharmaceutical design
• Management of national and multi-national corporations
• Financial and economic modeling
• Advanced graphics & VR
• Networked video and multi-media technologies
13. What if my job is too “big”?
• Parallelize
• Distribute
• Problems?
• Concurrency problems
• Coordination
• Scalability
• Fault Tolerance
14. Microsoft: MSN search group: DRYAD
• Application is modeled as a Directed Acyclic Graph
• The DAG defines the dataflow
• Computational vertices
• Vertices of the graph define the operation on data
• Channels
• File
• TCP pipe
• SHM FIFO
• Not as restrictive as MapReduce
• Multiple inputs and outputs
• Allows developers to define communication between vertices
15.
“A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.”
Google
Dean and Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Google Inc.
17. What is MapReduce?
I have a question which a data set can answer. I have lots of data and I have a cluster of nodes. MapReduce is a parallel framework which takes advantage of my cluster by distributing the work across each node. Specifically, MapReduce maps data in the form of key-value pairs, which are then partitioned into buckets. The buckets can be spread easily over all the nodes in the cluster, and each node, or Reducer, reduces the data to an “answer” or a list of “answers”.
19. MapReduce history
• Published in 2004 by Google
• Functional programming (e.g. Lisp, Erlang)
• map() function
• Applies a function to each value of a sequence
• reduce() function (fold())
• Combines all elements of a sequence using a binary operator
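To make the functional-programming roots concrete, a minimal Python illustration of map() and reduce()/fold() (an added example, not part of the original slides):

from functools import reduce

# map(): apply a function to each value of a sequence
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))    # [1, 4, 9, 16]

# reduce()/fold(): combine all elements of a sequence with a binary operator
total = reduce(lambda a, b: a + b, squares)            # 30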
21. Why NOT MapReduce?
• Restrictive semantics
• Pipelining Map/Reduce stages can be inefficient
• Solves problems within a narrow programming domain well
• DB community: our parallel RDBMSs have been doing this forever…
• Data scale matters: use MapReduce if you truly have large data sets that are difficult to process using simpler solutions
• It’s not always a high-performance solution. Straight Python, simple batch-scheduled Python, and a C core can all outperform MR by an order of magnitude or two on a single node for many problems, even for so-called big-data problems
22. What is it good for?
• Distributed grep, sort, word frequency
• Inverted index construction
• PageRank
• Web link-graph traversal
• Large-scale PDF generation, image conversion
• Artificial Intelligence, Machine Learning
• Geographical data, Google Maps
• Log querying
• Statistical Machine Translation
• Analyzing similarities of user behavior
• Processing clickstream and demographic data
• Research for Ad systems
• Vertical search engine for trustworthy wine information
23. Flavors of MapReduce
• Google (proprietary implementation in C++)
• Hadoop (open source implementation in Java)
• Disco (Erlang, Python)
• Skynet (Ruby)
• BashReduce (last.fm)
• Spark (Scala, functional OO language on the JVM)
• Plasma MapReduce (OCaml)
• Storm (the Hadoop of realtime processing)
cat a_bunch_of_files | ./mapper.py | sort | ./reducer.py
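The shell pipeline above is the simplest flavor of all. As an illustration, a word-count pair that would plug into that pipeline (hypothetical mapper.py and reducer.py, not from the slides):

#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)

#!/usr/bin/env python
# reducer.py: sum the counts per word; relies on the input being sorted by key
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))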
24. The MR programming model
• Process data using special map() and reduce() functions
• The map() function is called on every item in the input and emits a series of intermediate key/value pairs
• All values associated with a given key are grouped together
• The reduce() function is called on every unique key, and its value list, and emits a value that is added to the output
25. The MR programming model
• More formally
• Map(k1, v1) -> list(k2, v2)
• Reduce(k2, list(v2)) -> list(v2)
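Those two signatures are the whole contract. A toy, single-process Python sketch of an engine that enforces it (an added illustration, not Disco's API):

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: Map(k1, v1) -> list(k2, v2); group all values by key
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce phase: Reduce(k2, list(v2)) -> list(v2)
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}

With a map_fn that emits (word, 1) pairs and a reduce_fn that returns [sum(values)], this reproduces the word-count example used later in the deck.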
26. MapReduce benefits
• Greatly reduces parallel programming complexity
• Reduces synchronization complexity
• Automatically partitions data
• Provides failure transparency
• Practical
• Hundreds of jobs every day
27. The MR runtime system
• Partitions input data
• Schedules execution across a set of machines
• Handles machine failure
• Manages IPC
28. MR Examples
• Distributed grep
• Map function emits <word, line_number> if a word matches the search criteria
• Reduce function is the identity function
• URL access frequency
• Map function processes web logs, emits <url, 1>
• Reduce function sums values, emits <url, total>
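For instance, the URL-access-frequency example could look roughly like this as Disco-style map/reduce functions (a sketch assuming each input record is one web-log line whose first field is the URL):

def url_map(line, params):
    # assumption: the URL is the first whitespace-separated field of the log line
    url = line.split()[0]
    yield url, 1

def url_reduce(iter, params):
    # sum the counts for each unique URL and emit <url, total>
    from disco.util import kvgroup
    for url, counts in kvgroup(sorted(iter)):
        yield url, sum(counts)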
29. MR Examples
• Geospatial query processing
• Given an intersection, find all roads connecting to it
• Rendering the tiles in the map
• Finding the nearest feature to a given address
30. MR Examples
• “Learning the right abstraction will simplify your life.” – Travis Oliphant

Program | Map() emits | Reduce() emits
Distributed grep | matched lines | pass
Reverse web-link graph | <target, source> | <target, list(src)>
URL count | <url, 1> | <url, total_count>
Term-vector per host | <hostname, term-vector> | <hostname, all-term-vector>
Inverted index | <word, doc_id> | <word, list(doc_id)>
Distributed sort | <key, value> | pass
31. MR Execution 1/8
• The user program, via the MR library, shards the input data
32. MR Execution 2/8
• The user program creates process copies (workers) distributed on a machine cluster
• One copy will be the “Master” and the others will be worker threads
33. MR Execution 3/8
• The master distributes M map and R reduce tasks to idle workers
• M == number of shards
• R == the key space is divided into R parts
34. MR Execution 4/8
• Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs
• Output is buffered in RAM
35. MR Execution 5/8
• Each worker flushes intermediate values, partitioned into R regions, to disk and notifies the Master process
36. MR Execution 6/8
• The Master process gives the disk locations to an available reduce-task worker, which reads all associated intermediate data
37. MR Execution 7/8
• Each reduce-task worker sorts its intermediate data, then calls the reduce() function, passing unique keys and their associated values
• Reduce function output is appended to the reduce-task’s partition output file
38. MR Execution 8/8
• The Master process wakes up the user process when all tasks have completed
• Output is contained in R output files
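Pulling steps 3 through 8 together, a toy single-process sketch of the flow, with hash-based partitioning into R regions and in-memory lists standing in for the intermediate and output files (an added illustration, not the real runtime):

from collections import defaultdict

def execute(shards, map_fn, reduce_fn, R):
    # Map phase: one map task per shard (M == number of shards),
    # intermediate output partitioned into R regions by hashing the key
    regions = [defaultdict(list) for _ in range(R)]
    for shard in shards:
        for record in shard:
            for key, value in map_fn(record):
                regions[hash(key) % R][key].append(value)

    # Reduce phase: one reduce task per region, keys processed in sorted order;
    # each region's result list stands in for one of the R output files
    return [[(k, reduce_fn(k, region[k])) for k in sorted(region)]
            for region in regions]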
39. Hot spots
• An input reader
• A map() function
• A partition function
• A compare function (sort)
• A reduce() function
• An output writer
41. MR Execution Overview
• Fault Tolerance
• The Master process periodically pings workers
• Map-task failure
– Re-execute
» All output was stored locally
• Reduce-task failure
– Only re-execute partially completed tasks
» All output is stored in the global file system
42. Distributed File System
• Don’t move data to workers… Move workers to the data!
• Store data on the local disks of nodes in the cluster
• Start up the workers on the node that holds the data locally
• Why?
• Not enough RAM to hold all the data in memory
• Disk access is slow, but disk throughput is good
• A distributed file system is the answer
• GFS (Google File System) (= Big File System)
• HDFS (Hadoop DFS) = GFS clone
• DDFS (Disco DFS)
43. Summary for Part I.
• Sequential -> Parallel -> Distributed
• Hype after Google published the paper in 2004
• A very narrow set of problems
• Big data is a marketing buzzword
44. Summary for Part I (cont.)
• MapReduce is a paradigm for distributed computing developed (patented…) by Google for performing analysis on large amounts of data distributed across thousands of commodity computers
• The Map phase processes the input one element at a time and returns a (key, value) pair for each element
• An optional Partition step partitions Map results into groups based on a partition function on the key
• The engine merges the partitions and sorts all the map results
• The merged results are passed to the Reduce phase. One or more reduce jobs reduce the (key, value) pairs to produce the final results
46. Take a deep breath
• Writing MapReduce jobs can be VERY time consuming
• MapReduce patterns
• Debugging a failure is a nightmare
• Large clusters require a dedicated team to keep them running
• Writing a Disco job becomes a software engineering task
• …rather than a data analysis task
48. About Disco
• “Massive data – Minimal code” – by Nokia Research Center
• http://discoproject.org
• Written in Erlang
• Orchestrating control
• Robust, fault-tolerant distributed applications
• Python for operating on data
• Easy to learn
• Complex algorithms with very little code
• Utilize favorite Python libraries
• The complexity is hidden, but…
49. Disco Distributed “filesystem”
• Distributed
• Increase storage capacity by adding nodes
• Processing on nodes without transferring data
• Replicated
• Chunked: data stored in gzip-compressed chunks
• Tag based
• Attributes
• CLI
$ ddfs ls data:log
$ ddfs chunk data:bigtxt ./bigtxt
$ ddfs blobs data:bigtxt
$ ddfs xcat data:bigtxt
50. Sandbox environment
• Everything is preinstalled
• Disco localhost setup: https://github.com/spilgames/disco-development-workflow
51. Python – What you’ll need
• www.pythonforbeginners.com – by Magnus
• Import
• Data structures: {} dict, [] list, () tuple
• Defining functions and classes
• Control flow primitives and structures: for, if, …
• Exception handling
• Regular expressions
• GeoIP, MySQLdb, …
• To understand what yield does, you must understand what generators are. And before generators come iterables.
52. Python Lists
When you create a list, you can read its items one by one, and it’s called iteration:
>>> mylist = [1, 2, 3]
>>> for i in mylist:
...     print i
1
2
3
53. Python Iterables
mylist is an iterable. When you use a list comprehension, you create a list, and so an iterable:
>>> mylist = [x*x for x in range(3)]
>>> for i in mylist:
...     print i
0
1
4
54. Python Generators
Generators are iterables, but you can only read them once. That is because they do not store all the values in memory; they generate the values on the fly:
>>> mygenerator = (x*x for x in range(3))
>>> for i in mygenerator:
...     print i
0
1
4
It is just the same, except you used () instead of []. But you cannot iterate over mygenerator a second time, since generators can only be used once: they calculate 0, then forget about it and calculate 1, and finish by calculating 4, one by one.
55. Python Yield
yield is a keyword that is used like return, except the function will return a generator.
>>> def createGenerator():
...     mylist = range(3)
...     for i in mylist:
...         yield i*i
...
>>> mygenerator = createGenerator()
>>> print mygenerator
<generator object createGenerator at 0xb7555c34>
>>> for i in mygenerator:
...     print i
0
1
4
56. Your first disco job
• What is the total count for each unique word in the text?
• Word counting is the Hello World! of MapReduce
• We need to write map() and reduce() functions
• Map(rec) -> list(k, v)
• Reduce(k, v) -> list(res)
• Your application communicates with the Disco API
• from disco.core import Job, result_iterator
57. Word count
• Splitting the file (related chunks) into lines
• Map(line, params)
• Split the line into words
• Emit a k,v tuple: <word, 1>
• Reduce(iter, params)
• Often, this is an algebraic expression
• <word, [1,1,1]> -> <word, 3>
58. Word count: Your application
• Modules to import
• Setting the master host
• DDFS
• Job()
• result_iterator(Job.wait())
• Job.purge()
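Assembled from the pieces on the next three slides, the whole application file could look roughly like this (a sketch: the DDFS tag name is a placeholder, and the master host is left to Disco's default settings):

from disco.core import Job, result_iterator

def fun_map(line, params):
    for word in line.split():
        yield word, 1

def fun_reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == "__main__":
    # input: DDFS tags (tag://...) or plain file/http URLs; "data:bigtxt" is a placeholder
    job = Job().run(input=["tag://data:bigtxt"], map=fun_map, reduce=fun_reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print (word, count)
    job.purge()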
59. Word count: Your map
def fun_map(line, params):
    for word in line.split():
        yield word, 1
60. Word count: Your reduce
def fun_reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

Built-in: disco.worker.classic.func.sum_reduce()
61. Word count: Your results
job = Job().run(input=…, map=fun_map, reduce=fun_reduce)
for word, count in result_iterator(job.wait(show=True)):
    print (word, count)
job.purge()
62. Word count: More advanced
class MyJob1(Job):
    @classmethod
    def map(self, data, params):
        …
    @classmethod
    def reduce(self, iter, params):
        …
…
MyJob2.run(input=MyJob1.wait())  # <- Job chaining
63. Disco @ SpilGames
• Event Tracking & Advertising related jobs
• Heatmap: page clicks -> 2D density distributions
• Reconstructing sessions
• Ad research
• Behavioral modeling
• Log crunching
• Gameplays per country
• Frontend performance (CDN)
• 404s, response code tracking
• Intrusion detection #security
64. Disco @ SpilGames
• Calculate your resource need estimates
• Deploy in workflow
• We have
• Git
• Package repository / Deployment Orchestration
• Disco-tools: http://github.com/spilgames/disco-tools/
• Job runner: http://jobrunner/
• Data warehouse
• Interactive, graphical report generation
67. CDN Availability monitoring
• Question
• Availability of each CDN provider
• Data source
• Javascript sampler on the client side
• LoadBalancer -> HA logging endpoints -> Access logs -> Disco Distributed FS
72. CDN Availability monitoring (reduce)
Availability of <hw, [1,1,1,0,1,1,1,0,1,1,0,1]>
• kvgroup(iter)
• The trick:
• samples = […]
• len(samples) -> number of all samples
• sum(samples) -> number of available samples
• A = sum()/len() * 100.0
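Put together, the availability reduce could look roughly like this (a sketch following the trick above; the key is the provider identifier and the values are the 0/1 samples):

def availability_reduce(iter, params):
    # <provider, [1,1,0,1,...]> -> <provider, availability in %>
    from disco.util import kvgroup
    for provider, samples in kvgroup(sorted(iter)):
        samples = list(samples)
        # sum(samples): successful probes, len(samples): all probes
        yield provider, 100.0 * sum(samples) / len(samples)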
76. CDN Performance
• Question
• 95th percentile of response times per CDN per country
• Data source
• Javascript sampler on the client side
• LB -> HA Logging endpoints -> Access logs -> DDFS
• Input
• /res.ext?v=o,1234|l,2345|c,3456&ipaddr=127.0.0.1
• Expected output
• ProviderN CountryA: 3891 ms CountryB: 1198 ms …
• ProviderC CountryA: 3793 ms CountryB: 1397 ms …
• ProviderE CountryA: 3676 ms CountryB: 1676 ms …
• ProviderL CountryA: 4332 ms CountryB: 1233 ms …
77. The 95th percentile
A 95th percentile says that 95% of the time data points are below that
value and 5% of the time they are above that value.
95 is a magic number used in networking because you have to plan for
the most-of-the-time case.
78. CDN Performance (map)
v=o,1234|l,2345|c,3456&ipaddr=127.0.0.1
• Line parsing is about the same
• Advanced key: <cdn:country, performance>
• How to get the country from an IP?
• Job().run(…required_modules=["GeoIP"]…)
• No global variables within map()! – Why?
• Use Job().run(…params={}…) instead
• yield "%s:%s" % (cdnName, country), cdnPerf
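A sketch of what that map could look like. The meanings of the sampler fields (assuming l carries the measured response time in ms), the regex, and the way the CDN name travels in params are all assumptions added for illustration:

def perf_map(line, params):
    # GeoIP is shipped to the nodes via Job().run(required_modules=["GeoIP"], ...)
    import re
    import GeoIP
    # hypothetical line format: /res.ext?v=o,1234|l,2345|c,3456&ipaddr=127.0.0.1
    m = re.search(r"v=([^&]+)&ipaddr=([\d.]+)", line)
    if not m:
        return
    fields = dict(kv.split(",", 1) for kv in m.group(1).split("|"))
    cdn_perf = int(fields.get("l", 0))          # assumed response-time field
    cdn_name = params["cdnName"]                # passed in via params, no globals
    geo = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)   # opened per call only for brevity
    country = geo.country_code_by_addr(m.group(2))
    yield "%s:%s" % (cdn_name, country), cdn_perf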
79. CDN Performance (reduce)
# <hw, [123, 234, 345, 456, 567, 678, 798]>
def percentile(N, percent, key=lambda x: x):
    import math
    if not N:
        return None
    k = (len(N) - 1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return key(N[int(k)])
    d0 = key(N[int(f)]) * (c - k)
    d1 = key(N[int(c)]) * (k - f)
    return d0 + d1
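The function above linearly interpolates a percentile from a sorted sample list. A sketch of wiring it into the reduce (assuming percentile is made available on the workers, e.g. defined alongside the reduce or shipped with the job):

def perf_reduce(iter, params):
    from disco.util import kvgroup
    for key, samples in kvgroup(sorted(iter)):
        # key is "cdn:country"; emit its 95th-percentile response time
        yield key, percentile(sorted(samples), 0.95)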
80. Other goodies
• Outputs
• Print to screen
• Write to a file
• Write to DDFS – Why not?
• Another MR job with chaining
• Email it
• Write to MySQL
• Write to Vertica
• Zip and upload to Spil OOSS
81. Steps to get to our Disco landscape
1. Question & Data source
• Javascript code
• Nginx endpoint
• Logrotate
• (de-personalize)
• DDFS load scripts
2. MR jobs
3. Jobrunner jobs
4. Present your results
82. Bad habits
• Editing on live servers
• No version control
• No staging environment
• Not using a deployment mechanism
• Not using Continuous Integration
• Poor parsing
• No redundancy for MC applications
• Not purging your job
• Not documenting your job
• Using hard-coded configuration inside MR code
83. Bad habits cont.
• No peer review
• Not getting back events from slaves
• Using job.wait()
• Job().run(partitions=1)
84. Summary
• Writing Disco jobs can be easy
• Finding the right abstraction for a problem is not…
• A framework is on the way -> DRY
• You can find a lot of good patterns in SET and other jobs
You successfully took a step toward understanding how to
• Process large amounts of data
• Solve some specific problems with MR
85. Bonus: Outlook
• Ecosystems
• DiscoDB: lightning-fast key->value mapping
• Discodex: disco + ddfs + discodb
• Disco vs. Hadoop
• HDFS, Hadoop ecosystem
• NoSQL result stores