2. Overview
1. Intro to parallel computing
• Algorithms
• Programming model
• Applications
2. Intro to MapReduce
• History
• (in)applicability
• Examples
• Execution overview
3. Writing MapReduce jobs with Disco
• Disco & DDFS
• Python
• Your first disco job
• Disco @ SpilGames
4. CDN log processing
• Architecture
• Availability & Performance monitoring
• Steps to get to our Disco landscape
4. Serial computations
Traditionally (von Neumann model), software has been written for serial computation:
• To be run on a single computer having a single CPU
• A problem is broken into a discrete series of instructions
• Instructions are executed one after another
• Only one instruction may execute at any moment in time
5. Design of efficient algorithms
A parallel computer is of little use unless efficient parallel algorithms are available
• The issues in designing parallel algorithms are very different from those in designing their sequential counterparts
• A significant amount of work is being done to develop efficient parallel algorithms for a variety of parallel architectures
7. Parallel computations
Parallel computing is the simultaneous use of multiple computing resources to solve a computational problem:
• To be run using multiple CPUs
• A problem is broken down into discrete parts that can be solved concurrently
• Each part is further broken down into a series of instructions
• Instructions from each part execute simultaneously on different CPUs
9. Programming Model
• Description
• The mental model the programmer has about the detailed execution of their applications
• Purpose
• Improve programmer productivity
• Evaluation
• Expression
• Simplicity
• Performance
10. Parallel Programming Models
• Message passing
• Independent tasks encapsulating local data
• Tasks interact by exchanging messages
• Shared memory
• Tasks share a common address space
• Tasks interact by reading and writing this space asynchronously
• Data parallelization
• Tasks execute a sequence of independent operations
• Data usually evenly partitioned across tasks
• Also referred to as “Embarrassingly parallel”
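The "embarrassingly parallel" model above is the easiest to picture in code. A minimal Python sketch using the standard multiprocessing module (an added illustration, not part of the original slides):

from multiprocessing import Pool

def work(chunk):
    # independent operation on one partition of the data
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = range(100000)
    chunks = [data[i::4] for i in range(4)]   # evenly partition the data across tasks
    pool = Pool(processes=4)
    print(sum(pool.map(work, chunks)))        # tasks never need to talk to each other
    pool.close()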
11. Applications (Scientific)
• Historically used for large scale problems in science and engineering
• Physics – applied, nuclear, particle, fusion, photonics
• Bioscience, Biotechnology, Genetics, Sequencing
• Chemistry, Molecular sciences
• Mechanical Engineering – from prosthetics to spacecraft
• Electrical Engineering, Circuit Design, Microelectronics
• Computer Science, Mathematics
12. Applications (Commercial)
• Commercial applications also provide a driving force in parallel computing. These applications require the processing of large amounts of data
• Databases, data mining
• Oil exploration
• Web search engines, web-based business services
• Medical imaging and diagnosis
• Pharmaceutical design
• Management of national and multi-national corporations
• Financial and economic modeling
• Advanced graphics & VR
• Networked video and multi-media technologies
13. What if my job is too “big”?
• Parallelize
• Distribute
• Problems?
• Concurrency problems
• Coordination
• Scalability
• Fault Tolerance
14. Microsoft: MSN search group: DRYAD
• Application is modeled as a Directed Acyclic Graph
• The DAG defines the dataflow
• Computational vertices
• Vertices of the graph define the operation on data
• Channels
• File
• TCP pipe
• SHM FIFO
• Not as restrictive as MapReduce
• Multiple inputs and outputs
• Allows developers to define communication between vertices
15.
“A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.”
Google
Dean and Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Google Inc.
17. What is MapReduce?
I have a question which a data set can answer. I have lots of data and I have a cluster of nodes. MapReduce is a parallel framework which takes advantage of my cluster by distributing the work across each node. Specifically, MapReduce maps data in the form of key-value pairs, which are then partitioned into buckets. The buckets can be spread easily over all the nodes in the cluster, and each node, or Reducer, reduces the data to an “answer” or a list of “answers”.
19. MapReduce history
• Published in 2004 by Google
• Functional programming (e.g. Lisp, Erlang)
• map() function
• Applies a function to each value of a sequence
• reduce() function (fold())
• Combines all elements of a sequence using a binary operator
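To make the functional-programming roots concrete, a minimal Python illustration of map() and reduce()/fold() (an added example, not part of the original slides):

from functools import reduce

# map(): apply a function to each value of a sequence
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))    # [1, 4, 9, 16]

# reduce()/fold(): combine all elements of a sequence with a binary operator
total = reduce(lambda a, b: a + b, squares)            # 30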
21. Why NOT MapReduce?
• Restrictive semantics
• Pipelining Map/Reduce stages can be inefficient
• Solves problems within a narrow programming domain well
• DB community: our parallel RDBMSs have been doing this forever…
• Data scale matters: use MapReduce if you truly have large data sets that are difficult to process using simpler solutions
• It’s not always a high-performance solution. Straight Python, simple batch-scheduled Python, and a C core can all outperform MR by an order of magnitude or two on a single node for many problems, even for so-called big-data problems
22. What is it good for?
• Distributed grep, sort, word frequency
• Inverted index construction
• PageRank
• Web link-graph traversal
• Large-scale PDF generation, image conversion
• Artificial Intelligence, Machine Learning
• Geographical data, Google Maps
• Log querying
• Statistical Machine Translation
• Analyzing similarities of user behavior
• Processing clickstream and demographic data
• Research for Ad systems
• Vertical search engine for trustworthy wine information
23. Flavors of MapReduce
• Google (proprietary implementation in C++)
• Hadoop (open source implementation in Java)
• Disco (Erlang, Python)
• Skynet (Ruby)
• BashReduce (last.fm)
• Spark (Scala, functional OO language on the JVM)
• Plasma MapReduce (OCaml)
• Storm (the Hadoop of realtime processing)
cat a_bunch_of_files | ./mapper.py | sort | ./reducer.py
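The shell pipeline above is the simplest flavor of all. As an illustration, a word-count pair that would plug into that pipeline (hypothetical mapper.py and reducer.py, not from the slides):

#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)

#!/usr/bin/env python
# reducer.py: sum the counts per word; relies on the input being sorted by key
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))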
24. The MR programming model
• Process data using special map() and reduce() functions
• The map() function is called on every item in the input and emits a series of intermediate key/value pairs
• All values associated with a given key are grouped together
• The reduce() function is called on every unique key, and its value list, and emits a value that is added to the output
25. The MR programming model
• More formally
• Map(k1, v1) -> list(k2, v2)
• Reduce(k2, list(v2)) -> list(v2)
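Those two signatures are the whole contract. A toy, single-process Python sketch of an engine that enforces it (an added illustration, not Disco's API):

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: Map(k1, v1) -> list(k2, v2); group all values by key
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce phase: Reduce(k2, list(v2)) -> list(v2)
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}

With a map_fn that emits (word, 1) pairs and a reduce_fn that returns [sum(values)], this reproduces the word-count example used later in the deck.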
26. MapReduce benefits
• Greatly reduces parallel programming complexity
• Reduces synchronization complexity
• Automatically partitions data
• Provides failure transparency
• Practical
• Hundreds of jobs every day
27. The MR runtime system
• Partitions input data
• Schedules execution across a set of machines
• Handles machine failure
• Manages IPC
28. MR Examples
• Distributed grep
• Map function emits <word, line_number> if a word matches the search criteria
• Reduce function is the identity function
• URL access frequency
• Map function processes web logs, emits <url, 1>
• Reduce function sums values, emits <url, total>
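For instance, the URL-access-frequency example could look roughly like this as Disco-style map/reduce functions (a sketch assuming each input record is one web-log line whose first field is the URL):

def url_map(line, params):
    # assumption: the URL is the first whitespace-separated field of the log line
    url = line.split()[0]
    yield url, 1

def url_reduce(iter, params):
    # sum the counts for each unique URL and emit <url, total>
    from disco.util import kvgroup
    for url, counts in kvgroup(sorted(iter)):
        yield url, sum(counts)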
29. MR Examples
• Geospatial query processing
• Given an intersection, find all roads connecting to it
• Rendering the tiles in the map
• Finding the nearest feature to a given address
30. MR Examples
• “Learning the right abstraction will simplify your life.” – Travis Oliphant

Program | Map() emits | Reduce() emits
Distributed grep | matched lines | pass
Reverse web-link graph | <target, source> | <target, list(src)>
URL count | <url, 1> | <url, total_count>
Term-vector per host | <hostname, term-vector> | <hostname, all-term-vector>
Inverted index | <word, doc_id> | <word, list(doc_id)>
Distributed sort | <key, value> | pass
31. MR Execution 1/8
• The user program, via the MR library, shards the input data
32. MR Execution 2/8
• The user program creates process copies (workers) distributed on a machine cluster
• One copy will be the “Master” and the others will be worker threads
33. MR Execution 3/8
• The master distributes M map and R reduce tasks to idle workers
• M == number of shards
• R == the key space is divided into R parts
34. MR Execution 4/8
• Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs
• Output is buffered in RAM
35. MR Execution 5/8
• Each worker flushes intermediate values, partitioned into R regions, to disk and notifies the Master process
36. MR Execution 6/8
• The Master process gives the disk locations to an available reduce-task worker, which reads all associated intermediate data
37. MR Execution 7/8
• Each reduce-task worker sorts its intermediate data, then calls the reduce() function, passing unique keys and their associated values
• Reduce function output is appended to the reduce-task’s partition output file
38. MR Execution 8/8
• The Master process wakes up the user process when all tasks have completed
• Output is contained in R output files
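Pulling steps 3 through 8 together, a toy single-process sketch of the flow, with hash-based partitioning into R regions and in-memory lists standing in for the intermediate and output files (an added illustration, not the real runtime):

from collections import defaultdict

def execute(shards, map_fn, reduce_fn, R):
    # Map phase: one map task per shard (M == number of shards),
    # intermediate output partitioned into R regions by hashing the key
    regions = [defaultdict(list) for _ in range(R)]
    for shard in shards:
        for record in shard:
            for key, value in map_fn(record):
                regions[hash(key) % R][key].append(value)

    # Reduce phase: one reduce task per region, keys processed in sorted order;
    # each region's result list stands in for one of the R output files
    return [[(k, reduce_fn(k, region[k])) for k in sorted(region)]
            for region in regions]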
39. Hot spots
• An input reader
• A map() function
• A partition function
• A compare function (sort)
• A reduce() function
• An output writer
41. MR Execution Overview
• Fault Tolerance
• The Master process periodically pings workers
• Map-task failure
– Re-execute
» All output was stored locally
• Reduce-task failure
– Only re-execute partially completed tasks
» All output is stored in the global file system
42. Distributed File System
• Don’t move data to workers… Move workers to the data!
• Store data on the local disks of nodes in the cluster
• Start up the workers on the node that holds the data locally
• Why?
• Not enough RAM to hold all the data in memory
• Disk access is slow, but disk throughput is good
• A distributed file system is the answer
• GFS (Google File System) (= Big File System)
• HDFS (Hadoop DFS) = GFS clone
• DDFS (Disco DFS)
43. Summary for Part I.
• Sequential -> Parallel -> Distributed
• Hype after Google published the paper in 2004
• A very narrow set of problems
• Big data is a marketing buzzword
44. Summary for Part I (cont.)
• MapReduce is a paradigm for distributed computing developed (patented…) by Google for performing analysis on large amounts of data distributed across thousands of commodity computers
• The Map phase processes the input one element at a time and returns a (key, value) pair for each element
• An optional Partition step partitions Map results into groups based on a partition function on the key
• The engine merges the partitions and sorts all the map results
• The merged results are passed to the Reduce phase. One or more reduce jobs reduce the (key, value) pairs to produce the final results
46. Take a deep breath
• Writing MapReduce jobs can be VERY time consuming
• MapReduce patterns
• Debugging a failure is a nightmare
• Large clusters require a dedicated team to keep them running
• Writing a Disco job becomes a software engineering task
• …rather than a data analysis task
48. About Disco
• “Massive data – Minimal code” – by Nokia Research Center
• http://discoproject.org
• Written in Erlang
• Orchestrating control
• Robust, fault-tolerant distributed applications
• Python for operating on data
• Easy to learn
• Complex algorithms with very little code
• Utilize favorite Python libraries
• The complexity is hidden, but…
49. Disco Distributed “filesystem”
• Distributed
• Increase storage capacity by adding nodes
• Processing on nodes without transferring data
• Replicated
• Chunked: data stored in gzip-compressed chunks
• Tag based
• Attributes
• CLI
$ ddfs ls data:log
$ ddfs chunk data:bigtxt ./bigtxt
$ ddfs blobs data:bigtxt
$ ddfs xcat data:bigtxt
50. Sandbox environment
• Everything is preinstalled
• Disco localhost setup: https://github.com/spilgames/disco-development-workflow
51. Python – What you’ll need
• www.pythonforbeginners.com – by Magnus
• Import
• Data structures: {} dict, [] list, () tuple
• Defining functions and classes
• Control flow primitives and structures: for, if, …
• Exception handling
• Regular expressions
• GeoIP, MySQLdb, …
• To understand what yield does, you must understand what generators are. And before generators come iterables.
52. Python Lists
When you create a list, you can read its items one by one, and it’s called iteration:
>>> mylist = [1, 2, 3]
>>> for i in mylist:
...     print i
1
2
3
53. Python Iterables
mylist is an iterable. When you use a list comprehension, you create a list, and so an iterable:
>>> mylist = [x*x for x in range(3)]
>>> for i in mylist:
...     print i
0
1
4
54. Python Generators
Generators are iterables, but you can only read them once. That is because they do not store all the values in memory; they generate the values on the fly:
>>> mygenerator = (x*x for x in range(3))
>>> for i in mygenerator:
...     print i
0
1
4
It is just the same, except you used () instead of []. But you cannot iterate over mygenerator a second time, since generators can only be used once: they calculate 0, then forget about it and calculate 1, and finish by calculating 4, one by one.
55. Python Yield
yield is a keyword that is used like return, except the function will return a generator.
>>> def createGenerator():
...     mylist = range(3)
...     for i in mylist:
...         yield i*i
...
>>> mygenerator = createGenerator()
>>> print mygenerator
<generator object createGenerator at 0xb7555c34>
>>> for i in mygenerator:
...     print i
0
1
4
56. Your first disco job
• What is the total count for each unique word in the text?
• Word counting is the Hello World! of MapReduce
• We need to write map() and reduce() functions
• Map(rec) -> list(k, v)
• Reduce(k, v) -> list(res)
• Your application communicates with the Disco API
• from disco.core import Job, result_iterator
57. Word count
• Splitting the file (related chunks) into lines
• Map(line, params)
• Split the line into words
• Emit a k,v tuple: <word, 1>
• Reduce(iter, params)
• Often, this is an algebraic expression
• <word, [1,1,1]> -> <word, 3>
58. Word count: Your application
• Modules to import
• Setting the master host
• DDFS
• Job()
• result_iterator(Job.wait())
• Job.purge()
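Assembled from the pieces on the next three slides, the whole application file could look roughly like this (a sketch: the DDFS tag name is a placeholder, and the master host is left to Disco's default settings):

from disco.core import Job, result_iterator

def fun_map(line, params):
    for word in line.split():
        yield word, 1

def fun_reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == "__main__":
    # input: DDFS tags (tag://...) or plain file/http URLs; "data:bigtxt" is a placeholder
    job = Job().run(input=["tag://data:bigtxt"], map=fun_map, reduce=fun_reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print (word, count)
    job.purge()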
59. Word count: Your map
def fun_map(line, params):
    for word in line.split():
        yield word, 1
60. Word count: Your reduce
def fun_reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

Built-in: disco.worker.classic.func.sum_reduce()
61. Word count: Your results
job = Job().run(input=…, map=fun_map, reduce=fun_reduce)
for word, count in result_iterator(job.wait(show=True)):
    print (word, count)
job.purge()
62. Word count: More advanced
class MyJob1(Job):
    @classmethod
    def map(self, data, params):
        …
    @classmethod
    def reduce(self, iter, params):
        …
…
MyJob2.run(input=MyJob1.wait())  # <- Job chaining
63. Disco @ SpilGames
• Event Tracking & Advertising related jobs
• Heatmap: page clicks -> 2D density distributions
• Reconstructing sessions
• Ad research
• Behavioral modeling
• Log crunching
• Gameplays per country
• Frontend performance (CDN)
• 404s, response code tracking
• Intrusion detection #security
64. Disco @ SpilGames
• Calculate your resource need estimates
• Deploy in workflow
• We have
• Git
• Package repository / Deployment Orchestration
• Disco-tools: http://github.com/spilgames/disco-tools/
• Job runner: http://jobrunner/
• Data warehouse
• Interactive, graphical report generation
67. CDN Availability monitoring
• Question
• Availability of each CDN provider
• Data source
• Javascript sampler on the client side
• LoadBalancer -> HA logging endpoints -> Access logs -> Disco Distributed FS
72. CDN Availability monitoring (reduce)
Availability of <hw, [1,1,1,0,1,1,1,0,1,1,0,1]>
• kvgroup(iter)
• The trick:
• samples = […]
• len(samples) -> number of all samples
• sum(samples) -> number of available samples
• A = sum()/len() * 100.0
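Put together, the availability reduce could look roughly like this (a sketch following the trick above; the key is the provider identifier and the values are the 0/1 samples):

def availability_reduce(iter, params):
    # <provider, [1,1,0,1,...]> -> <provider, availability in %>
    from disco.util import kvgroup
    for provider, samples in kvgroup(sorted(iter)):
        samples = list(samples)
        # sum(samples): successful probes, len(samples): all probes
        yield provider, 100.0 * sum(samples) / len(samples)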
76. CDN Performance
• Question
• 95th percentile of response times per CDN per country
• Data source
• Javascript sampler on the client side
• LB -> HA Logging endpoints -> Access logs -> DDFS
• Input
• /res.ext?v=o,1234|l,2345|c,3456&ipaddr=127.0.0.1
• Expected output
• ProviderN CountryA: 3891 ms CountryB: 1198 ms …
• ProviderC CountryA: 3793 ms CountryB: 1397 ms …
• ProviderE CountryA: 3676 ms CountryB: 1676 ms …
• ProviderL CountryA: 4332 ms CountryB: 1233 ms …
77. The 95th percentile
A 95th percentile says that 95% of the time data points are below that
value and 5% of the time they are above that value.
95 is a magic number used in networking because you have to plan for
the most-of-the-time case.
78. CDN Performance (map)
v=o,1234|l,2345|c,3456&ipaddr=127.0.0.1
• Line parsing is about the same
• Advanced key: <cdn:country, performance>
• How to get the country from an IP?
• Job().run(…required_modules=["GeoIP"]…)
• No global variables within map()! – Why?
• Use Job().run(…params={}…) instead
• yield "%s:%s" % (cdnName, country), cdnPerf
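A sketch of what that map could look like. The meanings of the sampler fields (assuming l carries the measured response time in ms), the regex, and the way the CDN name travels in params are all assumptions added for illustration:

def perf_map(line, params):
    # GeoIP is shipped to the nodes via Job().run(required_modules=["GeoIP"], ...)
    import re
    import GeoIP
    # hypothetical line format: /res.ext?v=o,1234|l,2345|c,3456&ipaddr=127.0.0.1
    m = re.search(r"v=([^&]+)&ipaddr=([\d.]+)", line)
    if not m:
        return
    fields = dict(kv.split(",", 1) for kv in m.group(1).split("|"))
    cdn_perf = int(fields.get("l", 0))          # assumed response-time field
    cdn_name = params["cdnName"]                # passed in via params, no globals
    geo = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)   # opened per call only for brevity
    country = geo.country_code_by_addr(m.group(2))
    yield "%s:%s" % (cdn_name, country), cdn_perf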
79. CDN Performance (reduce)
# <hw, [123, 234, 345, 456, 567, 678, 798]>
def percentile(N, percent, key=lambda x: x):
    import math
    if not N:
        return None
    k = (len(N) - 1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return key(N[int(k)])
    d0 = key(N[int(f)]) * (c - k)
    d1 = key(N[int(c)]) * (k - f)
    return d0 + d1
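The function above linearly interpolates a percentile from a sorted sample list. A sketch of wiring it into the reduce (assuming percentile is made available on the workers, e.g. defined alongside the reduce or shipped with the job):

def perf_reduce(iter, params):
    from disco.util import kvgroup
    for key, samples in kvgroup(sorted(iter)):
        # key is "cdn:country"; emit its 95th-percentile response time
        yield key, percentile(sorted(samples), 0.95)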
80. Other goodies
• Outputs
• Print to screen
• Write to a file
• Write to DDFS – Why not?
• Another MR job with chaining
• Email it
• Write to MySQL
• Write to Vertica
• Zip and upload to Spil OOSS
81. Steps to get to our Disco landscape
1. Question & Data source
• Javascript code
• Nginx endpoint
• Logrotate
• (de-personalize)
• DDFS load scripts
2. MR jobs
3. Jobrunner jobs
4. Present your results
82. Bad habits
• Editing on live servers
• No version control
• No staging environment
• Not using a deployment mechanism
• Not using Continuous Integration
• Poor parsing
• No redundancy for MC applications
• Not purging your job
• Not documenting your job
• Using hard-coded configuration inside MR code
83. Bad habits cont.
• No peer review
• Not getting back events from slaves
• Using job.wait()
• Job().run(partitions=1)
84. Summary
• Writing Disco jobs can be easy
• Finding the right abstraction for a problem is not…
• A framework is on the way -> DRY
• You can find a lot of good patterns in SET and other jobs
You successfully took a step toward understanding how to
• Process large amounts of data
• Solve some specific problems with MR
85. Bonus: Outlook
• Ecosystems
• DiscoDB: lightning-fast key->value mapping
• Discodex: disco + ddfs + discodb
• Disco vs. Hadoop
• HDFS, Hadoop ecosystem
• NoSQL result stores