Processing Big Data (Chapter 3, SC 11 Tutorial)
1. An Introduction to Data Intensive Computing
Chapter 3: Processing Big Data
Robert Grossman, University of Chicago & Open Data Group
Collin Bennett, Open Data Group
November 14, 2011
2. 1. Introduction (0830-0900)
a. Data clouds (e.g. Hadoop)
b. Utility clouds (e.g. Amazon)
2. Managing Big Data (0900-0945)
a. Databases
b. Distributed file systems (e.g. Hadoop)
c. NoSQL databases (e.g. HBase)
3. Processing Big Data (0945-1000 and 1030-1100)
a. Multiple virtual machines & message queues
b. MapReduce
c. Streams over distributed file systems
4. Lab using Amazon's Elastic Map Reduce (1100-1200)
3. Section 3.1 Processing Big Data Using Utility and Data Clouds
A Google production rack of servers from about 1999.
4. • How do you do analytics over commodity disks and processors?
• How do you improve the efficiency of programmers?
5. Serial & SMP Algorithms
[Diagram: a serial algorithm runs one task at a time against a single local disk*; a Symmetric Multiprocessing (SMP) algorithm runs several tasks in parallel against shared local disk*.]
* local disk and memory
6. Pleasantly (= Embarrassingly) Parallel
[Diagram: many independent tasks, each with its own local disk, coordinated with MPI.]
• Need to partition data, start tasks, collect results.
• Often the tasks are organized into a DAG.
8. The Google Data Stack
• The Google File System (2003)
• MapReduce: Simplified Data Processing… (2004)
• BigTable: A Distributed Storage System… (2006)
9. Google's Large Data Cloud
Applications
Compute Services: Google's MapReduce
Data Services: Google's BigTable
Storage Services: Google File System (GFS)
Google's early data stack, circa 2000.
10. Hadoop's Large Data Cloud (Open Source)
Applications
Compute Services: Hadoop's MapReduce
Data Services: NoSQL, e.g. HBase
Storage Services: Hadoop Distributed File System (HDFS)
Hadoop's stack.
12. The Amazon Data Stack
"Amazon uses a highly decentralized, loosely coupled, service oriented architecture consisting of hundreds of services. In this environment there is a particular need for storage technologies that are always available. For example, customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados." (SOSP '07)
14. Open Source Versions
• Eucalyptus
– Ability to launch VMs
– S3-like storage
• OpenStack
– Ability to launch VMs
– S3-like storage (Swift)
• Cassandra
– Key-value store like S3
– Columns like BigTable
• Many other open source Amazon-style services are available.
15. Some Programming Models for Data Centers
• Operations over a data center of disks
– MapReduce ("string-based" scans of data)
– User-Defined Functions (UDFs) over the data center
– Launch VMs that all have access to highly scalable and available disk-based data
– SQL and NoSQL over the data center
• Operations over a data center of memory
– Grep over distributed memory
– UDFs over distributed memory
– Launch VMs that all have access to highly scalable and available memory-based data
– SQL and NoSQL over distributed memory
16. Section 3.2 Processing Data by Scaling Out Virtual Machines
17. Processing Big Data Pattern 1: Launch Independent Virtual Machines and Task Them with a Messaging Service
18. Task With Messaging Service & Use S3 (Variant 1)
[Diagram: a control VM launches and tasks worker VMs through a messaging service (AWS SQS, an AMQP service, etc.); each worker VM runs a task and reads and writes S3.]
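A minimal in-process sketch of this pattern (not the AWS API): Python's queue.Queue stands in for the messaging service, threads stand in for worker VMs, and the task names are made up.

# A control process tasks workers through a queue; queue.Queue stands in
# for the messaging service and threads stand in for worker VMs.
import queue
import threading

NUM_WORKERS = 3                      # made-up number of worker VMs
tasks = queue.Queue()                # stand-in for AWS SQS / AMQP

def worker(worker_id):
    while True:
        task = tasks.get()
        if task is None:             # sentinel: no more work
            tasks.task_done()
            break
        # ... process the task here; results would go to S3 ...
        print("worker %d processed %s" % (worker_id, task))
        tasks.task_done()

# Control process: launch workers, enqueue tasks, send shutdown sentinels.
threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for task in ["chunk-1", "chunk-2", "chunk-3", "chunk-4"]:   # made-up tasks
    tasks.put(task)
for _ in range(NUM_WORKERS):         # one shutdown sentinel per worker
    tasks.put(None)
for t in threads:
    t.join()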
19. Task With Messaging Service & Use NoSQL DB (Variant 2)
[Diagram: same as Variant 1, but the worker VMs read and write AWS SimpleDB instead of S3.]
20. Task With Messaging Service & Use Clustered FS (Variant 3)
[Diagram: same as Variant 1, but the worker VMs read and write a GlusterFS clustered file system instead of S3.]
22. Core Concepts
• Data are (key, value) pairs, and that's it.
• Partition the data over commodity nodes filling racks in a data center (see the sketch after this slide).
• Software handles failures, restarts, etc. This is the hard part.
• Basic examples:
– Word count
– Inverted index
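A sketch of one common partitioning scheme (hash partitioning; the key and cluster size are made up): every machine hashes a key the same way, so any of them can locate the node that owns a pair without a central index.

# Sketch: locating the node that owns a (key, value) pair by hashing its key.
import zlib

NUM_NODES = 50                                    # made-up cluster size

def node_for(key):
    # A stable hash, so every machine computes the same mapping.
    return zlib.crc32(key.encode("utf-8")) % NUM_NODES

print(node_for("best"))   # the node holding the pair ("best", ...)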
24. [Diagram: three Task Trackers, each running Map tasks that read input from HDFS and write intermediate output to local disk; after the shuffle & sort phase, Reduce tasks read the intermediate data and write final output back to HDFS.]
25. Example: Word Count & Inverted Index
• How do you count the words in a million books?
– (best, 7)
• Inverted index:
– (best; page 1, page 82, …)
– (worst; page 1, page 12, …)
[Image: cover of the serial, Vol. V, 1859, London.]
26. • Assume you have a cluster of 50 computers, each with an attached local disk that is half full of web pages.
• What is a simple parallel programming framework that would support the computation of word counts and inverted indices?
27. Basic Pattern: Strings
1. Extract words from web pages in parallel.
2. Hash and sort the words.
3. Count (or construct an inverted index) in parallel.
28. What about data records?
For strings:
1. Extract words from web pages in parallel.
2. Hash and sort the words.
3. Count (or construct an inverted index) in parallel.
For data records:
1. Extract binned field values from data records in parallel.
2. Hash and sort the binned field values.
3. Count (or construct an inverted index) in parallel.
29. Map-Reduce Example
• Input is files with one document per record.
• The user specifies the map function:
– key = document URL
– value = document contents
Input of map: ("doc.cdickens.two_cities", "it was the best of times")
Output of map: ("it", 1), ("was", 1), ("the", 1), ("best", 1)
30. Example (cont'd)
• The MapReduce library gathers together all pairs with the same key (the shuffle/sort phase).
• The user-defined reduce function combines all the values associated with the same key (see the sketch after this slide).
Input of reduce:
key = it, values = 1, 1
key = was, values = 1, 1
key = best, values = 1
key = worst, values = 1
Output of reduce: ("it", 2), ("was", 2), ("best", 1), ("worst", 1)
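The whole pipeline fits in a few lines of Python; in this sketch, sorting plus groupby plays the role of the shuffle/sort phase, and the document is the made-up example from the slide above.

# The word-count pipeline in miniature: map, shuffle/sort, reduce.
from itertools import groupby
from operator import itemgetter

docs = {"doc.cdickens.two_cities":
        "it was the best of times it was the worst of times"}

# Map: emit a (word, 1) pair for every word of every document.
pairs = [(word, 1) for text in docs.values() for word in text.split()]

# Shuffle/sort: bring all pairs with the same key together.
pairs.sort(key=itemgetter(0))

# Reduce: combine all values associated with the same key.
for word, group in groupby(pairs, key=itemgetter(0)):
    print(word, sum(count for _, count in group))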
31. Why Is Word Count Important?
• It is one of the most important examples of the type of text processing often done with MapReduce.
• There is an important mapping (sketched after this slide):
document <-----> data record
words <-----> (field, value)
This mapping is called inversion.
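Under this mapping, the word-count mapper carries over almost verbatim; a sketch with made-up records and field names:

# Word count on data records: emit (field=value, 1) instead of (word, 1).
records = [                                   # made-up data records
    {"city": "Chicago", "status": "ok"},
    {"city": "Chicago", "status": "error"},
]

for record in records:                        # "map" over records
    for field, value in record.items():
        print("%s=%s\t%d" % (field, value, 1))   # e.g. "city=Chicago  1"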
32. Pleasantly Parallel vs. MapReduce

                 Pleasantly Parallel      MapReduce
Data structure   Arbitrary                (key, value) pairs
Functions        Arbitrary                Map & Reduce
Middleware       MPI (message passing)    Hadoop
Ease of use      Difficult                Medium
Scope            Wide                     Narrow
Challenge        Getting something        Moving to MapReduce
                 working
33. Common MapReduce Design Patterns
• Word count
• Inversion (inverted index)
• Computing simple statistics
• Computing windowed statistics
• Sparse matrices (document-term, data record-FieldBinValue, …)
• Site-entity statistics
• PageRank
• Partitioned and ensemble models
• EM
37. Idea 1: Apply User Defined Functions (UDFs) to Files in a Distributed File System
[Diagram: one UDF takes the place of the map/shuffle stage and another UDF takes the place of the reduce stage.]
This generalizes Hadoop's implementation of MapReduce over the Hadoop Distributed File System.
38. Idea 2: Add Security From the Start
[Diagram: the client and the master each talk to a security server over SSL; the security server holds the AAA data; the master coordinates the slaves.]
• The security server maintains information about users and slaves.
• User access control: password and client IP address.
• File level access control.
• Messages are encrypted over SSL. A certificate is used for authentication.
• Sector is a good basis for HIPAA compliant applications.
39. Idea 3: Extend the Stack to Include Network Transport Services
[Diagram: two stacks side by side. Google and Hadoop: Compute Services / Data Services / Storage Services. Sector: the same three layers plus a Routing & Transport Services layer at the bottom.]
40. Section 3.5 Computing With Streams: Warming Up With Means and Variances
41. Warm Up: Partitioned Means
Step 1. Compute the local tuple (Σ xi, Σ xi², ni) in parallel for each partition.
Step 2. Compute the global mean and variance from these tuples (see the sketch after this slide).
• Means and variances cannot be computed naively when the data is in distributed partitions.
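A sketch of the two steps in Python (the partitions are made up; in practice Step 1 runs in parallel, one tuple per partition):

# Step 1 runs once per partition (in parallel); Step 2 combines the tuples.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]    # made-up partitions

# Step 1: local (sum x_i, sum x_i^2, n_i) for each partition.
tuples = [(sum(p), sum(x * x for x in p), len(p)) for p in partitions]

# Step 2: global mean and variance from the tuples alone.
s = sum(t[0] for t in tuples)
s2 = sum(t[1] for t in tuples)
n = sum(t[2] for t in tuples)
mean = s / n
variance = s2 / n - mean * mean    # population variance: E[x^2] - (E[x])^2
print(mean, variance)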
42. Trivial Observation 1
If si = Σ xi is the i'th local sum, then the global mean is Σ si / Σ ni.
• If only the local means for each partition are passed (without the corresponding counts), then there is not enough information to compute the global mean.
• The same trick works for variance, but you need to pass the triples (Σ xi, Σ xi², ni).
43. Trivial Observation 2
• To reduce the data passed over the network, combine appropriate statistics as early as possible.
• Consider the average. Recall that with MapReduce there are 4 steps (Map, Shuffle, Sort and Reduce), and Reduce pulls data from the local disks of the nodes that performed the Map.
• A Combine step in MapReduce combines local data before it is pulled for the Reduce step (sketched after this slide).
• There are built-in combiners for counts, means, etc.
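A sketch of what a combiner buys for averages, with made-up keys and values: each map node ships one (sum, count) pair per key instead of every raw value, and the reducer merges the pairs.

# A combiner for averages: collapse local values before they cross the network.
from collections import defaultdict

def combine(pairs):
    # Runs on the map node: collapse local (key, x) pairs to (key, (sum, count)).
    acc = defaultdict(lambda: [0.0, 0])
    for key, x in pairs:
        acc[key][0] += x
        acc[key][1] += 1
    return dict(acc)

def reduce_means(partials):
    # Runs on the reduce node: merge partial sums and counts, then divide.
    acc = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (s, n) in partial.items():
            acc[key][0] += s
            acc[key][1] += n
    return {key: s / n for key, (s, n) in acc.items()}

node1 = combine([("a", 1.0), ("a", 3.0), ("b", 5.0)])   # made-up map output
node2 = combine([("a", 2.0), ("b", 7.0)])
print(reduce_means([node1, node2]))                     # {'a': 2.0, 'b': 6.0}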
46. Hadoop Streams
• In addition to the Java API, Hadoop offers:
– a streaming interface for any language that supports reading and writing to standard in and standard out
– Pipes for C++
• Why would I want to use something besides Java? Because Hadoop Streams provide direct access (without JNI/NIO) to:
– C++ libraries like Boost and the GNU Scientific Library (GSL)
– R modules
47. Pros and Cons
• Java
+ Best documented
+ Largest community
– More LOC per MR job
• Python
+ Efficient memory handling
+ Programmers can be very efficient
– Limited logging / debugging
• R
+ Vast collection of statistical algorithms
– Poor error handling and memory handling
– Less familiar to developers
48. Word Count Python Mapper

import sys

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            # Emit one (word, 1) pair per word, tab-separated.
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()
49. Word Count Python Reducer

import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(sep='\t'):
    data = read_mapper_output(sys.stdin, separator=sep)
    # Input is sorted by key, so groupby collects each word's counts.
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print "%s%s%d" % (word, sep, total_count)

if __name__ == "__main__":
    main()
50. MalStone Benchmark

                          MalStone A    MalStone B
Hadoop MapReduce          455m 13s      840m 50s
Hadoop Streams (Python)   87m 29s       142m 32s
C++ implemented UDFs      33m 40s       43m 44s

Sector/Sphere 1.20, Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. The data consisted of 20 nodes with 500 million 100-byte records per node.
51. Word Count R Mapper

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
# splitIntoWords is used below but was not defined on the slide;
# a minimal definition:
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep = ""), sep = "")
}
close(con)
52. Word Count R Reducer

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
53. Word Count R Reducer (cont'd)

    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")
54. Word Count Java Mapper

public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
55. Word Count Java Reducer

public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
56. Code Comparison – Word Count Mapper
[The slide shows the Python, Java, and R mappers from slides 48, 54, and 51 side by side for comparison.]
57. Code Comparison – Word Count Reducer
[The slide shows the Python, Java, and R reducers from slides 49, 55, 52, and 53 side by side for comparison.]
58. Questions?
For the most current version of these notes, see rgrossman.com.