3. Two ways to deal with this:
• Immediate gratification: long-term $$$ costs
• Misery & sleep deprivation: long-term benefits
4. The Crying Baby Problem ≈ The Impatient Boss Problem
• Crying baby: Wants Attention Now!
• Impatient boss: Wants Answers Now!
5. Two ways to analyze data
MapReduce way (immediate gratification):
Locate File → Determine Key Attributes → Hack it: Parse + Map + Reduce
Long-term cumulative costs because MR is slow!
DB & HadoopDB way (misery & sleep deprivation):
Locate File → Figure out schema → Load File → Organize or Index DB tables → Query: Determine Key Attributes → Process without Parse
Long-term benefits
6. The Problem
Can we get the immediate gratification of working with MapReduce and make progress towards the performance advantages of working with databases?
7. Our Solution
Begin with the MapReduce way.
File System: Locate File → Determine Key Attributes → Write Map/Reduce Scripts → Run it!
Database System (BEHIND THE SCENES, PER JOB, INCREMENTALLY): Figure out schema → Load File → Organize or Index DB tables
8. Figure out schema
P1) How to automatically figure out a schema?
Short answer: DON'T
• Split the map phase into Parse and Map phases.
• Enforce a simple Parse API: a Parser has one output method, getField(int id).
• Name a table after its Parser implementation and label attributes with their field id.
• Different parsers on the same file result in different tables.
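A minimal sketch of the Parse/Map split described above, written in Python rather than Hadoop's Java. Only the single output method (here `get_field`, mirroring `getField(int id)`) comes from the slide; `parse`, `CsvParser`, `map_phase`, `table_name`, `attribute_name`, and the `field_N` label format are hypothetical names chosen for illustration.

```python
from abc import ABC, abstractmethod


class Parser(ABC):
    """Parse phase: turns one raw record into fields addressed by integer id."""

    @abstractmethod
    def parse(self, record: str) -> None:
        """Hypothetical input hook; the slide only specifies the output method."""

    @abstractmethod
    def get_field(self, field_id: int) -> str:
        """The single output method, mirroring getField(int id)."""


class CsvParser(Parser):
    """Hypothetical parser for comma-separated records."""

    def parse(self, record):
        self._fields = record.rstrip("\n").split(",")

    def get_field(self, field_id):
        return self._fields[field_id]


def table_name(parser):
    # The table is named after the Parser implementation.
    return type(parser).__name__


def attribute_name(field_id):
    # Attributes are labeled with their field id (label format is assumed).
    return f"field_{field_id}"


def map_phase(parser, record):
    """Map phase: sees fields only through get_field, never the raw record."""
    parser.parse(record)
    return parser.get_field(0), int(parser.get_field(2))


parser = CsvParser()
key, value = map_phase(parser, "alice,ignored,42\n")
```

Because the Map code only ever asks for fields by id, no schema has to be declared up front: a second Parser implementation over the same file would simply yield a second table.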
9. Incrementally Load File
P2) How to load files with minimal marginal costs?
• Load only touched attributes (Vertical Partition)
– Requires a Column-Store
• Load only parts of a column (Horizontal Partition)
– After a file-split is processed by Map, its touched attributes are loaded entirely
– The number of splits of a file is a tunable parameter.
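The two partitioning ideas above can be sketched together in a few lines of Python. This is a toy under stated assumptions: the column store is a plain dictionary of lists, each job loads exactly one split, and `split_file`, `load_split`, and `touched_ids` are hypothetical names, not part of any real system.

```python
def split_file(rows, num_splits):
    """Cut the file into num_splits contiguous splits (the tunable parameter)."""
    size = -(-len(rows) // num_splits)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def load_split(store, split, touched_ids):
    """After Map has processed a split, append only its touched attributes."""
    for field_id in touched_ids:
        store.setdefault(field_id, []).extend(row[field_id] for row in split)


rows = [(i, i * 10, i * 100) for i in range(8)]   # 8 tuples, 3 attributes
splits = split_file(rows, num_splits=4)           # 4 splits of 2 tuples each
store = {}                                        # field id -> loaded column

load_split(store, splits[0], touched_ids=[0, 2])  # job 1 (assumed: one split per job)
load_split(store, splits[1], touched_ids=[0, 2])  # job 2
```

After two jobs, attributes 0 and 2 are half loaded and attribute 1, which no job touched, has cost nothing.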
10. Tuple construction
Some columns are at different loading stages.
– Maintain OIDs for each column: an address column
• The OIDs assigned are equivalent to the insertion order
– Keep a catalog to track loading progress
[Figure: columns a, b, c, d at different loading stages; labels "Process in DB" and "Use File System"]
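A small sketch of how tuples might be stitched together when columns are at different loading stages. It assumes OIDs are simply list positions (insertion order) and that the catalog is a dictionary from column name to the number of rows loaded so far; `construct_tuples` and the sample data are hypothetical.

```python
def construct_tuples(columns, catalog, wanted):
    """Stitch tuples by OID, up to the least-loaded of the wanted columns."""
    loaded = min(catalog.get(name, 0) for name in wanted)
    from_db = [tuple(columns[name][oid] for name in wanted)
               for oid in range(loaded)]
    return from_db, loaded


# Column a has 4 rows loaded, column b only 2; the raw file holds 5 rows.
columns = {"a": [1, 2, 3, 4], "b": [10, 20]}
catalog = {"a": 4, "b": 2}
file_rows = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]

from_db, loaded = construct_tuples(columns, catalog, wanted=["a", "b"])
from_file = file_rows[loaded:]   # rows not yet in the DB come from the file system
```

The prefix that every wanted column has loaded is processed in the DB; everything past that OID still comes from the file.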
11. Incrementally Organize file
P3) How to index a partially-loaded table?
If a selection filter is applied on an attribute, we organize it.
Dealing with partially loaded attributes
[Figure: columns c1 and c2 with address columns, combined by a JOIN; cell values did not survive extraction]
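To illustrate why an organized attribute needs an address column, here is a sketch in which a plain full sort stands in for the incremental strategy the slides refer to as IMS (see the paper for the actual method). The functions `organize`, `select_range`, and `join_on_address` and the sample columns are hypothetical.

```python
import bisect


def organize(column):
    """Sort a column's values, carrying each value's address (OID) along."""
    pairs = sorted((value, oid) for oid, value in enumerate(column))
    values = [value for value, _ in pairs]
    addresses = [oid for _, oid in pairs]
    return values, addresses


def select_range(values, addresses, low, high):
    """Addresses of rows with low <= value < high, found by binary search."""
    start = bisect.bisect_left(values, low)
    stop = bisect.bisect_left(values, high)
    return addresses[start:stop]


def join_on_address(addresses, other_column):
    """Fetch another column's values for the selected addresses."""
    return [other_column[oid] for oid in addresses]


c1 = [40, 10, 30, 20, 50]
c2 = ["e", "b", "d", "c", "f"]

values, addresses = organize(c1)                 # c1 is organized; c2 is untouched
hits = select_range(values, addresses, 20, 45)
result = join_on_address(hits, c2)
```

Once c1 is reordered its positions no longer line up with c2, so the address column is what allows the join back to columns that are still in insertion order.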
15. Setup
• Single-Machine Experiments
– Embarrassingly parallel
– No distributed reorganization or partitioning
• MonetDB (hacked to support IMS)
• Hadoop
• 2 GB file of 5 integer attributes: 107,374,182 tuples.
• See paper for more details
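The tuple count above can be reproduced with a short calculation, assuming 4-byte integers and reading "2 GB" as 2 × 1024³ bytes. Both are assumptions; the slide states neither.

```python
FILE_BYTES = 2 * 1024**3   # assumption: "2 GB" means 2 GiB
ATTRIBUTES = 5             # from the slide
BYTES_PER_INT = 4          # assumption: 32-bit integers

tuples = FILE_BYTES // (ATTRIBUTES * BYTES_PER_INT)
print(f"{tuples:,}")       # 107,374,182
```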
16. The big picture
[Figure: Time in Seconds (0 to 800) versus Job Sequence (axis ticks at 1, 10, 100). Series: SQL Pre-load, Incremental Reorganization (5/5), Incremental Reorganization (2/5), Invisible Loading (5/5), Invisible Loading (2/5), MapReduce]
19. Further Evaluation (Paper)
• In-depth study of IMS
– Comparison with Cracking and Pre-sorting
– Effect of integrating lightweight compression into IMS.
• Little mini-experiments
– Insertion vs. Copy
– Processing in DB vs. using DB as a fast access medium with all processing in MapReduce
20. Conclusion: Lessons Learned
• Engineering Nightmare
– Many complementing technologies
• Manimal, Adaptive Merging …
– In the era of Big Data we need to design more modular, plug-n-play tools
• Can of worms
– Most Big Data problems look deceptively simple until you start mucking around.
23. Why is loading this log file hard?
[Figure: excerpt of an Apache error log with [error], [notice], and [warn] entries; the log text did not survive extraction]
• Message field varies depending on application!
• What is the base schema? Time, Type, Message?
• Different tables for each type?
Context-dependent Schema Awareness
Different analysts know the schema of what they are looking for and don't care about other log messages.
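A sketch of context-dependent schema awareness on such a log: one parser extracts only the base schema (Time, Type, Message), while a second parser, written by an analyst who cares only about client errors, pulls extra fields out of the message and ignores everything else. The regular expressions, function names, and sample lines are hypothetical and merely follow the bracketed style of an Apache error log.

```python
import re

BASE = re.compile(
    r"^\[(?P<time>[^\]]+)\] \[(?P<type>[^\]]+)\] (?P<message>.*)$")
CLIENT_ERROR = re.compile(r"^\[client (?P<client>[^\]]+)\] (?P<detail>.*)$")


def parse_base(line):
    """Base schema: (time, type, message), or None if the line does not fit."""
    m = BASE.match(line)
    return (m["time"], m["type"], m["message"]) if m else None


def parse_client_error(line):
    """Analyst-specific schema: (time, client, detail) for error entries only."""
    base = parse_base(line)
    if base is None or base[1] != "error":
        return None  # not a log message this analyst cares about
    m = CLIENT_ERROR.match(base[2])
    return (base[0], m["client"], m["detail"]) if m else None


# Made-up sample lines in the style of the log on the slide.
error_line = ("[Sat Jul 21 10:00:00 2012] [error] [client 127.0.0.1] "
              "File does not exist: /tmp/missing.ico")
notice_line = "[Sat Jul 21 10:00:05 2012] [notice] caught SIGTERM, shutting down"

base_row = parse_base(notice_line)
error_row = parse_client_error(error_line)
skipped = parse_client_error(notice_line)
```

Each parser would populate its own table, so the client-error analyst never pays for loading notice or warn messages.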