Dataiku pig - hive - cascading

Pig Hive Cascading
Hadoop In Practice

}  Devoxx 2013
}  Florian Douetteau

About me

Florian Douetteau <florian.douetteau@dataiku.com>

}  CEO at Dataiku
}  Freelance at Criteo (Online Ads)
}  CTO at IsCool Ent. (#1 French Social Gamer)
}  VP R&D Exalead (Search Engine Technology)

Dataiku Training – Hadoop for Data Science 4/14/13 2

Agenda

}  Hadoop and Context (->0:03)
}  Pig, Hive, Cascading, … (->0:06)
}  How they work (->0:09)
}  Comparing the tools (->0:25)
}  Wrap-up and questions (->0:30)

Dataiku - Pig, Hive and Cascading

CHOOSE TECHNOLOGY
(Map of the data technology landscape)
}  Regions: NoSQL-Slavia, Scalability Central, Machine Learning Mystery Land, SQL Columnar Republic, Statistician Old House, Visualization County, Data Clean Wasteland
}  Technologies on the map: Elastic Search, SOLR, MongoDB, Cassandra, Riak, Membase, Hadoop, Ceph, Sphere, Spark, Scikit-Learn, Mahout, WEKA, MLBase, LibSVM, InfiniDB, Vertica, GreenPlum, Impala, Netezza, SAS, SPSS, R, Pandas, RapidMiner, QlikView, Tableau, SpotFire, HTML5/D3, Pig, Cascading, Talend

How do I (pre)process data?
(Data flow diagram)
}  Inputs
◦  Implicit User Data (Views, Searches…): 500TB
◦  Explicit User Data (Click, Buy, …): 50TB
◦  User Information (Location, Graph…): 1TB
◦  Content Data (Title, Categories, Price, …): 200GB
◦  A/B Test Data
}  Processing boxes
◦  Transformation Matrix, Transformation Predictor
◦  Per User Stats, Per Content Stats
◦  User Similarity, Content Similarity
◦  Rank Predictor, Predictor Runtime, Online User Information


Typical Use Case 1 
Web Analytics Processing
}  Analyse Raw Logs (Trackers, Web Logs)
}  Extract IP, Page, …
}  Detect and remove robots
}  Build Statistics
◦  Number of page views, per product
◦  Best Referrers
◦  Traffic Analysis
◦  Funnel
◦  SEO Analysis
◦  …


Mining Search Logs for Synonyms
}  Extract Query Logs
}  Perform query normalization
}  Compute Ngrams
}  Compute Search “Sessions”
}  Compute Log-Likelihood Ratio for ngrams across sessions


Product Recommender
}  Compute User-Product Association Matrix
}  Compute different similarity ratios (Ochiai, Cosine, …)
}  Filter out bad predictions
}  For each user, select best recommendable products
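
The two similarity ratios named above can be sketched in a few lines of Python. This is only an illustration on in-memory data, not the recommender from the talk; the function names and the sample buyer sets are hypothetical.

```python
import math

def ochiai_similarity(users_a, users_b):
    """Ochiai coefficient between two sets: |A & B| / sqrt(|A| * |B|)."""
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * vec_b[key] for key, value in vec_a.items() if key in vec_b)
    norm_a = math.sqrt(sum(value * value for value in vec_a.values()))
    norm_b = math.sqrt(sum(value * value for value in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical data: the set of users who bought each product.
buyers_p1 = {"u1", "u2"}
buyers_p2 = {"u2", "u3"}
similarity = ochiai_similarity(buyers_p1, buyers_p2)
```

On binary (bought / not bought) data the two measures coincide; cosine becomes useful once the matrix holds weights such as purchase counts.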


Agenda

}  Hadoop and Context
}  Pig, Hive, Cascading, …
}  How they work
}  Comparing the tools


Pig History

}  Yahoo Research in 2006
}  Inspired by Sawzall, a Google paper from 2003
}  2007 as an Apache Project

}  Initial motivation
◦  Search Log Analytics: how long is the average user session? How many links does a user click on before leaving a website? How do click patterns vary in the course of a day/week/month? …

words = LOAD '/training/hadoop-wordcount/output'
    USING PigStorage('\t')
    AS (word:chararray, count:int);

sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;

DUMP first_words;


Hive History

}  Developed by Facebook in January 2007

}  Open source in August 2008

}  Initial Motivation
◦  Provide a SQL-like abstraction to perform statistics on status updates

create external table wordcounts (
    word string,
    count int
) row format delimited fields terminated by '\t'
location '/training/hadoop-wordcount/output';

select * from wordcounts order by count desc limit 10;

select SUM(count) from wordcounts where word like 'th%';

Cascading History

}  Authored by Chris Wensel in 2008

}  Associated Projects
◦  Cascalog: Cascading in Clojure
◦  Scalding: Cascading in Scala (Twitter in 2012)
◦  Lingual (to be released soon): SQL layer on top of Cascading


MapReduce 
Simplicity is a complexity


Pig & Hive 
Mapping to Mapreduce jobs
events = LOAD '/events' USING PigStorage('\t') AS
    (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type;
by_user = GROUP events_filtered BY user;
price_by_user = FOREACH by_user GENERATE group AS user,
    SUM(events_filtered.price) AS total_price,
    MAX(events_filtered.timestamp) AS max_ts;
high_pbu = FILTER price_by_user BY total_price > 1000;

Job 1, Mapper: LOAD, FILTER
Shuffle and sort by user
Job 1, Reducer: GROUP, FOREACH, FILTER

* VAT excluded


Pig & Hive 
Mapping to Mapreduce jobs
events = LOAD '/events' USING PigStorage('\t') AS
    (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type;
by_user = GROUP events_filtered BY user;
price_by_user = FOREACH by_user GENERATE group AS user,
    SUM(events_filtered.price) AS total_price,
    MAX(events_filtered.timestamp) AS max_ts;
high_pbu = FILTER price_by_user BY total_price > 1000;
recent_high = ORDER high_pbu BY max_ts DESC;
STORE recent_high INTO '/output';

Job 1, Mapper: LOAD, FILTER
Shuffle and sort by user
Job 1, Reducer: GROUP, FOREACH, FILTER

Job 2, Mapper: LOAD (from tmp)
Shuffle and sort by max_ts
Job 2, Reducer: STORE
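
To make the two-job mapping concrete, here is a minimal in-memory simulation in Python. It is a sketch, not how Pig is implemented: `run_job` and the mapper/reducer names are hypothetical, and the filter condition `type == "buy"` is an assumption because the slide does not give one.

```python
from collections import defaultdict

def run_job(records, mapper, reducer, descending=False):
    """Simulate one MapReduce job: map, shuffle and sort by key, reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)      # shuffle: route values by key
    output = []
    for key in sorted(shuffled, reverse=descending):
        output.extend(reducer(key, shuffled[key]))
    return output

def job1_mapper(event):
    """LOAD + FILTER, keyed by user for the GROUP."""
    type_, user, price, timestamp = event
    if type_ == "buy":                        # assumed condition, not on the slide
        yield user, (price, timestamp)

def job1_reducer(user, values):
    """GROUP + FOREACH + FILTER on the aggregated total."""
    total_price = sum(price for price, _ in values)
    max_ts = max(timestamp for _, timestamp in values)
    if total_price > 1000:
        yield (user, total_price, max_ts)

def job2_mapper(row):
    """Re-key the temporary output by max_ts for the ORDER."""
    yield row[2], row

def job2_reducer(_max_ts, rows):
    """STORE: emit rows in the order the sorted keys arrive."""
    yield from rows

def run_script(events):
    tmp = run_job(events, job1_mapper, job1_reducer)
    return run_job(tmp, job2_mapper, job2_reducer, descending=True)

# Hypothetical events: (type, user, price, timestamp)
events = [
    ("buy", "alice", 800, 10),
    ("buy", "alice", 400, 30),
    ("view", "alice", 0, 40),
    ("buy", "bob", 1500, 50),
    ("buy", "carol", 100, 5),
]
result = run_script(events)
```

The second job exists only because the ORDER needs a different shuffle key (max_ts) than the GROUP (user); everything that shares a key fits into a single job.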


Pig  
How does it work
Data execution plan compiled into 10 MapReduce jobs, executed in parallel (or not)
TResolution = LOAD '$PREFIX/dwh_dim_external_tracking_resolution/dt=$DAY' USING PigStorage('\u0001');
TResolution = FOREACH TResolution GENERATE $0 AS SKResolutionId, $1 as ResolutionId;

TSiteMap = LOAD '$PREFIX/dwh_dim_sitemapnode/dt=$DAY' USING PigStorage('\u0001');
TSiteMap = FOREACH TSiteMap GENERATE $0 AS SKSimteMapNodeId, $2 as SiteMapNodeId;

TCustomer = LOAD '$PREFIX/customer_relation/dt=$DAY' USING PigStorage('\u0001')
    as (SKCustomerId:chararray,
        CustomerId:chararray);

F1 = FOREACH F1 GENERATE *, (date_time IS NOT NULL ? CustomFormatToISO(date_time, 'yyyy-MM-dd HH:mm:ss'

F2 = FOREACH F1 GENERATE *,
    CONCAT(CONCAT(CONCAT(CONCAT(visid_high,'-'), visid_low), '-'), visit_num) as VisitId,
    (referrer matches '.*cdiscount.com.*' OR referrer matches 'cdscdn.com' ? NULL :referrer ) as Referrer,
    (iso IS NOT NULL ? ISODaysBetween(iso, '1899-12-31T00:00:00') : NULL)
        AS SkDateId,
    (iso IS NOT NULL ? ISOSecondsBetween(iso, ISOToDay(iso)) : NULL)
        AS SkTimeId,
    ((event_list is not null and event_list matches '.*b202b.*') ? 'Y' : 'N') as is_202,
    REGEX_EXTRACT(pagename, 'F-(.*):.*', 1) AS ProductReferenceId,
    NULL AS OriginFile;

SET DEFAULT_PARALLEL 24;

F3 = JOIN F2 BY post_search_engine LEFT, TSearchEngine BY SearchEngineId USING 'replicated' PARALLEL 20 ;
F3 = FOREACH F3 GENERATE *, (SKSearchEngineId IS NULL ? '-1' : SKSearchEngineId) as SKSearchEngineId;
--F3 = FOREACH F2 GENERATE *, NULL AS SKSearchEngineId, NULL AS SearchEngineId;

F4 = JOIN F3 BY browser LEFT, TBrowser BY BrowserId USING 'replicated' PARALLEL 20;
F4 = FOREACH F4 GENERATE *, (SKBrowserId IS NULL ? '-1' : SKBrowserId) as SKBrowserId;

--F4 = FOREACH F3 GENERATE *, NULL AS SKBrowserId, NULL AS BrowserId;

F5 = JOIN F4 BY os LEFT, TOperatingSystem BY OperatingSystemId USING 'replicated' PARALLEL 20;
F5 = FOREACH F5 GENERATE *, (SKOperatingSystemId IS NULL ? '-1' : SKOperatingSystemId) as SKOperatingSystemId;

--F5 = FOREACH F4 GENERATE *, NULL AS SKOperatingSystemId, NULL AS OperatingSystemId;

F6 = JOIN F5 BY resolution LEFT, TResolution BY ResolutionId USING 'replicated' PARALLEL 20;
F6 = FOREACH F6 GENERATE *, (SKResolutionId IS NULL ? '-1' : SKResolutionId) as SKResolutionId;

--F6 = FOREACH F5 GENERATE *, NULL AS SKResolutionId, NULL AS ResolutionId;

F7 = JOIN F6 BY post_evar4 LEFT, TSiteMap BY SiteMapNodeId USING 'replicated' PARALLEL 20;
F7 = FOREACH F7 GENERATE *, (SKSimteMapNodeId IS NULL ? '-1' : SKSimteMapNodeId) as SKSimteMapNodeId;

--F7 = FOREACH F6 GENERATE *, NULL AS SKSimteMapNodeId, NULL AS SiteMapNodeId;

SPLIT F7 INTO WITHOUT_CUSTOMER IF post_evar30 IS NULL, WITH_CUSTOMER IF post_evar30 IS NOT NULL;

F8 = JOIN WITH_CUSTOMER BY post_evar30 LEFT, TCustomer BY CustomerId USING 'skewed' PARALLEL 20;
WITHOUT_CUSTOMER = FOREACH WITHOUT_CUSTOMER GENERATE *, NULL as SKCustomerId, NULL as CustomerId;

--F8_UNION = FOREACH F7 GENERATE *, NULL as SKCustomerId, NULL as CustomerId;
F8_UNION = UNION F8, WITHOUT_CUSTOMER;
--DESCRIBE F8;
--DESCRIBE WITHOUT_CUSTOMER;
--DESCRIBE F8_UNION;

F9 = FOREACH F8_UNION GENERATE
    visid_high,
    visid_low,
    VisitId,
    post_evar30,
    SKCustomerId,
    visit_num,
    SkDateId,
    SkTimeId,
    post_evar16,
    post_evar52,
    visit_page_num,
    is_202,
    is_10,
    is_12,


Cascading 
From Code To Jobs


Hive Joins 
How to join with MapReduce ?

Mappers output (each row is tagged with its table index)

tbl_idx  uid  name
1        1    Dupont
1        2    Durand

tbl_idx  uid  type
2        1    Type1
2        1    Type2
2        2    Type1

Shuffle by uid
Sort by (uid, tbl_idx)

Reducer 1
uid  tbl_idx  name    type
1    1        Dupont
1    2                Type1
1    2                Type2
Output
uid  name    type
1    Dupont  Type1
1    Dupont  Type2

Reducer 2
uid  tbl_idx  name    type
2    1        Durand
2    2                Type1
Output
uid  name    type
2    Durand  Type1
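
The reduce-side join on this slide can be sketched in Python as follows. The rows are the ones shown on the slide; the function names are hypothetical and the whole thing runs in memory, so it only illustrates the tagging, shuffling and sorting steps rather than Hive's actual implementation.

```python
from collections import defaultdict

# Rows from the slide: (uid, name) and (uid, type).
users = [(1, "Dupont"), (2, "Durand")]
types = [(1, "Type1"), (1, "Type2"), (2, "Type1")]

def map_phase(users, types):
    """Tag each row with its table index and emit ((uid, tbl_idx), value)."""
    for uid, name in users:
        yield (uid, 1), name
    for uid, type_ in types:
        yield (uid, 2), type_

def reduce_side_join(users, types):
    # Shuffle by uid: every row with the same uid reaches the same reducer.
    partitions = defaultdict(list)
    for (uid, tbl_idx), value in map_phase(users, types):
        partitions[uid].append((tbl_idx, value))

    joined = []
    for uid in sorted(partitions):
        names = []
        # Sort by tbl_idx so rows of table 1 arrive before rows of table 2.
        for tbl_idx, value in sorted(partitions[uid], key=lambda row: row[0]):
            if tbl_idx == 1:
                names.append(value)           # buffer the (small) left side
            else:
                joined.extend((uid, name, value) for name in names)
    return joined

joined_rows = reduce_side_join(users, types)
```

Sorting on (uid, tbl_idx) is what lets the reducer buffer only the first table and stream the second one.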


Comparing without Comparable

}  Philosophy
◦  Procedural Vs Declarative
◦  Data Model and Schema
}  Productivity
◦  Headachability
◦  Checkpointing
◦  Testing and environment
}  Integration
◦  Partitioning
◦  Formats Integration
◦  External Code Integration
}  Performance and optimization


Procedural Vs Declarative

}  Procedural (Pig): transformation as a sequence of operations

Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

}  Declarative (SQL): transformation as a set of formulas

insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
) using ipaddr
group by dma;


Data type and Model 
Rationale
}  All three extend the basic data model with extended data types
◦  array-like [ event1, event2, event3]
◦  map-like { type1:value1, type2:value2, …}

}  Different approaches
◦  Resilient Schema
◦  Static Typing
◦  No Static Typing


Hive 
Data Type and Schema
CREATE TABLE visit (
user_name STRING,
user_id INT,
user_details STRUCT<age:INT, zipcode:INT>
);

Simple type                      Details
TINYINT, SMALLINT, INT, BIGINT   1, 2, 4 and 8 bytes
FLOAT, DOUBLE                    4 and 8 bytes
BOOLEAN
STRING                           Arbitrary-length, replaces VARCHAR
TIMESTAMP

Complex type                     Details
ARRAY                            Array of typed items (0-indexed)
MAP                              Associative map
STRUCT                           Complex class-like objects


Data types and Schema 
Pig

rel = LOAD '/folder/path/'
    USING PigStorage('\t')
    AS (col:type, col:type, col:type);

Simple type                Details
int, long, float, double   32 and 64 bits, signed
chararray                  A string
bytearray                  An array of … bytes
boolean                    A boolean

tuple                      An ordered fieldname:value map
bag                        A collection of tuples (duplicates allowed)


Data Type and Schema  
Cascading
}  Support for any Java type, provided it can be serialized in Hadoop
}  No support for typing

Simple type                Details
Int, Long, Float, Double   32 and 64 bits, signed
String                     A string
byte[]                     An array of … bytes
Boolean                    A boolean

Object                     Object must be “Hadoop serializable”


Style Summary

            Style         Typing                                         Data Model                               Metadata store
Pig         Procedural    Static + Dynamic                               scalar + tuple + bag (fully recursive)   No (HCatalog)
Hive        Declarative   Static + Dynamic, enforced at execution time   scalar + list + map                      Integrated
Cascading   Procedural    Weak                                           scalar + java objects                    No


}  Philosophy
◦  Testing, error management and environment
}  Integration
◦  Partitioning


Headachability 
Motivation
}  Does debugging the tool lead to bad headaches?


Headaches 
Pig
}  Out Of Memory Error (Reducer)

}  Exception in Builtin / Extended Functions (handling of null)

}  Null vs “”

}  Nested Foreach and scoping

}  Date Management (pig 0.10)

}  Field implicit ordering


A Pig Error


Headaches 
Hive
}  Out of Memory Errors in Reducers

}  Few Debugging Options

}  Null / “”

}  No builtin “first”


Headaches 
Cascading
}  Weak Typing Errors (comparing Int and String …)

}  Illegal Operation Sequence (Group after group …)

}  Field Implicit Ordering


Testing 
Motivation
}  How to perform unit tests ?
}  How to have different versions of the same script (parameter)?


Testing 
Pig
}  System Variables
}  Comment to test
}  No Meta Programming
}  pig -x local to execute on local files


Testing / Environment  
Cascading
}  Junit Tests are possible
}  Ability to use code to actually comment out some variables


Checkpointing  
Motivation
}  Lots of iterations while developing on Hadoop
}  Sometimes jobs fail
}  Sometimes you need to restart from the start …

Parse Logs → Per Page Stats → Page User Correlation → Filtering → Output

FIX and relaunch


Pig 
Manual Checkpointing
}  STORE command to manually store files

// COMMENT beginning of script and relaunch


Cascading  
Automated Checkpointing
}  Ability to re-run a flow automatically from the last saved checkpoint

addCheckpoint(…)


Cascading  
Topological Scheduler
}  Check each file intermediate timestamp
}  Execute only if more recent
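
The timestamp check behind this kind of scheduling can be sketched in plain Python. This is not the Cascading API: `needs_rerun` and `run_flow` are hypothetical helpers that only show the "execute only if the source is more recent than the target" rule on local files.

```python
import os

def needs_rerun(source_path, target_path):
    """A step is stale when its target is missing or older than its source."""
    if not os.path.exists(target_path):
        return True
    return os.path.getmtime(source_path) > os.path.getmtime(target_path)

def run_flow(steps):
    """Run steps in order, skipping those whose target is up to date.

    steps is a list of (source_path, target_path, action) tuples, where
    action(source_path, target_path) writes the target file.
    Returns the list of targets that were actually rebuilt.
    """
    rebuilt = []
    for source_path, target_path, action in steps:
        if needs_rerun(source_path, target_path):
            action(source_path, target_path)
            rebuilt.append(target_path)
    return rebuilt
```

Because each rebuilt target gets a fresh timestamp, every downstream step that reads it becomes stale in turn, which is what makes a relaunch after a fix pick up from the right place.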



Productivity Summary

            Headaches                            Checkpointing / Replay           Testing / Metaprogramming
Pig         Lots                                 Manual Save                      Difficult
Hive        Few, but without debugging options   None (That’s SQL)                None (That’s SQL)
Cascading   Weak Typing, Complexity              Checkpointing, Partial Updates   Possible



}  Philosophy
◦  Testing and environment
}  Integration
◦  Partitioning


Formats Integration 
Motivation
}  Ability to integrate different file formats
◦  Text Delimited
◦  Sequence File (Binary Hadoop format)
◦  Avro, Thrift ..
}  Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, Database, …)

Format impact on size and performance

Format                         Size on Disk (GB)   Hive processing time (24 cores)
Text file, uncompressed        18.7                1m32s
1 text file, gzipped           3.89                6m23s (no parallelization)
JSON, compressed               7.89                2m42s
Multiple text files, gzipped   4.02                43s
Sequence File, block, gzip     5.32                1m18s
Text file, LZO indexed         7.03                1m22s


Format Integration 

}  Hive: SerDe (Serializer/Deserializer)
}  Pig : Storage
}  Cascading: Tap


Partitions 
Motivation
}  No support for “UPDATE” patterns: any increment is performed by adding or deleting a partition
}  Common partition schemas on Hadoop
◦  By Date /apache_logs/dt=2013-01-23
◦  By Data center /apache_logs/dc=redbus01/…
◦  By Country
◦  …
◦  Or any combination of the above


Hive Partitioning 
Partitioned tables
CREATE TABLE event (
user_id INT,
type STRING,
message STRING)
PARTITIONED BY (day STRING, server_id STRING);
Disk structure

/hive/event/day=2013-01-27/server_id=s1/file0
…

INSERT OVERWRITE TABLE event
PARTITION (day='2013-01-27', server_id='s1')
SELECT * FROM event_tmp;


Cascading Partition

}  No direct support for partitions
}  Support for “Glob” Tap, to read from files using patterns

}  → You can code your own custom or virtual partition schemes
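
The idea of selecting partitions with a file pattern can be sketched in plain Python. This does not use Cascading; `select_partitions` is a hypothetical helper, and the paths follow the dt=... directory convention shown earlier in the deck.

```python
from fnmatch import fnmatchcase

def select_partitions(paths, pattern):
    """Keep only the partition directories whose path matches the glob pattern."""
    return [path for path in paths if fnmatchcase(path, pattern)]

# Hypothetical partition directories, one per day.
partitions = [
    "/apache_logs/dt=2013-01-22",
    "/apache_logs/dt=2013-01-23",
    "/apache_logs/dt=2013-02-01",
]

# A "virtual" monthly partition: every day of January 2013.
january = select_partitions(partitions, "/apache_logs/dt=2013-01-*")
```

A custom partition scheme then amounts to choosing the directory naming rule and the patterns that read it back.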


External Code Integration 
Simple UDF
Pig Hive

Cascading


Hive Complex UDF 
(Aggregators)


Cascading  
Direct Code Evaluation


Integration 
Summary

            Partition / Incremental Updates   External Code Integration                                  Format Integration
Pig         No Direct Support                 Simple                                                     Doable and rich community
Hive        Fully integrated, SQL Like        Very simple, but complex dev setup                         Doable and existing community
Cascading   With Coding                       Complex UDFs but regular, and Java Expression embeddable   Doable and growing community


Optimization

}  Several common MapReduce optimization patterns
◦  Combiners
◦  MapJoin
◦  Job Fusion
◦  Job Parallelism
◦  Reducer Parallelism
}  Different support per framework
◦  Fully Automatic
◦  Pragma / Directives / Options
◦  Coding style / Code to write


Combiner 
Perform Partial Aggregate at Mapper Stage

SELECT date, COUNT(*) FROM product GROUP BY date

Map output (one record per input row, all sent to the reducers)
2012-02-14  4354
…
2012-02-15  21we2
2012-02-14  qa334
…
2012-02-15  23aq2

Reduce output
2012-02-14  20
2012-02-15  35
2012-02-16  1

Combiner 
Perform Partial Aggregate at Mapper Stage

SELECT date, COUNT(*) FROM product GROUP BY date

Mapper 1 input
2012-02-14  4354
…
2012-02-15  21we2
Mapper 1 output (partial aggregate)
2012-02-14  8
2012-02-15  12

Mapper 2 input
2012-02-14  qa334
…
2012-02-15  23aq2
Mapper 2 output (partial aggregate)
2012-02-14  12
2012-02-15  23
2012-02-16  1

Reduce output
2012-02-14  20
2012-02-15  35
2012-02-16  1

Reduced network bandwidth. Better parallelism.
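
A minimal Python sketch of the same GROUP BY date / COUNT(*) with and without a combiner. The function names and the two input splits are hypothetical; the point is only that both variants give the same totals while the combiner version sends fewer records to the reducers.

```python
from collections import Counter

def map_without_combiner(rows):
    """Emit one (date, 1) record per input row."""
    return [(date, 1) for date, _product in rows]

def map_with_combiner(rows):
    """Emit one (date, partial_count) record per distinct date in the split."""
    return sorted(Counter(date for date, _product in rows).items())

def reduce_counts(mapper_outputs):
    """Sum the counts received from all mappers, per date."""
    totals = Counter()
    for output in mapper_outputs:
        for date, count in output:
            totals[date] += count
    return dict(totals)

# Hypothetical input splits: (date, product) rows, one split per mapper.
split_1 = [("2012-02-14", "4354"), ("2012-02-14", "qa334"), ("2012-02-15", "21we2")]
split_2 = [("2012-02-15", "23aq2"), ("2012-02-15", "4354"), ("2012-02-16", "qa334")]

plain = [map_without_combiner(split_1), map_without_combiner(split_2)]
combined = [map_with_combiner(split_1), map_with_combiner(split_2)]
records_shuffled_plain = sum(len(output) for output in plain)
records_shuffled_combined = sum(len(output) for output in combined)
```

This works because COUNT is associative: partial counts can be summed in any grouping. Aggregates without that property cannot be pushed into a combiner this directly.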


Join Optimization 
Map Join

Hive
set hive.auto.convert.join = true;
Pig

Cascading

( no aggregation support after HashJoin)
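
What a map join does can be sketched in Python: the small table is loaded into an in-memory hash table on every mapper, so the big table is joined while it is read and no shuffle is needed. `map_side_join` is a hypothetical helper, and the sketch assumes the small table has unique keys and fits in memory.

```python
def map_side_join(big_rows, small_table, key_index=0):
    """Inner join of big_rows against an in-memory copy of small_table.

    small_table is an iterable of (key, value) pairs with unique keys.
    Each matching big row is emitted with the looked-up value appended.
    """
    lookup = dict(small_table)            # replicated to every mapper
    for row in big_rows:
        key = row[key_index]
        if key in lookup:
            yield row + (lookup[key],)

# Hypothetical data: a large fact table and a small dimension table.
visits = [(1, "home"), (2, "cart"), (3, "search")]
browsers = [(1, "Firefox"), (3, "Chrome")]
joined = list(map_side_join(visits, browsers))
```

The memory assumption is the whole trade-off: when the small side stops fitting in a mapper's heap, the join has to fall back to the shuffle-based version.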


Number of Reducers

}  Critical for performance

}  Estimated from the size of the input files
◦  Hive
–  divide size by hive.exec.reducers.bytes.per.reducer (default 1GB)
◦  Pig
–  divide size by pig.exec.reducers.bytes.per.reducer (default 1GB)
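
The estimate boils down to a division rounded up. The sketch below assumes 1GB means 10^9 bytes and adds an upper bound of 999 reducers; both the cap and the function name are assumptions for illustration, not values taken from the slide.

```python
import math

BYTES_PER_REDUCER = 1_000_000_000   # 1GB per reducer, as quoted above (assumed 10^9 bytes)

def estimate_reducers(input_bytes, bytes_per_reducer=BYTES_PER_REDUCER, max_reducers=999):
    """Number of reducers = input size / bytes per reducer, rounded up.

    The result is kept between 1 and max_reducers (the cap is an assumption).
    """
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))

# 50 GB of input gives 50 reducers with the default setting.
reducers_for_50gb = estimate_reducers(50 * 10**9)
```

The estimate only sees the input size, so a job whose data shrinks a lot after filtering, or explodes after a join, is where setting the value by hand pays off.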


Performance & Optimization  
Summary

            Combiner Optimization   Join Optimization      Number of reducers optimization
Pig         Automatic               Option                 Estimate or DIY
Cascading   DIY                     HashJoin               DIY
Hive        Partial DIY             Automatic (Map Join)   Estimate or DIY


Agenda

}  Hadoop and Context (->0:03)
}  Pig, Hive, Cascading, … (->0:06)
}  How they work (->0:09)
}  Comparing the tools (->0:25)
}  Wrap-up and questions (->0:30)


}  Want to keep close to SQL ?
◦  Hive
}  Want to write large flows ?
◦  Pig
}  Want to integrate in large scale programming projects?
◦  Cascading (cascalog / scalding)

