Dataiku pig - hive - cascading
1. Pig Hive Cascading
Hadoop In Practice
} Devoxx 2013
} Florian Douetteau
2. About me
Florian Douetteau <florian.douetteau@dataiku.com>
} CEO at Dataiku
} Freelance at Criteo (Online Ads)
} CTO at IsCool Ent. (#1 French Social Gamer)
} VP R&D Exalead (Search Engine Technology)
Dataiku Training – Hadoop for Data Science 4/14/13 2
3. Agenda
} Hadoop and Context (->0:03)
} Pig, Hive, Cascading, … (->0:06)
} How they work (->0:09)
} Comparing the tools (->0:25)
} Wrap-up and questions (->0:30)
Dataiku - Pig, Hive and Cascading
5. How do I (pre)process data?
Data sources:
} Implicit User Data (Views, Searches…): 500TB
} Explicit User Data (Click, Buy, …): 50TB
} User Information (Location, Graph…): 1TB
} Content Data (Title, Categories, Price, …): 200GB
} A/B Test Data
Processing steps:
} Online User Information
} Transformation Predictor
} Transformation Matrix
} Per User Stats
} Per Content Stats
} User Similarity
} Content Similarity
} Rank Predictor
} Predictor Runtime
6. Typical Use Case 1
Web Analytics Processing
} Analyse Raw Logs
(Trackers, Web Logs)
} Extract IP, Page, …
} Detect and remove
robots
} Build Statistics
◦ Number of page views per
product
◦ Best Referrers
◦ Traffic Analysis
◦ Funnel
◦ SEO Analysis
◦ …
7. Typical Use Case 2
Mining Search Logs for Synonyms
} Extract Query Logs
} Perform query
normalization
} Compute Ngrams
} Compute Search
“Sessions”
} Compute Log-Likelihood
Ratio for
ngrams across
sessions
8. Typical Use Case 3
Product Recommender
} Compute User –
Product Association
Matrix
} Compute different
similarities ratio
(Ochiai, Cosine, …)
} Filter out bad
predictions
} For each user, select
best recommendable
products
9. Agenda
} Hadoop and Context
} Pig, Hive, Cascading, …
} How they work
} Comparing the tools
10. Pig History
} Yahoo Research in 2006
} Inspired by Sawzall, a Google paper
from 2003
} 2007 as an Apache Project
} Initial motivation
◦ Search Log Analytics: how long is the
average user session? how many links does
a user click on before leaving a website?
how do click patterns vary in the course of a
day/week/month? …
words = LOAD '/training/hadoop-wordcount/output'
USING PigStorage('\t')
AS (word:chararray, count:int);
sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;
DUMP first_words;
11. Hive History
} Developed by Facebook in January 2007
} Open source in August 2008
} Initial Motivation
◦ Provide a SQL like abstraction to perform
statistics on status updates
create external table wordcounts (
word string,
count int
) row format delimited fields terminated by '\t'
location '/training/hadoop-wordcount/output';
select * from wordcounts order by count desc limit 10;
select SUM(count) from wordcounts where word like 'th%';
12. Cascading History
} Authored by Chris Wensel 2008
} Associated Projects
◦ Cascalog: Cascading in Clojure
◦ Scalding: Cascading in Scala (Twitter
in 2012)
◦ Lingual (to be released soon): SQL
layer on top of Cascading
13. Agenda
} Hadoop and Context
} Pig, Hive, Cascading, …
} How they work
} Comparing the tools
15. Pig & Hive
Mapping to Mapreduce jobs
events = LOAD '/events' USING PigStorage('\t') AS
(type:chararray, user:chararray, price:int, timestamp:int);
-- The slide elides the filter predicate; 'buy' is a hypothetical event type.
events_filtered = FILTER events BY type == 'buy';
by_user = GROUP events_filtered BY user;
price_by_user = FOREACH by_user GENERATE group AS user,
SUM(events_filtered.price) AS total_price,
MAX(events_filtered.timestamp) AS max_ts;
high_pbu = FILTER price_by_user BY total_price > 1000;
Job 1: Mapper: LOAD, FILTER
GROUP: shuffle and sort by user
Job 1: Reducer: FOREACH, FILTER
* VAT excluded
Dataiku - Innovation Services 4/14/13 15
16. Pig & Hive
Mapping to Mapreduce jobs
events = LOAD '/events' USING PigStorage('\t') AS
(type:chararray, user:chararray, price:int, timestamp:int);
-- The slide elides the filter predicate; 'buy' is a hypothetical event type.
events_filtered = FILTER events BY type == 'buy';
by_user = GROUP events_filtered BY user;
price_by_user = FOREACH by_user GENERATE group AS user,
SUM(events_filtered.price) AS total_price,
MAX(events_filtered.timestamp) AS max_ts;
high_pbu = FILTER price_by_user BY total_price > 1000;
recent_high = ORDER high_pbu BY max_ts DESC;
STORE recent_high INTO '/output';
Job 1: Mapper: LOAD, FILTER
GROUP: shuffle and sort by user
Job 1: Reducer: FOREACH, FILTER
Job 2: Mapper: LOAD (from tmp)
ORDER: shuffle and sort by max_ts
Job 2: Reducer: STORE
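The two-job plan on this slide can be mimicked in plain Python. This is only an illustrative sketch of the map / shuffle / reduce hand-off, not how Pig executes anything: `run_job`, the sample events and the `"buy"` filter value are all hypothetical.

```python
from collections import defaultdict


def run_job(records, mapper, reducer, descending=False):
    """Simulate one MapReduce job: map, shuffle and sort by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    # sorted() stands in for Hadoop's shuffle-and-sort phase.
    for key in sorted(groups, reverse=descending):
        output.extend(reducer(key, groups[key]))
    return output


# Hypothetical events: (type, user, price, timestamp)
events = [
    ("buy", "alice", 800, 10),
    ("buy", "alice", 400, 20),
    ("view", "bob", 0, 22),
    ("buy", "bob", 1500, 25),
    ("buy", "carol", 100, 40),
]


def job1_map(event):
    # LOAD + FILTER run in the mapper; the key drives the GROUP.
    kind, user, price, ts = event
    if kind == "buy":
        yield user, (price, ts)


def job1_reduce(user, values):
    # FOREACH (aggregation) + FILTER run in the reducer.
    total = sum(price for price, _ in values)
    max_ts = max(ts for _, ts in values)
    if total > 1000:
        yield (user, total, max_ts)


def job2_map(row):
    # The second job re-reads job 1's output and keys it by max_ts.
    yield row[2], row


def job2_reduce(max_ts, rows):
    yield from rows


high_pbu = run_job(events, job1_map, job1_reduce)
recent_high = run_job(high_pbu, job2_map, job2_reduce, descending=True)
print(recent_high)
```

The ORDER needs its own job because it sorts on a different key (`max_ts`) than the GROUP (`user`), and each job gets only one shuffle.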
17. Pig
How does it work
Data Execution Plan compiled into 10
map reduce jobs executed in parallel
(or not)
TResolution = LOAD '$PREFIX/dwh_dim_external_tracking_resolution/dt=$DAY' USING PigStorage('\u0001');
TResolution = FOREACH TResolution GENERATE $0 AS SKResolutionId, $1 as ResolutionId;

TSiteMap = LOAD '$PREFIX/dwh_dim_sitemapnode/dt=$DAY' USING PigStorage('\u0001');
TSiteMap = FOREACH TSiteMap GENERATE $0 AS SKSimteMapNodeId, $2 as SiteMapNodeId;

TCustomer = LOAD '$PREFIX/customer_relation/dt=$DAY' USING PigStorage('\u0001')
    as (SKCustomerId:chararray,
    CustomerId:chararray);

F1 = FOREACH F1 GENERATE *, (date_time IS NOT NULL ? CustomFormatToISO(date_time, 'yyyy-MM-dd HH:mm:ss'

F2 = FOREACH F1 GENERATE *,
    CONCAT(CONCAT(CONCAT(CONCAT(visid_high,'-'), visid_low), '-'), visit_num) as VisitId,
    (referrer matches '.*cdiscount.com.*' OR referrer matches 'cdscdn.com' ? NULL :referrer ) as Referrer,
    (iso IS NOT NULL ? ISODaysBetween(iso, '1899-12-31T00:00:00') : NULL)
    AS SkDateId,
    (iso IS NOT NULL ? ISOSecondsBetween(iso, ISOToDay(iso)) : NULL)
    AS SkTimeId,
    ((event_list is not null and event_list matches '.*\\b202\\b.*') ? 'Y' : 'N') as is_202,
    ((event_list is not null and event_list matches '.*\\b10\\b.*') ? 'Y' : 'N') as is_10,
    ((event_list is not null and event_list matches '.*\\b12\\b.*') ? 'Y' : 'N') as is_12,
    ((event_list is not null and event_list matches '.*\\b13\\b.*') ? 'Y' : 'N') as is_13,
    ((event_list is not null and event_list matches '.*\\b14\\b.*') ? 'Y' : 'N') as is_14,
    ((event_list is not null and event_list matches '.*\\b11\\b.*') ? 'Y' : 'N') as is_11,
    ((event_list is not null and event_list matches '.*\\b1\\b.*') ? 'Y' : 'N') as is_1,
    REGEX_EXTRACT(pagename, 'F-(.*):.*', 1) AS ProductReferenceId,
    NULL AS OriginFile;

SET DEFAULT_PARALLEL 24;

F3 = JOIN F2 BY post_search_engine LEFT, TSearchEngine BY SearchEngineId USING 'replicated' PARALLEL 20 ;
F3 = FOREACH F3 GENERATE *, (SKSearchEngineId IS NULL ? '-1' : SKSearchEngineId) as SKSearchEngineId;
--F3 = FOREACH F2 GENERATE *, NULL AS SKSearchEngineId, NULL AS SearchEngineId;

F4 = JOIN F3 BY browser LEFT, TBrowser BY BrowserId USING 'replicated' PARALLEL 20;
F4 = FOREACH F4 GENERATE *, (SKBrowserId IS NULL ? '-1' : SKBrowserId) as SKBrowserId;

--F4 = FOREACH F3 GENERATE *, NULL AS SKBrowserId, NULL AS BrowserId;

F5 = JOIN F4 BY os LEFT, TOperatingSystem BY OperatingSystemId USING 'replicated' PARALLEL 20;
F5 = FOREACH F5 GENERATE *, (SKOperatingSystemId IS NULL ? '-1' : SKOperatingSystemId) as SKOperatingSystemId;

--F5 = FOREACH F4 GENERATE *, NULL AS SKOperatingSystemId, NULL AS OperatingSystemId;

F6 = JOIN F5 BY resolution LEFT, TResolution BY ResolutionId USING 'replicated' PARALLEL 20;
F6 = FOREACH F6 GENERATE *, (SKResolutionId IS NULL ? '-1' : SKResolutionId) as SKResolutionId;

--F6 = FOREACH F5 GENERATE *, NULL AS SKResolutionId, NULL AS ResolutionId;

F7 = JOIN F6 BY post_evar4 LEFT, TSiteMap BY SiteMapNodeId USING 'replicated' PARALLEL 20;
F7 = FOREACH F7 GENERATE *, (SKSimteMapNodeId IS NULL ? '-1' : SKSimteMapNodeId) as SKSimteMapNodeId;

--F7 = FOREACH F6 GENERATE *, NULL AS SKSimteMapNodeId, NULL AS SiteMapNodeId;

SPLIT F7 INTO WITHOUT_CUSTOMER IF post_evar30 IS NULL, WITH_CUSTOMER IF post_evar30 IS NOT NULL;

F8 = JOIN WITH_CUSTOMER BY post_evar30 LEFT, TCustomer BY CustomerId USING 'skewed' PARALLEL 20;
WITHOUT_CUSTOMER = FOREACH WITHOUT_CUSTOMER GENERATE *, NULL as SKCustomerId, NULL as CustomerId;

--F8_UNION = FOREACH F7 GENERATE *, NULL as SKCustomerId, NULL as CustomerId;
F8_UNION = UNION F8, WITHOUT_CUSTOMER;
--DESCRIBE F8;
--DESCRIBE WITHOUT_CUSTOMER;
--DESCRIBE F8_UNION;

F9 = FOREACH F8_UNION GENERATE
    visid_high,
    visid_low,
    VisitId,
    post_evar30,
    SKCustomerId,
    visit_num,
    SkDateId,
    SkTimeId,
    post_evar16,
    post_evar52,
    visit_page_num,
    is_202,
    is_10,
    is_12,
19. Hive Joins
How to join with MapReduce ?
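Mappers output: each row is tagged with its table index (tbl_idx) and keyed by uid
Table 1 (tbl_idx, uid, name): (1, 1, Dupont), (1, 2, Durand)
Table 2 (tbl_idx, uid, type): (2, 1, Type1), (2, 1, Type2), (2, 2, Type1)
Shuffle by uid
Sort by (uid, tbl_idx)
Reducer 1 input (uid, tbl_idx, name or type): (1, 1, Dupont), (1, 2, Type1), (1, 2, Type2)
Reducer 1 output (uid, name, type): (1, Dupont, Type1), (1, Dupont, Type2)
Reducer 2 input (uid, tbl_idx, name or type): (2, 1, Durand), (2, 2, Type1)
Reducer 2 output (uid, name, type): (2, Durand, Type1)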
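The join on this slide can be sketched in a few lines of Python. This is a minimal simulation of the tag, shuffle, sort and merge steps, assuming both tables fit in memory; the function names are hypothetical and this is not Hive's actual implementation.

```python
from collections import defaultdict

names = [(1, "Dupont"), (2, "Durand")]               # table 1: (uid, name)
types = [(1, "Type1"), (1, "Type2"), (2, "Type1")]   # table 2: (uid, type)


def map_phase(names, types):
    """Tag every row with its table index and key it by uid."""
    for uid, name in names:
        yield uid, (1, name)
    for uid, kind in types:
        yield uid, (2, kind)


def reduce_side_join(names, types):
    # Shuffle: all rows sharing a uid land in the same reducer.
    by_uid = defaultdict(list)
    for uid, tagged in map_phase(names, types):
        by_uid[uid].append(tagged)

    joined = []
    for uid in sorted(by_uid):
        seen_names = []
        # Sorting by tbl_idx guarantees name rows (index 1) arrive first,
        # so the reducer only has to buffer the first table's rows.
        for tbl_idx, value in sorted(by_uid[uid]):
            if tbl_idx == 1:
                seen_names.append(value)
            else:
                joined.extend((uid, name, value) for name in seen_names)
    return joined


result = reduce_side_join(names, types)
print(result)
```

The secondary sort on `tbl_idx` is what keeps reducer memory bounded: only the rows of the first table for the current key are held while the second table streams through.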
20. Agenda
} Hadoop and Context
} Pig, Hive, Cascading, …
} How they work
} Comparing the tools
21. Comparing without Comparable
} Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema
} Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment
} Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration
} Performance and optimization
22. Procedural Vs Declarative
} Procedural (Pig): transformation as a sequence of operations
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
} Declarative (SQL): transformation as a set of formulas
insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
select name, ipaddr
from users join clicks on (users.name = clicks.user)
where value > 0
) using ipaddr
group by dma;
23. Data type and Model
Rationale
} All three extend the basic data model with extended
data types
◦ array-like [ event1, event2, event3]
◦ map-like { type1:value1, type2:value2, …}
} Different approach
◦ Resilient Schema
◦ Static Typing
◦ No Static Typing
24. Hive
Data Type and Schema
CREATE TABLE visit (
user_name STRING,
user_id INT,
user_details STRUCT<age:INT, zipcode:INT>
);
Simple type | Details
TINYINT, SMALLINT, INT, BIGINT | 1, 2, 4 and 8 bytes
FLOAT, DOUBLE | 4 and 8 bytes
BOOLEAN |
STRING | Arbitrary-length, replaces VARCHAR
TIMESTAMP |
Complex type | Details
ARRAY | Array of typed items (0-indexed)
MAP | Associative map
STRUCT | Complex class-like objects
25. Data types and Schema
Pig
rel = LOAD '/folder/path/'
USING PigStorage('\t')
AS (col:type, col:type, col:type);
Simple type | Details
int, long, float, double | 32 and 64 bits, signed
chararray | A string
bytearray | An array of … bytes
boolean | A boolean
Complex type | Details
tuple | An ordered set of fields (fieldname:value)
bag | A collection of tuples (duplicates allowed)
26. Data Type and Schema
Cascading
} Support for Any Java Types, provided they can be
serialized in Hadoop
} No support for Typing
Simple type | Details
Int, Long, Float, Double | 32 and 64 bits, signed
String | A string
byte[] | An array of … bytes
Boolean | A boolean
Complex type | Details
Object | Object must be “Hadoop serializable”
27. Style Summary
Tool | Style | Typing | Data Model | Metadata store
Pig | Procedural | Static + Dynamic | scalar + tuple + bag (fully recursive) | No (HCatalog)
Hive | Declarative | Static + Dynamic, enforced at execution time | scalar + list + map | Integrated
Cascading | Procedural | Weak | scalar + java objects | No
28. Comparing without Comparable
} Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema
} Productivity
◦ Headachability
◦ Checkpointing
◦ Testing, error management and environment
} Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration
} Performance and optimization
29. Headachability
Motivation
} Does debugging
the tool lead to bad
headaches ?
30. Headaches
Pig
} Out Of Memory Error (Reducer)
} Exceptions in Built-in /
Extended Functions
(handling of null)
} Null vs “”
} Nested Foreach and scoping
} Date Management (pig 0.10)
} Field implicit ordering
32. Headaches
Hive
} Out of Memory Errors in
Reducers
} Few Debugging Options
} Null / “”
} No builtin “first”
33. Headaches
Cascading
} Weak Typing Errors (comparing
Int and String … )
} Illegal Operation Sequence
(Group after group …)
} Field Implicit Ordering
34. Testing
Motivation
} How to perform unit tests ?
} How to have different versions of the same script
(parameter) ?
35. Testing
Pig
} System Variables
} Comment to test
} No Meta Programming
} pig -x local to execute on local files
36. Testing / Environment
Cascading
} JUnit tests are possible
} Ability to use code to actually comment out some
variables
37. Checkpointing
Motivation
} Lots of iteration while developing on Hadoop
} Sometimes jobs fail
} Sometimes you need to restart from the start …
Parse Logs → Per Page Stats → Page User Correlation → Filtering → Output
FIX and relaunch
38. Pig
Manual Checkpointing
} STORE command to manually
store files
Parse Logs → Per Page Stats → Page User Correlation → Filtering → Output
} Comment out the beginning
of the script and relaunch
39. Cascading
Automated Checkpointing
} Ability to re-run a
flow automatically
from the last saved
checkpoint
addCheckpoint(…)
40. Cascading
Topological Scheduler
} Check the timestamp of each intermediate file
} Execute a step only if its input is more recent than its output
Parse Logs → Per Page Stats → Page User Correlation → Filtering → Output
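The timestamp rule on this slide is the same idea as `make`. Below is a minimal Python sketch of it, not Cascading's scheduler: `needs_run`, `run_pipeline` and `fake_job` are hypothetical helpers, and the demo pins file modification times by hand so the outcome is deterministic.

```python
import os
import tempfile


def needs_run(input_paths, output_path):
    """A step must run if its output is missing or older than any input."""
    if not os.path.exists(output_path):
        return True
    output_mtime = os.path.getmtime(output_path)
    return any(os.path.getmtime(p) > output_mtime for p in input_paths)


def run_pipeline(steps):
    """steps: (name, inputs, output, action) tuples in topological order.

    Returns the names of the steps that were actually executed.
    """
    executed = []
    for name, inputs, output, action in steps:
        if needs_run(inputs, output):
            action(inputs, output)
            executed.append(name)
    return executed


def fake_job(inputs, output):
    # Stand-in for a real job. The output's mtime is pinned just after the
    # newest input so the demo does not depend on filesystem clock resolution.
    with open(output, "w") as out:
        out.write("done\n")
    newest_input = max(os.path.getmtime(p) for p in inputs)
    os.utime(output, (newest_input + 1, newest_input + 1))


with tempfile.TemporaryDirectory() as workdir:
    logs = os.path.join(workdir, "logs.txt")
    parsed = os.path.join(workdir, "parsed.txt")
    stats = os.path.join(workdir, "stats.txt")
    with open(logs, "w") as f:
        f.write("GET /index\n")
    os.utime(logs, (1000, 1000))

    steps = [
        ("parse", [logs], parsed, fake_job),
        ("stats", [parsed], stats, fake_job),
    ]
    first_run = run_pipeline(steps)     # nothing exists yet
    second_run = run_pipeline(steps)    # every output is up to date
    os.utime(logs, (2000, 2000))        # simulate fresh logs arriving
    third_run = run_pipeline(steps)

print(first_run, second_run, third_run)
```

Because steps are visited in topological order, refreshing an early output automatically makes every downstream step stale as well.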
41. Productivity Summary
Tool | Headaches | Checkpointing / Replay | Testing / Metaprogramming
Pig | Lots | Manual Save | Difficult
Hive | Few, but without debugging options | None (That’s SQL) | None (That’s SQL)
Cascading | Weak Typing Complexity | Checkpointing, Partial Updates | Possible
42. Comparing without Comparable
} Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema
} Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment
} Integration
◦ Formats Integration
◦ Partitioning
◦ External Code Integration
} Performance and optimization
43. Formats Integration
Motivation
} Ability to integrate different file formats
◦ Text Delimited
◦ Sequence File (Binary Hadoop format)
◦ Avro, Thrift ..
} Ability to integrate with external data sources or
sinks (MongoDB, ElasticSearch, databases, …)
Format impact on size and performance
Format | Size on Disk (GB) | Hive Processing time (24 cores)
Text File, uncompressed | 18.7 | 1m32s
1 Text File, Gzipped | 3.89 | 6m23s (no parallelization)
JSON compressed | 7.89 | 2m42s
multiple text file gzipped | 4.02 | 43s
Sequence File, Block, Gzip | 5.32 | 1m18s
Text File, LZO Indexed | 7.03 | 1m22s
44. Format Integration
} Hive: SerDe (Serializer/Deserializer)
} Pig : Storage
} Cascading: Tap
45. Partitions
Motivation
} No support for “UPDATE” patterns, any increment is
performed by adding or deleting a partition
} Common partition schemas on Hadoop
◦ By Date /apache_logs/dt=2013-01-23
◦ By Data center /apache_logs/dc=redbus01/…
◦ By Country
◦ …
◦ Or any combination of the above
46. Hive Partitioning
Partitioned tables
CREATE TABLE event (
user_id INT,
type STRING,
message STRING)
PARTITIONED BY (day STRING, server_id STRING);
Disk structure
/hive/event/day=2013-01-27/server_id=s1/file0
/hive/event/day=2013-01-27/server_id=s1/file1
/hive/event/day=2013-01-27/server_id=s2/file0
/hive/event/day=2013-01-27/server_id=s2/file1
…
/hive/event/day=2013-01-28/server_id=s2/file0
/hive/event/day=2013-01-28/server_id=s2/file1
INSERT OVERWRITE TABLE event
PARTITION(day='2013-01-27', server_id='s1')
SELECT * FROM event_tmp;
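Since the partition values live in the directory names, a reader can discard whole partitions without opening a single file. The Python sketch below illustrates that idea on the paths from this slide; `partition_path`, `parse_partition` and `prune` are hypothetical helpers, not Hive functions.

```python
def partition_path(table_root, partition_values):
    """Build a partition directory from ordered (column, value) pairs."""
    parts = [f"{column}={value}" for column, value in partition_values]
    return "/".join([table_root.rstrip("/")] + parts)


def parse_partition(file_path, table_root):
    """Recover the partition values encoded in a data file's path."""
    relative = file_path[len(table_root.rstrip("/")) + 1:]
    segments = relative.split("/")[:-1]  # drop the file name
    return dict(segment.split("=", 1) for segment in segments)


def prune(file_paths, table_root, **wanted):
    """Keep only the files whose partition values match every filter."""
    return [
        path for path in file_paths
        if all(parse_partition(path, table_root).get(column) == value
               for column, value in wanted.items())
    ]


root = "/hive/event"
files = [
    "/hive/event/day=2013-01-27/server_id=s1/file0",
    "/hive/event/day=2013-01-27/server_id=s1/file1",
    "/hive/event/day=2013-01-27/server_id=s2/file0",
    "/hive/event/day=2013-01-27/server_id=s2/file1",
    "/hive/event/day=2013-01-28/server_id=s2/file0",
    "/hive/event/day=2013-01-28/server_id=s2/file1",
]

target = partition_path(root, [("day", "2013-01-27"), ("server_id", "s1")])
one_day = prune(files, root, day="2013-01-28")
one_server = prune(files, root, day="2013-01-27", server_id="s1")
print(target, len(one_day), len(one_server))
```

An `INSERT OVERWRITE` into one partition then amounts to replacing the contents of a single directory such as `target`, which is how increments are handled without any UPDATE support.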
47. Cascading Partition
} No Direct support for partition
} Support for “Glob” Tap, to read from files
using patterns
} → You can code your own custom or virtual
partition schemes
50. Cascading
Direct Code Evaluation
51. Integration
Summary
Tool | Partition / Incremental Updates | External Code | Format Integration
Pig | No Direct Support | Simple | Doable and rich community
Hive | Fully integrated, SQL Like | Very simple, but complex dev setup | Doable and existing community
Cascading | With Coding | Complex UDFs but regular, and Java Expression embeddable | Doable and growing community
52. Comparing without Comparable
} Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema
} Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment
} Integration
◦ Formats Integration
◦ Partitioning
◦ External Code Integration
} Performance and optimization
53. Optimization
} Several Common Map Reduce Optimization
Patterns
◦ Combiners
◦ MapJoin
◦ Job Fusion
◦ Job Parallelism
◦ Reducer Parallelism
} Different support per framework
◦ Fully Automatic
◦ Pragma / Directives / Options
◦ Coding style / Code to write
54. Combiner
Perform Partial Aggregate at Mapper Stage
SELECT date, COUNT(*) FROM product GROUP BY date
Without a combiner, the mappers forward every raw row to the reducer.
Map output (raw rows):
2012-02-14 4354
…
2012-02-15 21we2
2012-02-14 qa334
…
2012-02-15 23aq2
Reduce output (date, count):
2012-02-14 20
2012-02-15 35
2012-02-16 1
55. Combiner
Perform Partial Aggregate at Mapper Stage
SELECT date, COUNT(*) FROM product GROUP BY date
Map input (raw rows):
2012-02-14 4354
…
2012-02-15 21we2
2012-02-14 qa334
…
2012-02-15 23aq2
Map output after the combiner (date, partial count):
Mapper 1: 2012-02-14 8, 2012-02-15 12
Mapper 2: 2012-02-14 12, 2012-02-15 23, 2012-02-16 1
Reduce output (date, count):
2012-02-14 20
2012-02-15 35
2012-02-16 1
Reduced network bandwidth. Better parallelism
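The effect of a combiner on this COUNT(*) query can be sketched in Python. This is a toy simulation with hypothetical helper names and made-up row identifiers, only meant to show that the result is unchanged while fewer records cross the network.

```python
from collections import Counter


def map_without_combiner(rows):
    """Emit one (date, 1) pair per input row."""
    return [(date, 1) for date, _row_id in rows]


def map_with_combiner(rows):
    """Pre-aggregate inside the mapper: one (date, partial_count) per date."""
    return list(Counter(date for date, _row_id in rows).items())


def reduce_counts(mapper_outputs):
    """Sum the (partial) counts received from every mapper."""
    totals = Counter()
    for output in mapper_outputs:
        for date, count in output:
            totals[date] += count
    return dict(totals)


# Two input splits; the row ids are placeholders.
split_1 = [("2012-02-14", "4354"), ("2012-02-15", "21we2"), ("2012-02-14", "a1")]
split_2 = [("2012-02-14", "qa334"), ("2012-02-15", "23aq2"), ("2012-02-16", "z9")]
splits = [split_1, split_2]

plain = [map_without_combiner(split) for split in splits]
combined = [map_with_combiner(split) for split in splits]

shuffled_plain = sum(len(output) for output in plain)
shuffled_combined = sum(len(output) for output in combined)

print(reduce_counts(plain), shuffled_plain)
print(reduce_counts(combined), shuffled_combined)
```

This only works because counting is associative and commutative, so partial counts can be summed in any order; the saving grows with the number of rows per date in each split.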
56. Join Optimization
Map Join
Hive
set hive.auto.convert.join = true;
Pig
Cascading
( no aggregation support after HashJoin)
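A map join avoids the shuffle entirely by loading the small table into memory in every mapper and probing it row by row. This is the strategy behind Hive's `hive.auto.convert.join` above and Pig's `USING 'replicated'` seen in the earlier script. The Python sketch below shows the idea on hypothetical data; `map_join` is an illustrative helper, not a framework API.

```python
def map_join(big_rows, small_rows, key):
    """Inner join done entirely map-side: no shuffle, no reducer."""
    # Build phase: hash the small table once (it must fit in memory).
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # Probe phase: stream the big table and look up each row's key.
    for row in big_rows:
        for match in lookup.get(row[key], []):
            yield {**row, **match}


# Hypothetical data: a large fact table and a small dimension table.
visits = [
    {"user": "u1", "browser": "ff"},
    {"user": "u2", "browser": "ch"},
    {"user": "u3", "browser": "xx"},   # no matching dimension row
]
browsers = [
    {"browser": "ff", "name": "Firefox"},
    {"browser": "ch", "name": "Chrome"},
]

joined = list(map_join(visits, browsers, "browser"))
print(joined)
```

The trade-off is memory: each mapper holds a full copy of the small table, so this only pays off when one side of the join is small.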
57. Number of Reducers
} Critical for performance
} Estimated from the size of the input files
◦ Hive
– input size divided by hive.exec.reducers.bytes.per.reducer (default 1GB)
◦ Pig
– input size divided by pig.exec.reducers.bytes.per.reducer (default 1GB)
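The estimate above is simple arithmetic, sketched here in Python. The 1GB default comes from the slide; the upper bound is an assumption added for illustration (both tools expose a maximum-reducers setting, but the value 999 used here should be checked against your version).

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer=10**9, max_reducers=999):
    """Reducers = input size / bytes per reducer, rounded up, then clamped.

    bytes_per_reducer mirrors *.exec.reducers.bytes.per.reducer (1GB on the
    slide); max_reducers is an assumed cap, not a value from the slide.
    """
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


small = estimate_reducers(18_700_000_000)        # the 18.7 GB text file
medium = estimate_reducers(200 * 10**9)          # 200 GB of content data
huge = estimate_reducers(50 * 10**12)            # 50 TB: hits the cap
print(small, medium, huge)
```

Because the estimate looks only at input size, a job whose data shrinks or explodes before the reduce phase may still need the reducer count set by hand, which is the "DIY" option in the summary.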
58. Performance & Optimization
Summary
Tool | Combiner Optimization | Join Optimization | Number of reducers optimization
Pig | Automatic | Option | Estimate or DIY
Cascading | DIY | HashJoin | DIY
Hive | Partial DIY | Automatic (Map Join) | Estimate or DIY
59. Agenda
} Hadoop and Context (->0:03)
} Pig, Hive, Cascading, … (->0:06)
} How they work (->0:09)
} Comparing the tools (->0:25)
} Wrap-up and questions (->0:30)
60. } Want to keep close to SQL ?
◦ Hive
} Want to write large flows ?
◦ Pig
} Want to integrate in large scale programming
projects ?
◦ Cascading (Cascalog / Scalding)