Pig Hive Cascading
              Hadoop In Practice

        }    Devoxx 2013
        }    Florian Douetteau
About me

 Florian Douetteau <florian.douetteau@dataiku.com>




 }    CEO at Dataiku
 }    Freelance at Criteo (Online Ads)
 }    CTO at IsCool Ent. (#1 French Social Gamer)
 }    VP R&D Exalead (Search Engine Technology)

Dataiku Training – Hadoop for Data Science    4/14/13   2
Agenda


                                }    Hadoop and Context (->0:03)
                                }    Pig, Hive, Cascading, … (->0:06)
                                }    How they work (->0:09)
                                }    Comparing the tools (->0:25)
                                }    Wrap-up and questions (->0:30)




Dataiku - Pig, Hive and Cascading
CHOOSE TECHNOLOGY
NoSQL-Slavia!                      Scalability Central!               Machine Learning Mystery Land!
  Elastic Search, SOLR,              Hadoop, Ceph, Cassandra,           Scikit-Learn, Mahout, WEKA,
  MongoDB, Riak, Membase             Sphere, Spark                      MLBase, LibSVM

SQL Columnar Republic!             Visualization County!              Statistician Old House!
  InfiniDB, Vertica, GreenPlum,      QlikView, Tableau,                 SAS, RapidMiner, R, SPSS,
  Impala, Netezza                    SpotFire, HTML5/D3                 Pandas

Data Clean Wasteland!
  Pig, Cascading, Talend
     Dataiku - Pig, Hive and Cascading
How do I (pre)process data?
 Data sources and volumes:
   Implicit user data (views, searches, …)          ~500TB
   Explicit user data (clicks, buys, …)             ~50TB
   User information (location, graph, …)            ~1TB
   Content data (title, categories, price, …)       ~200GB

 [Diagram: these sources feed transformations (transformation matrix, per-user
 stats, per-content stats, user similarity, content similarity) that build a
 transformation predictor and a rank predictor; the predictor runtime combines
 them with online user information, and its output is measured with A/B test data.]
           Dataiku - Pig, Hive and Cascading
Typical Use Case 1

 Web Analytics Processing
}    Analyse Raw Logs
      (Trackers, Web Logs)
}    Extract IP, Page, …
}    Detect and remove
      robots
}    Build Statistics
      ◦  Number of page views, per
         product
      ◦  Best Referrers
      ◦  Traffic Analysis
      ◦  Funnel
      ◦  SEO Analysis
      ◦  …




                                    Dataiku - Pig, Hive and Cascading
Typical Use Case 2

Mining Search Logs for Synonyms
}  Extract Query Logs
}  Perform query
    normalization
}  Compute Ngrams
}  Compute Search
    “Sessions”
}  Compute Log-
    Likelihood Ratio for
    ngrams across
    sessions


                         Dataiku - Pig, Hive and Cascading
Typical Use Case 3

Product Recommender
}    Compute User –
      Product Association
      Matrix
}    Compute different
      similarity ratios
      (Ochiai, Cosine, …)
}    Filter out bad
      predictions
}    For each user, select
      best recommendable
      products


                              Dataiku - Pig, Hive and Cascading
Agenda


                                }    Hadoop and Context
                                }    Pig, Hive, Cascading, …
                                }    How they work
                                }    Comparing the tools




Dataiku - Pig, Hive and Cascading
Pig History

  }    Yahoo Research in 2006
  }    Inspired by Sawzall, a Google paper
        from 2003
  }    2007 as an Apache Project

  }    Initial motivation
        ◦  Search Log Analytics: how long is the
           average user session? How many links does
           a user click on before leaving a website?
           How do click patterns vary in the course of a
           day/week/month? …

words = LOAD '/training/hadoop-wordcount/output'
        USING PigStorage('\t')
        AS (word:chararray, count:int);

sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;

DUMP first_words;

 Dataiku - Pig, Hive and Cascading
Hive History

 }    Developed by Facebook in January 2007

 }    Open source in August 2008

 }    Initial Motivation
       ◦  Provide a SQL like abstraction to perform
          statistics on status updates


create external table wordcounts (
    word string,
    count int
) row format delimited fields terminated by '\t'
location '/training/hadoop-wordcount/output';

select * from wordcounts order by count desc limit 10;

select SUM(count) from wordcounts where word like 'th%';
Dataiku - Pig, Hive and Cascading
Cascading History

 }    Authored by Chris Wensel in 2008

 }    Associated Projects
       ◦  Cascalog: Cascading in Clojure
       ◦  Scalding: Cascading in Scala (Twitter,
          2012)
       ◦  Lingual (to be released soon): SQL
          layer on top of Cascading
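
 The Pig and Hive slides above each end with a word-count snippet; the
 corresponding Cascading figure is not reproduced in this transcript, so here
 is a minimal sketch of the canonical Cascading (2.x) word count in Java. The
 input and output paths are hypothetical.

 import java.util.Properties;

 import cascading.flow.FlowDef;
 import cascading.flow.hadoop.HadoopFlowConnector;
 import cascading.operation.aggregator.Count;
 import cascading.operation.regex.RegexSplitGenerator;
 import cascading.pipe.Each;
 import cascading.pipe.Every;
 import cascading.pipe.GroupBy;
 import cascading.pipe.Pipe;
 import cascading.scheme.hadoop.TextDelimited;
 import cascading.scheme.hadoop.TextLine;
 import cascading.tap.SinkMode;
 import cascading.tap.Tap;
 import cascading.tap.hadoop.Hfs;
 import cascading.tuple.Fields;

 public class WordCountFlow {
   public static void main(String[] args) {
     // hypothetical input (raw text) and output (word \t count) locations
     Tap docTap = new Hfs(new TextLine(new Fields("line")), "/training/docs");
     Tap wcTap  = new Hfs(new TextDelimited(new Fields("word", "count"), "\t"),
                          "/training/cascading-wordcount", SinkMode.REPLACE);

     Pipe pipe = new Pipe("wordcount");
     // split each line into words
     pipe = new Each(pipe, new Fields("line"),
                     new RegexSplitGenerator(new Fields("word"), "\\s+"));
     // group by word and count occurrences
     pipe = new GroupBy(pipe, new Fields("word"));
     pipe = new Every(pipe, Fields.ALL, new Count(new Fields("count")), Fields.ALL);

     FlowDef flowDef = FlowDef.flowDef()
         .addSource(pipe, docTap)
         .addTailSink(pipe, wcTap);
     new HadoopFlowConnector(new Properties()).connect(flowDef).complete();
   }
 }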




Dataiku - Pig, Hive and Cascading
Agenda


                                }    Hadoop and Context
                                }    Pig, Hive, Cascading, …
                                }    How they work
                                }    Comparing the tools




Dataiku - Pig, Hive and Cascading
MapReduce

 Simplicity is a complexity




Dataiku - Innovation Services   4/14/13   14
Pig & Hive

 Mapping to Mapreduce jobs
 events          = LOAD '/events' USING PigStorage('\t') AS
                    (type:chararray, user:chararray, price:int, timestamp:int);
 events_filtered = FILTER events BY type;
 by_user         = GROUP events_filtered BY user;
 price_by_user   = FOREACH by_user GENERATE type, SUM(price) AS total_price,
                                            MAX(timestamp) as max_ts;
 high_pbu        = FILTER price_by_user BY total_price > 1000;




        Job 1 : Mapper                                              Job 1 : Reducer1
   LOAD                         FILTER                      GROUP      FOREACH         FILTER
                                          Shuffle and 

                                          sort by user




                                                                                 * VAT excluded


Dataiku - Innovation Services                                              4/14/13                15
Pig & Hive

 Mapping to Mapreduce jobs
 events          = LOAD '/events' USING PigStorage('\t') AS
                    (type:chararray, user:chararray, price:int, timestamp:int);
 events_filtered = FILTER events BY type;
 by_user         = GROUP events_filtered BY user;
 price_by_user   = FOREACH by_user GENERATE type, SUM(price) AS total_price,
                                            MAX(timestamp) as max_ts;
 high_pbu        = FILTER price_by_user BY total_price > 1000;
 recent_high     = ORDER high_pbu BY max_ts DESC;
 STORE recent_high INTO '/output';


         Job 1: Mapper                                              Job 1 :Reducer
   LOAD                         FILTER                      GROUP      FOREACH        FILTER
                                          Shuffle and 

                                          sort by user


         Job 2: Mapper                                              Job 2: Reducer
                LOAD

                                           Shuffle and 
                 STORE
             (from tmp)
                                         sort by max_ts


Dataiku - Innovation Services                                              4/14/13             16
Pig 

 How does it work
      Data Execution Plan compiled into 10
      map reduce jobs executed in parallel
      (or not)
               TResolution = LOAD '$PREFIX/dwh_dim_external_tracking_resolution/dt=$DAY' USING PigStorage('\u0001');
               TResolution = FOREACH TResolution GENERATE $0 AS SKResolutionId, $1 as ResolutionId;

               TSiteMap = LOAD '$PREFIX/dwh_dim_sitemapnode/dt=$DAY' USING PigStorage('\u0001');
               TSiteMap = FOREACH TSiteMap GENERATE $0 AS SKSimteMapNodeId, $2 as SiteMapNodeId;

               TCustomer = LOAD '$PREFIX/customer_relation/dt=$DAY' USING PigStorage('\u0001')
               as (SKCustomerId:chararray,
               CustomerId:chararray);

               F1 = FOREACH F1 GENERATE *, (date_time IS NOT NULL ? CustomFormatToISO(date_time, 'yyyy-MM-dd HH:mm:ss'

               F2 = FOREACH F1 GENERATE *,
               CONCAT(CONCAT(CONCAT(CONCAT(visid_high,'-'), visid_low), '-'), visit_num) as VisitId,
               (referrer matches '.*cdiscount.com.*' OR referrer matches 'cdscdn.com' ? NULL : referrer) as Referrer,
               (iso IS NOT NULL ? ISODaysBetween(iso, '1899-12-31T00:00:00') : NULL) AS SkDateId,
               (iso IS NOT NULL ? ISOSecondsBetween(iso, ISOToDay(iso)) : NULL) AS SkTimeId,
               ((event_list is not null and event_list matches '.*\\b202\\b.*') ? 'Y' : 'N') as is_202,
               ((event_list is not null and event_list matches '.*\\b10\\b.*') ? 'Y' : 'N') as is_10,
               ((event_list is not null and event_list matches '.*\\b12\\b.*') ? 'Y' : 'N') as is_12,
               ((event_list is not null and event_list matches '.*\\b13\\b.*') ? 'Y' : 'N') as is_13,
               ((event_list is not null and event_list matches '.*\\b14\\b.*') ? 'Y' : 'N') as is_14,
               ((event_list is not null and event_list matches '.*\\b11\\b.*') ? 'Y' : 'N') as is_11,
               ((event_list is not null and event_list matches '.*\\b1\\b.*') ? 'Y' : 'N') as is_1,
               REGEX_EXTRACT(pagename, 'F-(.*):.*', 1) AS ProductReferenceId,
               NULL AS OriginFile;

               SET DEFAULT_PARALLEL 24;

               F3 = JOIN F2 BY post_search_engine LEFT, TSearchEngine BY SearchEngineId USING 'replicated' PARALLEL 20;
               F3 = FOREACH F3 GENERATE *, (SKSearchEngineId IS NULL ? '-1' : SKSearchEngineId) as SKSearchEngineId;
               --F3 = FOREACH F2 GENERATE *, NULL AS SKSearchEngineId, NULL AS SearchEngineId;

               F4 = JOIN F3 BY browser LEFT, TBrowser BY BrowserId USING 'replicated' PARALLEL 20;
               F4 = FOREACH F4 GENERATE *, (SKBrowserId IS NULL ? '-1' : SKBrowserId) as SKBrowserId;

               --F4 = FOREACH F3 GENERATE *, NULL AS SKBrowserId, NULL AS BrowserId;

               F5 = JOIN F4 BY os LEFT, TOperatingSystem BY OperatingSystemId USING 'replicated' PARALLEL 20;
               F5 = FOREACH F5 GENERATE *, (SKOperatingSystemId IS NULL ? '-1' : SKOperatingSystemId) as SKOperatingSystemId;

               --F5 = FOREACH F4 GENERATE *, NULL AS SKOperatingSystemId, NULL AS OperatingSystemId;

               F6 = JOIN F5 BY resolution LEFT, TResolution BY ResolutionId USING 'replicated' PARALLEL 20;
               F6 = FOREACH F6 GENERATE *, (SKResolutionId IS NULL ? '-1' : SKResolutionId) as SKResolutionId;

               --F6 = FOREACH F5 GENERATE *, NULL AS SKResolutionId, NULL AS ResolutionId;

               F7 = JOIN F6 BY post_evar4 LEFT, TSiteMap BY SiteMapNodeId USING 'replicated' PARALLEL 20;
               F7 = FOREACH F7 GENERATE *, (SKSimteMapNodeId IS NULL ? '-1' : SKSimteMapNodeId) as SKSimteMapNodeId;

               --F7 = FOREACH F6 GENERATE *, NULL AS SKSimteMapNodeId, NULL AS SiteMapNodeId;

               SPLIT F7 INTO WITHOUT_CUSTOMER IF post_evar30 IS NULL, WITH_CUSTOMER IF post_evar30 IS NOT NULL;

               F8 = JOIN WITH_CUSTOMER BY post_evar30 LEFT, TCustomer BY CustomerId USING 'skewed' PARALLEL 20;
               WITHOUT_CUSTOMER = FOREACH WITHOUT_CUSTOMER GENERATE *, NULL as SKCustomerId, NULL as CustomerId;

               --F8_UNION = FOREACH F7 GENERATE *, NULL as SKCustomerId, NULL as CustomerId;
               F8_UNION = UNION F8, WITHOUT_CUSTOMER;
               --DESCRIBE F8;
               --DESCRIBE WITHOUT_CUSTOMER;
               --DESCRIBE F8_UNION;

               F9 = FOREACH F8_UNION GENERATE
               visid_high,
               visid_low,
               VisitId,
               post_evar30,
               SKCustomerId,
               visit_num,
               SkDateId,
               SkTimeId,
               post_evar16,
               post_evar52,
               visit_page_num,
               is_202,
               is_10,
               is_12,
Dataiku - Pig, Hive and Cascading
Cascading

 From Code To Jobs




Dataiku - Pig, Hive and Cascading
Hive Joins

     How to join with MapReduce ?

 [Diagram: reduce-side join. Each mapper tags its rows with a table index,
 emitting (uid, tbl_idx, name or type). The rows are shuffled by uid and sorted
 by (uid, tbl_idx), so each reducer sees the "name" row first, then the matching
 "type" rows, and emits the joined records:
   uid=1 -> (1, Dupont, Type1), (1, Dupont, Type2)   on Reducer 1
   uid=2 -> (2, Durand, Type1)                       on Reducer 2 ]


    Dataiku - Innovation Services                                             4/14/13                    19
Agenda


                                }    Hadoop and Context
                                }    Pig, Hive, Cascading, …
                                }    How they work
                                }    Comparing the tools




Dataiku - Pig, Hive and Cascading
Comparing without Comparable	
  

 }    Philosophy
       ◦  Procedural Vs Declarative
       ◦  Data Model and Schema
 }    Productivity
       ◦  Headachability
       ◦  Checkpointing
       ◦  Testing and environment
 }    Integration
       ◦  Partitioning
       ◦  Formats Integration
       ◦  External Code Integration
 }    Performance and optimization

Dataiku - Pig, Hive and Cascading
Procedural Vs Declarative

           }    Transformation as a sequence of operations (Pig)
           }    Transformation as a set of formulas (Hive / SQL)

Pig:

Users                = load 'users' as (name, age, ipaddr);
Clicks               = load 'clicks' as (user, url, value);
ValuableClicks       = filter Clicks by value > 0;
UserClicks           = join Users by name, ValuableClicks by user;
Geoinfo              = load 'geoinfo' as (ipaddr, dma);
UserGeo              = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA                = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

SQL:

insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
) using ipaddr
group by dma;




                                                                                                                           Dataiku - Pig, Hive and Cascading
Data type and Model

 Rationale
 }    All three extend the basic scalar data model with richer
       data types
       ◦  array-like [ event1, event2, event3]
       ◦  map-like { type1:value1, type2:value2, …}

 }    Different approach
       ◦  Resilient Schema
       ◦  Static Typing
       ◦  No Static Typing




Dataiku - Pig, Hive and Cascading
Hive

 Data Type and Schema
  CREATE TABLE visit (
            user_name             STRING,
            user_id              INT,
            user_details         STRUCT<age:INT, zipcode:INT>
  );

                      Simple type                                      Details
  TINYINT, SMALLINT, INT, BIGINT                       1, 2, 4 and 8 bytes
  FLOAT, DOUBLE                                        4 and 8 bytes
  BOOLEAN

  STRING                                               Arbitrary-length, replaces VARCHAR
  TIMESTAMP

                    Complex type                                       Details
  ARRAY                                                Array of typed items (0-indexed)
  MAP                                                  Associative map
  STRUCT                                               Complex class-like objects

Dataiku Training – Hadoop for Data Science                                       4/14/13    24
Data types and Schema

 Pig

 rel = LOAD '/folder/path/'
      USING PigStorage('\t')
      AS (col:type, col:type, col:type);


         Simple type                                             Details
  int, long, float,                   32 and 64 bits, signed
  double
  chararray                           A string
  bytearray                           An array of … bytes
  boolean                             A boolean

        Complex type                                             Details
  tuple                               a tuple is an ordered set of fields
  bag                                 a bag is a set of tuples


Dataiku Training – Hadoop for Data Science                                 4/14/13   25
Data Type and Schema 

 Cascading
 }    Support for any Java type, provided it can be
       serialized by Hadoop
 }    No static typing
         Simple type                                         Details
  Int, Long, Float,                 32 and 64 bits, signed
  Double
  String                            A string
  byte[]                            An array of … bytes
  Boolean                           A boolean

        Complex type                                         Details
  Object                            Object must be « Hadoop serializable »
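
 A minimal sketch of this model (assuming Cascading 2.x): values travel as
 plain Java objects inside tuples, addressed by field name rather than by a
 declared schema.

 import cascading.tuple.Fields;
 import cascading.tuple.Tuple;
 import cascading.tuple.TupleEntry;

 // Fields name the positions; values are arbitrary Java objects, as long as
 // Hadoop can serialize them between the map and reduce stages
 Fields visit = new Fields("user_name", "user_id", "age");
 TupleEntry entry = new TupleEntry(visit, new Tuple("Dupont", 42, 27));

 String name = entry.getString("user_name");   // "Dupont"
 int userId  = entry.getInteger("user_id");    // 42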




Dataiku - Pig, Hive and Cascading
Style Summary


                         Style           Typing                              Data Model                          Metadata store

Pig                      Procedural      Static + Dynamic                    scalar + tuple + bag                No (HCatalog)
                                                                             (fully recursive)
Hive                     Declarative     Static + Dynamic, enforced          scalar + list + map                 Integrated
                                         at execution time
Cascading                Procedural      Weak                                scalar + java objects               No



Dataiku - Pig, Hive and Cascading
Comparing without Comparable	
  

 }    Philosophy
       ◦  Procedural Vs Declarative
       ◦  Data Model and Schema
 }    Productivity
       ◦  Headachability
       ◦  Checkpointing
       ◦  Testing, error management and environment
 }    Integration
       ◦  Partitioning
       ◦  Formats Integration
       ◦  External Code Integration
 }    Performance and optimization

Dataiku - Pig, Hive and Cascading
Headachability

Motivation
}    Does debugging
      the tool lead to bad
      headaches ?




                             Dataiku - Pig, Hive and Cascading
Headaches

Pig
}    Out Of Memory Error (Reducer)

}    Exception in Building /
      Extended Functions 

      (handling of null)

}    Null vs “”

}    Nested Foreach and scoping

}    Date Management (pig 0.10)

}    Field implicit ordering




                                      Dataiku - Pig, Hive and Cascading
A Pig Error




Dataiku - Pig, Hive and Cascading
Headaches

Hive
}    Out of Memory Errors in
      Reducers

}    Few Debugging Options

}    Null / “”

}    No builtin “first”




                                Dataiku - Pig, Hive and Cascading
Headaches

Cascading
}    Weak Typing Errors (comparing
      Int and String … )

}    Illegal Operation Sequence
      (Group after group …)

}    Field Implicit Ordering




                                      Dataiku - Pig, Hive and Cascading
Testing

 Motivation
 }    How to perform unit tests ?
 }    How to have different versions of the same script
       (parameter) ?




Dataiku - Pig, Hive and Cascading
Testing

 Pig
 }    System Variables
 }    Comment to test
 }    No Meta Programming
 }    pig -x local to execute on local files




Dataiku - Pig, Hive and Cascading
Testing / Environment 

 Cascading
 }    Junit Tests are possible
 }    Ability to use code to actually comment out some
       variables
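
 A minimal JUnit sketch, assuming Cascading 2.x's CascadingTestCase helpers;
 UpperCaseFunction is a hypothetical custom Function under test.

 import cascading.CascadingTestCase;
 import cascading.tuple.Fields;
 import cascading.tuple.Tuple;
 import cascading.tuple.TupleListCollector;

 public class UpperCaseFunctionTest extends CascadingTestCase {
   public void testUpperCase() {
     // run the function on a single in-memory tuple, no cluster needed
     TupleListCollector collector = invokeFunction(
         new UpperCaseFunction(new Fields("upper")),   // hypothetical Function
         new Tuple("devoxx"),
         new Fields("upper"));
     assertEquals(new Tuple("DEVOXX"), collector.iterator().next());
   }
 }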




Dataiku - Pig, Hive and Cascading
Checkpointing 

 Motivation
 }    Lots of iteration while developing on Hadoop
 }    Sometimes jobs fail
 }    Sometimes need to restart from the start …




 Parse Logs                  Per Page Stats    Page User Correlation   Filtering   Output




                                              FIX and relaunch



Dataiku - Pig, Hive and Cascading
Pig

 Manual Checkpointing
 }    STORE Command to manually 

       store files




 Parse Logs                  Per Page Stats   Page User Correlation   Filtering   Output




               // COMMENT Beginning
               of script and relaunch

Dataiku - Pig, Hive and Cascading
Cascading 

Automated Checkpointing
}    Ability to re-run a
      flow automatically
      from the last saved
      checkpoint




        addCheckpoint(…)	
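
 A hedged sketch of what addCheckpoint(…) refers to, assuming Cascading 2.x's
 Checkpoint pipe; the upstream pipe and the checkpoint tap are hypothetical.

 import cascading.flow.FlowDef;
 import cascading.pipe.Checkpoint;
 import cascading.pipe.Pipe;

 Pipe stats = new Pipe("per_page_stats", parseLogs);  // parseLogs: upstream pipe (hypothetical)
 // materialize the intermediate result; a restarted flow can resume from here
 Checkpoint ckpt = new Checkpoint("stats_ckpt", stats);
 Pipe filtering = new Pipe("filtering", ckpt);

 FlowDef flowDef = FlowDef.flowDef()
     .addCheckpoint(ckpt, checkpointTap)  // checkpointTap: hypothetical Hfs tap
     .setRunID("page-user-flow");         // a stable run id lets a re-run restart from saved checkpoints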
  



                               Dataiku - Pig, Hive and Cascading
Cascading 

Topological Scheduler
}  Check each intermediate file's timestamp
}  Execute only if more recent




Parse Logs   Per Page Stats   Page User Correlation         Filtering     Output




                                      Dataiku - Pig, Hive and Cascading
Productivity Summary

                         Headaches                 Checkpointing / Replay      Testing / Metaprogramming

Pig                      Lots                      Manual Save                 Difficult

Hive                     Few, but without          None (That's SQL)           None (That's SQL)
                         debugging options

Cascading                Weak Typing               Checkpointing,              Possible
                         Complexity                Partial Updates




 Dataiku - Pig, Hive and Cascading
Comparing without Comparable	
  

 }    Philosophy
       ◦  Procedural Vs Declarative
       ◦  Data Model and Schema
 }    Productivity
       ◦  Headachability
       ◦  Checkpointing
       ◦  Testing and environment
 }    Integration
       ◦  Formats Integration
       ◦  Partitioning
       ◦  External Code Integration
 }    Performance and optimization

Dataiku - Pig, Hive and Cascading
Formats Integration

 Motivation
 }    Ability to integrate different file formats
       ◦  Text Delimited
       ◦  Sequence File (Binary Hadoop format)
       ◦  Avro, Thrift ..
 }    Ability to integrate with external data sources or
       sinks (MongoDB, ElasticSearch, databases, …)

       Format impact on size and performance

  Format                            Size on Disk (GB)   HIVE Processing time (24 cores)

  Text File, uncompressed           18.7                1m32s

  1 Text File, Gzipped              3.89                6m23s
                                                        (no parallelization)

  JSON compressed                   7.89                2m42s

  Multiple text files, Gzipped      4.02                43s

  Sequence File, Block, Gzip        5.32                1m18s

  Text File, LZO Indexed            7.03                1m22s


Dataiku - Pig, Hive and Cascading
Format Integration



 }    Hive: SerDe (Serializer-Deserializer)
 }    Pig : Storage
 }    Cascading: Tap
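
 For Cascading, a Tap pairs a storage location with a Scheme describing the
 format. A minimal sketch (assuming Cascading 2.x; paths and fields are
 hypothetical), roughly the counterpart of a Pig Storage or a Hive SerDe:

 import cascading.scheme.hadoop.SequenceFile;
 import cascading.scheme.hadoop.TextDelimited;
 import cascading.tap.Tap;
 import cascading.tap.hadoop.Hfs;
 import cascading.tuple.Fields;

 Fields eventFields = new Fields("type", "user", "price", "timestamp");

 // tab-delimited text source
 Tap textEvents   = new Hfs(new TextDelimited(eventFields, "\t"), "/events/text");
 // the same records stored as a binary Hadoop SequenceFile
 Tap binaryEvents = new Hfs(new SequenceFile(eventFields), "/events/seq");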




Dataiku - Pig, Hive and Cascading
Partitions

 Motivation
 }    No support for “UPDATE” patterns; incremental loads are
       performed by adding or deleting partitions
 }    Common partition schemas on Hadoop
       ◦    By Date /apache_logs/dt=2013-01-23
       ◦    By Data center /apache_logs/dc=redbus01/…
       ◦    By Country
       ◦    …
       ◦    Or any combination of the above




Dataiku - Pig, Hive and Cascading
Hive Partitioning

 Partitioned tables
 CREATE TABLE event (
      user_id INT,
      type STRING,
      message STRING)
 PARTITIONED BY (day STRING, server_id STRING);
Disk structure

/hive/event/day=2013-01-27/server_id=s1/file0
/hive/event/day=2013-01-27/server_id=s1/file1
/hive/event/day=2013-01-27/server_id=s2/file0
/hive/event/day=2013-01-27/server_id=s2/file1
…
/hive/event/day=2013-01-28/server_id=s2/file0
/hive/event/day=2013-01-28/server_id=s2/file1


INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27', server_id='s1')
SELECT * FROM event_tmp;
Dataiku Training – Hadoop for Data Science                           4/14/13      46
Cascading Partition

 }    No Direct support for partition
 }    Support for a “Glob” Tap, to read from multiple files
       using path patterns



 }    è You can code your own custom or virtual
       partition schemes
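
 A minimal sketch of such a "Glob" tap, assuming Cascading 2.x's GlobHfs; the
 path layout and fields are hypothetical.

 import cascading.scheme.hadoop.TextDelimited;
 import cascading.tap.Tap;
 import cascading.tap.hadoop.GlobHfs;
 import cascading.tuple.Fields;

 // read every January 2013 day partition with a single glob pattern
 Tap logs = new GlobHfs(
     new TextDelimited(new Fields("ip", "page", "referrer"), "\t"),
     "/apache_logs/dt=2013-01-*");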




Dataiku - Pig, Hive and Cascading
External Code Integration

 Simple UDF
               Pig                  Hive




                                     Cascading
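
 The original code screenshots are not reproduced in this transcript. As a
 hedged illustration, a simple "upper case" UDF is a plain Java class in both
 Pig and Hive (class and function names here are hypothetical):

 // Upper.java -- Pig: extend EvalFunc, REGISTER the jar, then
 //   upper_words = FOREACH words GENERATE Upper(word);
 import java.io.IOException;
 import org.apache.pig.EvalFunc;
 import org.apache.pig.data.Tuple;

 public class Upper extends EvalFunc<String> {
   @Override
   public String exec(Tuple input) throws IOException {
     if (input == null || input.size() == 0 || input.get(0) == null)
       return null;
     return ((String) input.get(0)).toUpperCase();
   }
 }

 // UpperUDF.java -- Hive: extend UDF; after ADD JAR and
 //   CREATE TEMPORARY FUNCTION upper_udf AS '...UpperUDF';
 //   SELECT upper_udf(word) FROM wordcounts;
 import org.apache.hadoop.hive.ql.exec.UDF;
 import org.apache.hadoop.io.Text;

 public final class UpperUDF extends UDF {
   public Text evaluate(Text s) {
     return s == null ? null : new Text(s.toString().toUpperCase());
   }
 }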




Dataiku - Pig, Hive and Cascading
Hive Complex UDF

 (Aggregators)
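
 The original screenshot is not reproduced here; a hedged sketch of a custom
 aggregator using Hive's classic UDAF API (computing a mean; class names are
 hypothetical):

 // MeanUDAF.java
 import org.apache.hadoop.hive.ql.exec.UDAF;
 import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

 public final class MeanUDAF extends UDAF {

   public static class State {
     public double sum;
     public long count;
   }

   public static class MeanEvaluator implements UDAFEvaluator {
     private final State state = new State();

     public void init() { state.sum = 0; state.count = 0; }

     // called once per input row on the map side
     public boolean iterate(Double value) {
       if (value != null) { state.sum += value; state.count++; }
       return true;
     }

     // partial aggregate shipped from the mappers (acts like a combiner)
     public State terminatePartial() { return state; }

     // merge a partial aggregate coming from another task
     public boolean merge(State other) {
       if (other != null) { state.sum += other.sum; state.count += other.count; }
       return true;
     }

     // final result on the reduce side
     public Double terminate() {
       return state.count == 0 ? null : state.sum / state.count;
     }
   }
 }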




Dataiku - Pig, Hive and Cascading
Cascading 

 Direct Code Evaluation
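
 The original screenshot is not reproduced here; as a hedged sketch, Cascading
 (2.x) can compile a Java expression string directly into an operation, e.g.
 with ExpressionFunction (field names hypothetical):

 import cascading.operation.expression.ExpressionFunction;
 import cascading.pipe.Each;
 import cascading.pipe.Pipe;
 import cascading.tuple.Fields;

 Pipe events = new Pipe("events");
 // evaluate the Java expression against the incoming "price" field
 ExpressionFunction withVat =
     new ExpressionFunction(new Fields("price_incl_vat"), "price * 1.2", Double.class);
 events = new Each(events, new Fields("price"), withVat, Fields.ALL);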




Dataiku - Pig, Hive and Cascading
Integration

  Summary

                               Partition /                External Code                     Format Integration
                               Incremental Updates

Pig                            No Direct Support          Simple                            Doable and rich community

Hive                           Fully integrated,          Very simple, but complex          Doable and existing community
                               SQL Like                   dev setup

Cascading                      With Coding                Complex UDFs, but regular,        Doable and growing community
                                                          and Java expressions
                                                          embeddable




 Dataiku - Pig, Hive and Cascading
Comparing without Comparable	
  

 }    Philosophy
       ◦  Procedural Vs Declarative
       ◦  Data Model and Schema
 }    Productivity
       ◦  Headachability
       ◦  Checkpointing
       ◦  Testing and environment
 }    Integration
       ◦  Formats Integration
       ◦  Partitioning
       ◦  External Code Integration
 }    Performance and optimization

Dataiku - Pig, Hive and Cascading
Optimization

 }    Several Common Map Reduce Optimization
       Patterns
       ◦    Combiners
       ◦    MapJoin
       ◦    Job Fusion
       ◦    Job Parallelism
       ◦    Reducer Parallelism
 }    Different support per framework
       ◦  Fully Automatic
       ◦  Pragma / Directives / Options
       ◦  Coding style / Code to write




Dataiku - Pig, Hive and Cascading
Combiner

         Perform Partial Aggregate at Mapper Stage

         SELECT date, COUNT(*) FROM product GROUP BY date

         [Diagram: without a combiner, every (date, product_id) input row from the
         mappers is shuffled to the reducers, which compute the final per-date
         counts (2012-02-14: 20, 2012-02-15: 35, 2012-02-16: 1).]




        Dataiku - Pig, Hive and Cascading
Combiner

         Perform Partial Aggregate at Mapper Stage

         SELECT date, COUNT(*) FROM product GROUP BY date

         [Diagram: with a combiner, each mapper first emits partial per-date counts
         (e.g. 2012-02-14: 8, 2012-02-15: 12), and the reducers only merge these
         partial counts into the same final result.]

                              Reduced network bandwidth. Better parallelism


        Dataiku - Pig, Hive and Cascading
Join Optimization

 Map Join

                                    Hive
                                    set hive.auto.convert.join = true;
                                    Pig




                                    Cascading




                                           ( no aggregation support after HashJoin)
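
 The Pig and Cascading screenshots are not reproduced here (the Pig one is
 presumably a join USING 'replicated', as in the script earlier). A hedged
 sketch of the Cascading side: HashJoin keeps the small pipe in memory on every
 mapper, like a Hive map join (field names are hypothetical).

 import cascading.pipe.HashJoin;
 import cascading.pipe.Pipe;
 import cascading.pipe.joiner.InnerJoin;
 import cascading.tuple.Fields;

 Pipe clicks  = new Pipe("clicks");    // large side, streamed
 Pipe geoinfo = new Pipe("geoinfo");   // small side, loaded in memory on every mapper
 Pipe joined  = new HashJoin(
     clicks, new Fields("ipaddr"),
     geoinfo, new Fields("ipaddr"),
     new Fields("ipaddr", "user", "url", "ipaddr_geo", "dma"),
     new InnerJoin());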




Dataiku - Pig, Hive and Cascading
Number of Reducers

 }    Critical for performance


 }    Estimated from the size of the input files
        ◦  Hive
           –  divide input size by hive.exec.reducers.bytes.per.reducer (default 1GB)
        ◦  Pig
           –  divide input size by pig.exec.reducers.bytes.per.reducer (default 1GB)
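
 Cascading makes no such estimate (the summary below lists it as DIY); a hedged
 sketch of forcing the reducer count through standard Hadoop configuration when
 building the flow connector:

 import java.util.Properties;
 import cascading.flow.hadoop.HadoopFlowConnector;

 Properties props = new Properties();
 props.setProperty("mapred.reduce.tasks", "24");  // standard Hadoop (MR1) property
 HadoopFlowConnector connector = new HadoopFlowConnector(props);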




Dataiku - Pig, Hive and Cascading
Performance & Optimization 

 Summary


                                Combiner             Join                      Number of reducers
                                Optimization         Optimization              optimization

Pig                             Automatic            Option                    Estimate or DIY
Cascading                       DIY                  HashJoin                  DIY
Hive                            Partial / DIY        Automatic (Map Join)      Estimate or DIY




Dataiku - Pig, Hive and Cascading
Agenda


                                }    Hadoop and Context (->0:03)
                                }    Pig, Hive, Cascading, … (->0:06)
                                }    How they work (->0:09)
                                }    Comparing the tools (->0:25)
                                }    Wrap-up and questions (->0:30)




Dataiku - Pig, Hive and Cascading
}    Want to keep close to SQL ?
       ◦  Hive
 }    Want to write large flows ?
       ◦  Pig
 }    Want to integrate in large-scale programming
       projects ?
       ◦  Cascading (cascalog / scalding)




Dataiku - Pig, Hive and Cascading

Applied Data Science Course Part 1: Concepts & your first ML modelDataiku
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...Dataiku
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) Dataiku
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products Dataiku
 
The US Healthcare Industry
The US Healthcare IndustryThe US Healthcare Industry
The US Healthcare IndustryDataiku
 
How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ? Dataiku
 
Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Dataiku
 
04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku 04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku Dataiku
 
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Dataiku
 
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku
 
OWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuOWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuDataiku
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku
 
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Dataiku
 
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku   big data paris - the rise of the hadoop ecosystemDataiku   big data paris - the rise of the hadoop ecosystem
Dataiku big data paris - the rise of the hadoop ecosystemDataiku
 
Dataiku - google cloud platform roadshow - october 2013
Dataiku  - google cloud platform roadshow - october 2013Dataiku  - google cloud platform roadshow - october 2013
Dataiku - google cloud platform roadshow - october 2013Dataiku
 
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013Dataiku
 
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16thDataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16thDataiku
 
Data Disruption for Insurance - Perspective from th
Data Disruption for Insurance - Perspective from thData Disruption for Insurance - Perspective from th
Data Disruption for Insurance - Perspective from thDataiku
 

Mais de Dataiku (20)

Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
 
Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...
 
Applied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelApplied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML model
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products
 
The US Healthcare Industry
The US Healthcare IndustryThe US Healthcare Industry
The US Healthcare Industry
 
How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ?
 
Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem
 
04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku 04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku
 
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
 
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
 
OWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuOWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - Dataiku
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
 
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
 
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku   big data paris - the rise of the hadoop ecosystemDataiku   big data paris - the rise of the hadoop ecosystem
Dataiku big data paris - the rise of the hadoop ecosystem
 
Dataiku - google cloud platform roadshow - october 2013
Dataiku  - google cloud platform roadshow - october 2013Dataiku  - google cloud platform roadshow - october 2013
Dataiku - google cloud platform roadshow - october 2013
 
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
 
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16thDataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16th
 
Data Disruption for Insurance - Perspective from th
Data Disruption for Insurance - Perspective from thData Disruption for Insurance - Perspective from th
Data Disruption for Insurance - Perspective from th
 

Dataiku pig - hive - cascading

  • 1. Pig Hive Cascading Hadoop In Practice }  Devoxx 2013 }  Florian Douetteau
  • 2. About me Florian Douetteau <florian.douetteau@dataiku.com> }  CEO at Dataiku }  Freelance at Criteo (Online Ads) }  CTO at IsCool Ent. (#1 French Social Gamer) }  VP R&D Exalead (Search Engine Technology) Dataiku Training – Hadoop for Data Science 4/14/13 2
  • 3. Agenda }  Hadoop and Context (->0:03) }  Pig, Hive, Cascading, … (->0:06) }  How they work (->0:09) }  Comparing the tools (->0:25) }  Wrap’up and question (->0:) Dataiku - Pig, Hive and Cascading
  • 4. CHOOSE TECHNOLOGY NoSQL-Slavia! Scalability Central! Machine Learning ! Mystery Land! Elastic Search Hadoop Scikit-Learn SOLR Ceph MongoDB Cassandra Sphere Mahout WEKA Riak MLBase LibSVM Membase Spark SQL Colunnar Republic! InfiniDB SAS RapidMiner R Vertica SPSS Panda GreenPlum QlickView Pig Impala Tableau Statistician Old ! Netezza SpotFire Cascading Talend House! HTML5/D3 Vizualization County! Data Clean Wasteland! Dataiku - Pig, Hive and Cascading
  • 5. How do I (pre)process data? Implicit User Data (Views, Searches…) Online User Information Transformation 500TB Predictor Transformation Matrix Explicit User Data Predictor Runtime (Click, Buy, …) Per User Stats Rank Predictor 50TB Per Content Stats User Information (Location, Graph…) User Similarity 1TB Content Data (Title, Categories, Price, …) 200GB Content Similarity A/B Test Data Dataiku - Pig, Hive and Cascading
• 6. Typical Use Case 1
 Web Analytics Processing }  Analyse Raw Logs (Trackers, Web Logs) }  Extract IP, Page, … }  Detect and remove robots }  Build Statistics ◦  Number of page views, per product ◦  Best referrers ◦  Traffic Analysis ◦  Funnel ◦  SEO Analysis ◦  … Dataiku - Pig, Hive and Cascading
• 7. Typical Use Case 2
 Mining Search Logs for Synonyms }  Extract Query Logs }  Perform query normalization }  Compute Ngrams }  Compute Search “Sessions” }  Compute Log-Likelihood Ratio for ngrams across sessions Dataiku - Pig, Hive and Cascading
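The log-likelihood ratio mentioned above is usually Dunning's G² statistic over a 2x2 contingency table (how often two ngrams co-occur in a session versus occur separately). A minimal sketch in Java, assuming the four cell counts have already been computed upstream; the class and method names are illustrative, not from the deck:

  public final class Llr {
      // G^2 = 2 * sum over cells of O * ln(O / E), where E = rowTotal * colTotal / N.
      // k11: sessions containing both ngrams, k12/k21: sessions with only one, k22: sessions with neither.
      public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
          double n = k11 + k12 + k21 + k22;
          return 2.0 * (cell(k11, (double) (k11 + k12) * (k11 + k21) / n)
                      + cell(k12, (double) (k11 + k12) * (k12 + k22) / n)
                      + cell(k21, (double) (k21 + k22) * (k11 + k21) / n)
                      + cell(k22, (double) (k21 + k22) * (k12 + k22) / n));
      }

      private static double cell(long observed, double expected) {
          // the O * ln(O/E) term; by convention the limit is 0 when O is 0
          return observed == 0 ? 0.0 : observed * Math.log(observed / expected);
      }

      public static void main(String[] args) {
          // two ngrams that co-occur far more often than chance get a large score
          System.out.println(logLikelihoodRatio(100, 10, 10, 10000));
      }
  }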
  • 8. Typical Use Case 3
 Product Recommender }  Compute User – Product Association Matrix }  Compute different similarities ratio (Ochiai, Cosine, …) }  Filter out bad predictions }  For each user, select best recommendable products Dataiku - Pig, Hive and Cascading
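The similarity ratios named on this slide are simple set-overlap measures on the user-product association matrix. A hedged sketch, assuming each product is represented by the set of users who interacted with it (names are illustrative):

  import java.util.HashSet;
  import java.util.Set;

  public final class ProductSimilarity {
      // Ochiai coefficient (cosine similarity on binary vectors): |A ∩ B| / sqrt(|A| * |B|)
      public static double ochiai(Set<String> usersOfA, Set<String> usersOfB) {
          if (usersOfA.isEmpty() || usersOfB.isEmpty()) return 0.0;
          Set<String> intersection = new HashSet<>(usersOfA);
          intersection.retainAll(usersOfB);
          return intersection.size() / Math.sqrt((double) usersOfA.size() * usersOfB.size());
      }
  }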
  • 9. Agenda }  Hadoop and Context }  Pig, Hive, Cascading, … }  How they work }  Comparing the tools Dataiku - Pig, Hive and Cascading
• 10. Pig History }  Yahoo Research in 2006 }  Inspired from Sawzall, a Google Paper from 2003 }  2007 as an Apache Project }  Initial motivation ◦  Search Log Analytics: how long is the average user session? how many links does a user click on before leaving a website? how do click patterns vary in the course of a day/week/month? …
  words = LOAD '/training/hadoop-wordcount/output' USING PigStorage('\t')
          AS (word:chararray, count:int);
  sorted_words = ORDER words BY count DESC;
  first_words = LIMIT sorted_words 10;
  DUMP first_words;
Dataiku - Pig, Hive and Cascading
• 11. Hive History }  Developed by Facebook in January 2007 }  Open source in August 2008 }  Initial Motivation ◦  Provide a SQL-like abstraction to perform statistics on status updates
  create external table wordcounts (
    word string,
    count int
  ) row format delimited fields terminated by '\t'
  location '/training/hadoop-wordcount/output';
  select * from wordcounts order by count desc limit 10;
  select SUM(count) from wordcounts where word like 'th%';
Dataiku - Pig, Hive and Cascading
• 12. Cascading History }  Authored by Chris Wensel 2008 }  Associated Projects ◦  Cascalog : Cascading in Clojure ◦  Scalding : Cascading in Scala (Twitter in 2012) ◦  Lingual (to be released soon): SQL layer on top of Cascading Dataiku - Pig, Hive and Cascading
  • 13. Agenda }  Hadoop and Context }  Pig, Hive, Cascading, … }  How they work }  Comparing the tools Dataiku - Pig, Hive and Cascading
  • 14. MapReduce
 Simplicity is a complexity Dataiku - Innovation Services 4/14/13 14
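The deck does not show the raw MapReduce code it abstracts away; for comparison with the one-line Pig and Hive word counts on the previous slides, here is roughly what the same job looks like written directly against the Hadoop Java API (a sketch, not the deck's code):

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer it = new StringTokenizer(value.toString());
        while (it.hasMoreTokens()) {          // emit (word, 1) for every token of the line
          word.set(it.nextToken());
          context.write(word, ONE);
        }
      }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();   // sum the counts of one word
        context.write(key, new IntWritable(sum));
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "wordcount");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }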
• 15. Pig & Hive
 Mapping to MapReduce jobs
  events = LOAD '/events' USING PigStorage('\t')
           AS (type:chararray, user:chararray, price:int, timestamp:int);
  events_filtered = FILTER events BY type;
  by_user = GROUP events_filtered BY user;
  price_by_user = FOREACH by_user GENERATE type, SUM(price) AS total_price, MAX(timestamp) AS max_ts;
  high_pbu = FILTER price_by_user BY total_price > 1000;
 [Diagram: Job 1 Mapper (LOAD, FILTER), shuffle and sort by user, Job 1 Reducer (GROUP, FOREACH, FILTER)] * VAT excluded Dataiku - Innovation Services 4/14/13 15
• 16. Pig & Hive
 Mapping to MapReduce jobs
  events = LOAD '/events' USING PigStorage('\t')
           AS (type:chararray, user:chararray, price:int, timestamp:int);
  events_filtered = FILTER events BY type;
  by_user = GROUP events_filtered BY user;
  price_by_user = FOREACH by_user GENERATE type, SUM(price) AS total_price, MAX(timestamp) AS max_ts;
  high_pbu = FILTER price_by_user BY total_price > 1000;
  recent_high = ORDER high_pbu BY max_ts DESC;
  STORE recent_high INTO '/output';
 [Diagram: Job 1 Mapper (LOAD, FILTER), shuffle and sort by user, Job 1 Reducer (GROUP, FOREACH, FILTER); Job 2 Mapper (LOAD from tmp), shuffle and sort by max_ts, Job 2 Reducer (STORE)] Dataiku - Innovation Services 4/14/13 16
• 17. Pig
 How does it work: Data Execution Plan compiled into 10 map reduce jobs executed in parallel (or not)
 [Screenshot: roughly 90 lines of a production Pig script, a series of LOADs, FOREACH … GENERATE projections, replicated and skewed JOINs against dimension tables, a SPLIT on customer presence and a final UNION, shown to illustrate the size of a real-world flow] Dataiku - Pig, Hive and Cascading
  • 18. Cascading
 From Code To Jobs Dataiku - Pig, Hive and Cascading
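The code on this slide is an image in the original deck; as a point of reference, a word count in Cascading 2.x looks roughly like the following (a sketch in the style of the standard Cascading tutorials; paths, pipe names and fields are illustrative):

  import cascading.flow.Flow;
  import cascading.flow.FlowDef;
  import cascading.flow.hadoop.HadoopFlowConnector;
  import cascading.operation.aggregator.Count;
  import cascading.operation.regex.RegexSplitGenerator;
  import cascading.pipe.Each;
  import cascading.pipe.Every;
  import cascading.pipe.GroupBy;
  import cascading.pipe.Pipe;
  import cascading.scheme.hadoop.TextDelimited;
  import cascading.scheme.hadoop.TextLine;
  import cascading.tap.Tap;
  import cascading.tap.hadoop.Hfs;
  import cascading.tuple.Fields;

  public class CascadingWordCount {
    public static void main(String[] args) {
      Tap docTap = new Hfs(new TextLine(new Fields("offset", "line")), args[0]);        // source
      Tap wcTap  = new Hfs(new TextDelimited(new Fields("word", "count"), "\t"), args[1]); // sink

      Pipe wcPipe = new Pipe("wordcount");
      // split each line into words
      wcPipe = new Each(wcPipe, new Fields("line"),
          new RegexSplitGenerator(new Fields("word"), "\\s+"));
      wcPipe = new GroupBy(wcPipe, new Fields("word"));               // shuffle by word
      wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL); // count per group

      FlowDef flowDef = FlowDef.flowDef().setName("wc")
          .addSource(wcPipe, docTap)
          .addTailSink(wcPipe, wcTap);
      Flow flow = new HadoopFlowConnector().connect(flowDef);
      flow.complete();  // the planner turns this pipe assembly into MapReduce jobs
    }
  }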
• 19. Hive Joins
 How to join with MapReduce ?
 [Diagram: each mapper emits its table's rows tagged with a table index (tbl_idx); rows are shuffled by uid and sorted by (uid, tbl_idx); each reducer then merges the rows sharing a uid to produce the joined (Uid, Name, Type) records] Dataiku - Innovation Services 4/14/13 19
  • 20. Agenda }  Hadoop and Context }  Pig, Hive, Cascading, … }  How they work }  Comparing the tools Dataiku - Pig, Hive and Cascading
• 21. Comparing without Comparable }  Philosophy ◦  Procedural Vs Declarative ◦  Data Model and Schema }  Productivity ◦  Headachability ◦  Checkpointing ◦  Testing and environment }  Integration ◦  Partitioning ◦  Formats Integration ◦  External Code Integration }  Performance and optimization Dataiku - Pig, Hive and Cascading
• 22. Procedural Vs Declarative }  Transformation as a sequence of operations }  Transformation as a set of formulas
  -- Pig (procedural)
  Users = load 'users' as (name, age, ipaddr);
  Clicks = load 'clicks' as (user, url, value);
  ValuableClicks = filter Clicks by value > 0;
  UserClicks = join Users by name, ValuableClicks by user;
  Geoinfo = load 'geoinfo' as (ipaddr, dma);
  UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
  ByDMA = group UserGeo by dma;
  ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
  store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

  -- SQL (declarative)
  insert into ValuableClicksPerDMA
  select dma, count(*)
  from geoinfo join (
    select name, ipaddr from users join clicks on (users.name = clicks.user)
    where value > 0
  ) using ipaddr
  group by dma;
Dataiku - Pig, Hive and Cascading
• 23. Data type and Model
 Rationale }  All three extend the basic data model with extended data types ◦  array-like [ event1, event2, event3 ] ◦  map-like { type1:value1, type2:value2, … } }  Different approaches ◦  Resilient Schema ◦  Static Typing ◦  No Static Typing Dataiku - Pig, Hive and Cascading
• 24. Hive
 Data Type and Schema
  CREATE TABLE visit (
    user_name    STRING,
    user_id      INT,
    user_details STRUCT<age:INT, zipcode:INT>
  );
 Simple types: TINYINT, SMALLINT, INT, BIGINT (1, 2, 4 and 8 bytes); FLOAT, DOUBLE (4 and 8 bytes); BOOLEAN; STRING (arbitrary length, replaces VARCHAR); TIMESTAMP
 Complex types: ARRAY (array of typed items, 0-indexed); MAP (associative map); STRUCT (complex class-like objects) Dataiku Training – Hadoop for Data Science 4/14/13 24
• 25. Data types and Schema
 Pig
  rel = LOAD '/folder/path/' USING PigStorage('\t') AS (col:type, col:type, col:type);
 Simple types: int, long, float, double (32- and 64-bit, signed); chararray (a string); bytearray (an array of … bytes); boolean
 Complex types: tuple (an ordered set of fields); map (a set of fieldname:value pairs); bag (a set of tuples) Dataiku Training – Hadoop for Data Science 4/14/13 25
• 26. Data Type and Schema
 Cascading }  Support for any Java type, provided it can be serialized in Hadoop }  No support for typing
 Simple types: Int, Long, Float, Double (32- and 64-bit, signed); String; byte[] (an array of … bytes); Boolean
 Complex types: Object (must be « Hadoop serializable ») Dataiku - Pig, Hive and Cascading
• 27. Style Summary
 Pig: Style: Procedural | Typing: Static + Dynamic | Data Model: scalar + tuple + bag (fully recursive) | Metadata store: No (HCatalog)
 Hive: Style: Declarative | Typing: Static + Dynamic, enforced at execution time | Data Model: scalar + list + map | Metadata store: Integrated
 Cascading: Style: Procedural | Typing: Weak | Data Model: scalar + java objects | Metadata store: No
Dataiku - Pig, Hive and Cascading
• 28. Comparing without Comparable }  Philosophy ◦  Procedural Vs Declarative ◦  Data Model and Schema }  Productivity ◦  Headachability ◦  Checkpointing ◦  Testing, error management and environment }  Integration ◦  Partitioning ◦  Formats Integration ◦  External Code Integration }  Performance and optimization Dataiku - Pig, Hive and Cascading
• 29. Headachability
 Motivation }  Does debugging the tool lead to bad headaches? Dataiku - Pig, Hive and Cascading
  • 30. Headaches
 Pig }  Out Of Memory Error (Reducer) }  Exception in Building / Extended Functions 
 (handling of null) }  Null vs “” }  Nested Foreach and scoping }  Date Management (pig 0.10) }  Field implicit ordering Dataiku - Pig, Hive and Cascading
  • 31. A Pig Error Dataiku - Pig, Hive and Cascading
  • 32. Headaches
 Hive }  Out of Memory Errors in Reducers }  Few Debugging Options }  Null / “” }  No builtin “first” Dataiku - Pig, Hive and Cascading
  • 33. Headaches
 Cascading }  Weak Typing Errors (comparing Int and String … ) }  Illegal Operation Sequence (Group after group …) }  Field Implicit Ordering Dataiku - Pig, Hive and Cascading
  • 34. Testing
 Motivation }  How to perform unit tests ? }  How to have different versions of the same script (parameter) ? Dataiku - Pig, Hive and Cascading
• 35. Testing
 Pig }  System variables }  Comment out code to test }  No metaprogramming }  pig -x local to execute on local files Dataiku - Pig, Hive and Cascading
• 36. Testing / Environment
 Cascading }  JUnit tests are possible }  Ability to use code to actually comment out some variables Dataiku - Pig, Hive and Cascading
• 37. Checkpointing
 Motivation }  Lots of iterations while developing on Hadoop }  Sometimes jobs fail }  Sometimes you need to restart from the start … [Diagram: Parse Logs -> Per Page Stats / Page User Correlation -> Filtering -> Output; FIX and relaunch] Dataiku - Pig, Hive and Cascading
• 38. Pig
 Manual Checkpointing }  STORE command to manually store intermediate files [Diagram: Parse Logs -> Per Page Stats / Page User Correlation -> Filtering -> Output; // COMMENT the beginning of the script and relaunch] Dataiku - Pig, Hive and Cascading
• 39. Cascading
 Automated Checkpointing }  Ability to re-run a flow automatically from the last saved checkpoint: addCheckpoint(…) Dataiku - Pig, Hive and Cascading
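The deck only names addCheckpoint(…); as I recall the Cascading 2.x API, wiring a checkpoint looks roughly like this (treat the exact calls and names as an assumption, not the deck's code):

  // A minimal sketch, assuming Cascading 2.x: insert a Checkpoint pipe in the assembly
  // and bind it to a Tap in the FlowDef; setRunID() lets a failed run restart from
  // the last checkpoint that completed. Downstream pipes should be built from the
  // returned pipe.
  import cascading.flow.FlowDef;
  import cascading.pipe.Checkpoint;
  import cascading.pipe.Pipe;
  import cascading.scheme.hadoop.TextDelimited;
  import cascading.tap.Tap;
  import cascading.tap.hadoop.Hfs;

  public class CheckpointExample {
    public static Pipe checkpointAfter(Pipe perPageStats, FlowDef flowDef) {
      Checkpoint afterStats = new Checkpoint("afterStats", perPageStats);
      Tap checkpointTap = new Hfs(new TextDelimited(true, "\t"),
                                  "/tmp/checkpoints/per_page_stats");   // hypothetical path
      flowDef.addCheckpoint(afterStats, checkpointTap);  // persist intermediate tuples here
      flowDef.setRunID("web-analytics-daily");           // enables restart from the checkpoint
      return afterStats;
    }
  }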
• 40. Cascading
 Topological Scheduler }  Check the timestamp of each intermediate file }  Execute a step only if its inputs are more recent [Diagram: Parse Logs -> Per Page Stats / Page User Correlation -> Filtering -> Output] Dataiku - Pig, Hive and Cascading
• 41. Productivity Summary
 Pig: Headaches: Lots | Checkpointing/Replay: Manual Save | Testing/Metaprogramming: Difficult
 Hive: Headaches: Few, but without debugging options | Checkpointing/Replay: None (That's SQL) | Testing/Metaprogramming: None (That's SQL)
 Cascading: Headaches: Weak Typing Complexity | Checkpointing/Replay: Checkpointing, Partial Updates | Testing/Metaprogramming: Possible
Dataiku - Pig, Hive and Cascading
• 42. Comparing without Comparable }  Philosophy ◦  Procedural Vs Declarative ◦  Data Model and Schema }  Productivity ◦  Headachability ◦  Checkpointing ◦  Testing and environment }  Integration ◦  Formats Integration ◦  Partitioning ◦  External Code Integration }  Performance and optimization Dataiku - Pig, Hive and Cascading
• 43. Formats Integration
 Motivation }  Ability to integrate different file formats ◦  Text Delimited ◦  Sequence File (binary Hadoop format) ◦  Avro, Thrift .. }  Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, databases, …)
 Format impact on size and performance (Hive processing time, 24 cores):
 Text File, uncompressed | 18.7 GB on disk | 1m32s
 1 Text File, Gzipped | 3.89 GB | 6m23s (no parallelization)
 JSON compressed | 7.89 GB | 2m42s
 Multiple text files, gzipped | 4.02 GB | 43s
 Sequence File, Block, Gzip | 5.32 GB | 1m18s
 Text File, LZO Indexed | 7.03 GB | 1m22s
Dataiku - Pig, Hive and Cascading
• 44. Format Integration
 }  Hive: SerDe (Serializer-Deserializer) }  Pig: Storage }  Cascading: Tap Dataiku - Pig, Hive and Cascading
  • 45. Partitions
 Motivation }  No support for “UPDATE” patterns, any increment is performed by adding or deleting a partition }  Common partition schemas on Hadoop ◦  By Date /apache_logs/dt=2013-01-23 ◦  By Data center /apache_logs/dc=redbus01/… ◦  By Country ◦  … ◦  Or any combination of the above Dataiku - Pig, Hive and Cascading
• 46. Hive Partitioning
 Partitioned tables
  CREATE TABLE event (
    user_id INT,
    type STRING,
    message STRING)
  PARTITIONED BY (day STRING, server_id STRING);
 Disk structure:
  /hive/event/day=2013-01-27/server_id=s1/file0
  /hive/event/day=2013-01-27/server_id=s1/file1
  /hive/event/day=2013-01-27/server_id=s2/file0
  /hive/event/day=2013-01-27/server_id=s2/file1
  …
  /hive/event/day=2013-01-28/server_id=s2/file0
  /hive/event/day=2013-01-28/server_id=s2/file1
  INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27', server_id='s1')
  SELECT * FROM event_tmp;
Dataiku Training – Hadoop for Data Science 4/14/13 46
• 47. Cascading Partition }  No direct support for partitions }  Support for “Glob” Tap, to read from files using patterns
 }  → You can code your own custom or virtual partition schemes Dataiku - Pig, Hive and Cascading
  • 48. External Code Integration
 Simple UDF Pig Hive Cascading Dataiku - Pig, Hive and Cascading
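The three UDF examples on this slide are screenshots in the original deck; below are minimal sketches (three separate source files shown together) of what a "simple UDF" looks like in each framework, with APIs as of Pig 0.x, Hive 0.x and Cascading 2.x; class names and fields are illustrative, not the deck's:

  // ----- Upper.java : a Pig eval UDF, used as  up = FOREACH rel GENERATE com.example.Upper(word);
  import java.io.IOException;
  import org.apache.pig.EvalFunc;
  import org.apache.pig.data.Tuple;

  public class Upper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
      if (input == null || input.size() == 0 || input.get(0) == null) return null;
      return ((String) input.get(0)).toUpperCase();
    }
  }

  // ----- HiveUpper.java : an old-style Hive UDF, registered with
  // CREATE TEMPORARY FUNCTION my_upper AS 'com.example.HiveUpper';
  import org.apache.hadoop.hive.ql.exec.UDF;
  import org.apache.hadoop.io.Text;

  public class HiveUpper extends UDF {
    public Text evaluate(Text s) {
      return s == null ? null : new Text(s.toString().toUpperCase());
    }
  }

  // ----- UpperFn.java : a Cascading Function, applied with
  // new Each(pipe, new Fields("word"), new UpperFn(), Fields.RESULTS);
  import cascading.flow.FlowProcess;
  import cascading.operation.BaseOperation;
  import cascading.operation.Function;
  import cascading.operation.FunctionCall;
  import cascading.tuple.Fields;
  import cascading.tuple.Tuple;

  public class UpperFn extends BaseOperation implements Function {
    public UpperFn() { super(1, new Fields("upper_word")); }   // 1 argument, 1 declared output field
    @Override
    public void operate(FlowProcess flowProcess, FunctionCall call) {
      String word = call.getArguments().getString(0);
      call.getOutputCollector().add(new Tuple(word == null ? null : word.toUpperCase()));
    }
  }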
  • 49. Hive Complex UDF
 (Aggregators) Dataiku - Pig, Hive and Cascading
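The aggregator example on this slide is also a screenshot; a sketch of the simpler (old-style) Hive UDAF API, whose iterate / terminatePartial / merge / terminate methods mirror the combiner pattern discussed later in the deck. Class and method bodies are illustrative, and newer Hive code would use the GenericUDAF API instead:

  import org.apache.hadoop.hive.ql.exec.UDAF;
  import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

  public class MeanUDAF extends UDAF {
    // partial aggregation state shipped between map and reduce sides
    public static class State {
      public double sum;
      public long count;
    }

    public static class MeanEvaluator implements UDAFEvaluator {
      private State state = new State();

      public void init() { state.sum = 0; state.count = 0; }

      // called for every row of a group (map side)
      public boolean iterate(Double value) {
        if (value != null) { state.sum += value; state.count++; }
        return true;
      }

      // partial result handed to the next stage (acts like a combiner output)
      public State terminatePartial() { return state; }

      // merge a partial result coming from another task
      public boolean merge(State other) {
        if (other != null) { state.sum += other.sum; state.count += other.count; }
        return true;
      }

      // final value of the aggregation
      public Double terminate() { return state.count == 0 ? null : state.sum / state.count; }
    }
  }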
  • 50. Cascading 
 Direct Code Evaluation Dataiku - Pig, Hive and Cascading
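The screenshot here shows Cascading's expression operations: a Java expression string can be compiled (via Janino) straight into the pipe assembly instead of writing a full Function class. A sketch, assuming Cascading 2.x; pipe and field names are illustrative:

  import cascading.operation.expression.ExpressionFunction;
  import cascading.pipe.Each;
  import cascading.pipe.Pipe;
  import cascading.tuple.Fields;

  public class DirectCodeEvaluation {
    public static Pipe addTotal(Pipe orders) {
      // "price * quantity" is compiled at runtime and evaluated for every tuple
      return new Each(orders,
          new Fields("price", "quantity"),
          new ExpressionFunction(new Fields("total"), "price * quantity", Double.class),
          Fields.ALL);   // keep the input fields plus the new "total" field
    }
  }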
• 51. Integration
 Summary
 Pig: Partition/Incremental Updates: No direct support | External Code Integration: Simple | Format: Doable, rich community
 Hive: Partition/Incremental Updates: Fully integrated, SQL-like | External Code Integration: Very simple, but complex dev setup | Format: Doable, existing community
 Cascading: Partition/Incremental Updates: With coding | External Code Integration: Complex UDFs, but regular Java expressions embeddable | Format: Doable, growing community
Dataiku - Pig, Hive and Cascading
• 52. Comparing without Comparable }  Philosophy ◦  Procedural Vs Declarative ◦  Data Model and Schema }  Productivity ◦  Headachability ◦  Checkpointing ◦  Testing and environment }  Integration ◦  Formats Integration ◦  Partitioning ◦  External Code Integration }  Performance and optimization Dataiku - Pig, Hive and Cascading
  • 53. Optimization }  Several Common Map Reduce Optimization Patterns ◦  Combiners ◦  MapJoin ◦  Job Fusion ◦  Job Parallelism ◦  Reducer Parallelism }  Different support per framework ◦  Fully Automatic ◦  Pragma / Directives / Options ◦  Coding style / Code to write Dataiku - Pig, Hive and Cascading
• 54. Combiner
 Perform Partial Aggregate at Mapper Stage
  SELECT date, COUNT(*) FROM product GROUP BY date
 [Diagram: without a combiner, every (date, product_id) record is shuffled to the reducers, which compute the per-date counts] Dataiku - Pig, Hive and Cascading
• 55. Combiner
 Perform Partial Aggregate at Mapper Stage
  SELECT date, COUNT(*) FROM product GROUP BY date
 [Diagram: with a combiner, each mapper pre-counts its own records per date and only the partial counts are shuffled] Reduced network bandwidth. Better parallelism Dataiku - Pig, Hive and Cascading
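In Pig and Hive the combiner is injected for you when the aggregate is algebraic; in a hand-written MapReduce job you opt in explicitly. A sketch of the relevant driver lines, reusing the mapper and reducer classes from the word-count sketch after slide 14 (assumption: the reduce function is associative and commutative, so the same class can serve as the combiner):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;

  public class CombinerSetup {
    public static Job newCountingJob() throws Exception {
      Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
      job.setMapperClass(WordCount.TokenizerMapper.class);
      job.setReducerClass(WordCount.IntSumReducer.class);
      // run the reducer as a combiner on each mapper's output:
      // partial sums are computed locally, so far less data is shuffled
      job.setCombinerClass(WordCount.IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      return job;
    }
  }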
• 56. Join Optimization
 Map Join
 Hive: set hive.auto.convert.join = true;
 Pig: [code screenshot]
 Cascading: [code screenshot] (no aggregation support after HashJoin) Dataiku - Pig, Hive and Cascading
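For Cascading the map-side join is the HashJoin pipe: the right-hand (small) stream is loaded into memory on each mapper, so no reduce phase is needed. A sketch, assuming Cascading 2.x with illustrative field names; as the slide notes, you should not hang an aggregation directly off the HashJoin:

  import cascading.pipe.HashJoin;
  import cascading.pipe.Pipe;
  import cascading.pipe.joiner.InnerJoin;
  import cascading.tuple.Fields;

  public class MapSideJoin {
    public static Pipe joinEventsWithUsers(Pipe events /* large */, Pipe users /* small */) {
      // the "users" stream is kept in memory on every mapper, like Pig's 'replicated' join
      // or Hive's auto-converted map join
      return new HashJoin(events, new Fields("user_id"),
                          users, new Fields("id"),
                          new InnerJoin());
    }
  }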
• 57. Number of Reducers }  Critical for performance }  Estimated from the size of the input file ◦  Hive: divide input size by hive.exec.reducers.bytes.per.reducer (default 1GB) ◦  Pig: divide input size by pig.exec.reducers.bytes.per.reducer (default 1GB) Dataiku - Pig, Hive and Cascading
• 58. Performance & Optimization
 Summary
 Pig: Combiner Optimization: Automatic | Join Optimization: Option | Number of reducers: Estimate or DIY
 Cascading: Combiner Optimization: DIY | Join Optimization: HashJoin | Number of reducers: DIY
 Hive: Combiner Optimization: Partial | Join Optimization: Automatic / DIY (Map Join) | Number of reducers: Estimate or DIY
Dataiku - Pig, Hive and Cascading
• 59. Agenda }  Hadoop and Context (->0:03) }  Pig, Hive, Cascading, … (->0:06) }  How they work (->0:09) }  Comparing the tools (->0:25) }  Wrap-up and questions (->0:30) Dataiku - Pig, Hive and Cascading
• 60. }  Want to keep close to SQL ? ◦  Hive }  Want to write large flows ? ◦  Pig }  Want to integrate into large-scale programming projects ? ◦  Cascading (Cascalog / Scalding) Dataiku - Pig, Hive and Cascading