Common and Unique Use Cases for Apache Hadoop

August 30, 2011
  
Agenda

•  What is Apache Hadoop?
•  Log Processing
•  Catching 'Osama'
•  Extract Transform Load (ETL)
•  Analytics in HBase
•  Machine Learning
•  Final Thoughts

Copyright 2011 Cloudera Inc. All rights reserved
  
Exploding Data Volumes

•  Online
       •  Web-ready devices
       •  Social media
       •  Digital content
       •  Smart grids

•  Enterprise
       •  Transactions
       •  R&D data
       •  Operational (control) data

[Chart labels: Complex, Unstructured; Relational]

2,500 exabytes of new information in 2012 with Internet as primary driver

Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 "zettabytes" this year

Source: An IDC White Paper - sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009
  
Origin of Hadoop
How does an elephant sneak up on you?

[Timeline, 2002 to 2010; events listed in left-to-right order]

•  Open Source, Web Crawler project created by Doug Cutting
•  Publishes MapReduce, GFS Paper
•  Open Source, MapReduce & HDFS project created by Doug Cutting
•  Runs 4,000 Node Hadoop Cluster
•  Hadoop wins Terabyte sort benchmark
•  Launches SQL Support for Hadoop
•  Releases CDH3 and Cloudera Enterprise
  
What is Apache Hadoop?
Open Source Storage and Processing Engine

[Diagram: MapReduce; Hadoop Distributed File System (HDFS)]

•  Consolidates Everything
       •  Move complex and relational data into a single repository
•  Stores Inexpensively
       •  Keep raw data always available
       •  Use commodity hardware
•  Processes at the Source
       •  Eliminate ETL bottlenecks
       •  Mine data first, govern later
  
What is Apache Hadoop?
The Standard Way Big Data Gets Done

•  Hadoop is Flexible:
       •  Structured, unstructured
       •  Schema, no schema
       •  High volume, merely terabytes
       •  All kinds of analytic applications
•  Hadoop is Open: 100% Apache-licensed open source
•  Hadoop is Scalable: Proven at petabyte scale
•  Benefits:
       •  Controls costs by storing data more affordably per terabyte than any other platform
       •  Drives revenue by extracting value from data that was previously out of reach
  
What is Apache Hadoop?
The Importance of Being Open

•  No Lock-In - Investments in skills, services & hardware are preserved regardless of vendor choice
•  Community Development - Hadoop & related projects are expanding at a rapid pace
•  Rich Ecosystem - Dozens of complementary software, hardware and services firms
  
Log Processing
A Perfect Fit

•  Common uses of logs
       •  Find or count events (grep)
              grep "ERROR" file
              grep -c "ERROR" file

       •  Calculate metrics (performance or user behavior analysis)
              awk '{sums[$1]+=$2; counts[$1]+=1} END {for(k in counts) {print sums[k]/counts[k]}}'

       •  Investigate user sessions
              grep "USER" files … | sort | less
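
The awk one-liner on this slide averages the second column of each line, grouped by the first column. A minimal Python sketch of the same calculation follows, assuming whitespace-separated fields; the function name `average_by_key` and the sample data in the usage note are hypothetical. Unlike the awk version, it returns the keys alongside the averages.

```python
from collections import defaultdict


def average_by_key(lines):
    """Average the second field of each line, grouped by the first field."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        key, value = fields[0], float(fields[1])
        sums[key] += value
        counts[key] += 1
    # One average per key, like the END block of the awk script.
    return {key: sums[key] / counts[key] for key in counts}
```

For example, `average_by_key(["/index.html 120", "/index.html 80"])` averages the two values for `/index.html`. This runs on a single machine, which is exactly the limit the next slides address.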
  
Log Processing
A Perfect Fit

•  Shoot…too much data
       •  Homegrown parallel processing is often done on a per-file basis, because it's easy
              •  No parallelism on a single large file

[Diagram: Task 0, Task 1 and Task 2, each reading its own access_log file]
  
Log Processing
A Perfect Fit

•  MapReduce to the rescue!
       •  Processing is done per unit of data

[Diagram: a single access_log file divided into four units, one task per unit]

Task      Unit of access_log
Task 0    0-64MB
Task 1    64-128MB
Task 2    128-192MB
Task 3    192-256MB

Each task is responsible for a unit of data
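
The split of one file into per-task units can be sketched as below. This is an illustration only, assuming a fixed 64MB unit as drawn on the slide; the function name `input_splits` is hypothetical, and real Hadoop input formats also have to deal with records that straddle a unit boundary.

```python
def input_splits(file_size_mb, split_size_mb=64):
    """Return (task_id, start_mb, end_mb) tuples that cover the whole file."""
    splits = []
    start = 0
    task_id = 0
    while start < file_size_mb:
        # The last unit may be shorter than split_size_mb.
        end = min(start + split_size_mb, file_size_mb)
        splits.append((task_id, start, end))
        start = end
        task_id += 1
    return splits
```

Calling `input_splits(256)` yields the four units shown in the diagram, one per task.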
  
Log Processing
A Perfect Fit

•  Network or disk are bottlenecks
       •  Reading 100GB of data
              •  14 minutes with 1GbE network connection
              •  22 minutes on standard disk drive

[Diagram: grep reads access_log over a link labeled "Bandwidth is limited"]
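
The two read times can be reproduced with simple arithmetic. The sketch below assumes decimal units, a 1GbE link at its theoretical peak of 125 MB/second, and the 75 MB/second drive throughput quoted later in the deck; the helper name `read_minutes` is hypothetical.

```python
GB = 10**9
MB = 10**6


def read_minutes(data_bytes, bytes_per_second):
    """Minutes needed to read data_bytes at a sustained rate."""
    return data_bytes / bytes_per_second / 60


# 1 Gbit/s is 125 MB/s before any protocol overhead (assumption: peak rate).
network_minutes = read_minutes(100 * GB, 125 * MB)

# Assumption: one drive sustaining 75 MB/s.
disk_minutes = read_minutes(100 * GB, 75 * MB)
```

This gives about 13.3 minutes for the network and about 22.2 minutes for the disk. Protocol overhead on a real link would push the network figure toward the 14 minutes on the slide.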
  
Log Processing
A Perfect Fit

•  Hadoop to the rescue!
       •  Eliminates network bottleneck, data is on local disk
       •  Data is read from many, many disks in parallel

Physical Machines

Node      Task      Local data
NodeA     Task 0    0-64MB
NodeX     Task 1    64-128MB
NodeY     Task 2    128-192MB
NodeZ     Task 3    192-256MB
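
The idea of running each task on a machine that already holds its data can be sketched as follows. This is a toy placement routine, not Hadoop's scheduler: it assumes each unit of data is stored on one or more nodes (the replica lists are an assumption, not shown on the slide), and the name `assign_local_tasks` is hypothetical.

```python
def assign_local_tasks(block_replicas):
    """Map each block to one of the nodes that already stores a copy of it.

    block_replicas maps a block name to the list of nodes holding that block.
    Among those nodes, the one with the fewest tasks so far is chosen, so
    work spreads across the cluster and every read stays on a local disk.
    """
    load = {}
    assignment = {}
    for block, nodes in block_replicas.items():
        node = min(nodes, key=lambda n: load.get(n, 0))
        assignment[block] = node
        load[node] = load.get(node, 0) + 1
    return assignment
```

Because every task is placed on a node from its own replica list, no block has to cross the network before it is processed.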
  
Log Processing
A Perfect Fit

•  Hadoop currently scales to 4,000 nodes
       •  Goal for next release is 10,000 nodes
•  Nodes typically have 12 hard drives
•  A single hard drive has throughput of about 75MB/second
•  12 Hard Drives * 75 MB/second * 4000 Nodes = 3.4 TB/second
       •  That's bytes, not bits
•  That's enough bandwidth to read 1PB (1000 TB) in 5 minutes
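
The aggregate bandwidth figure can be checked with a few lines of arithmetic. The sketch assumes the slide's 3.4 TB/second uses binary units (1 TB = 1024 * 1024 MB), since the decimal result would be 3.6; the constant names are hypothetical.

```python
DRIVES_PER_NODE = 12
MB_PER_SECOND_PER_DRIVE = 75
NODES = 4000
MB_PER_TB = 1024 ** 2  # assumption: binary terabytes

# Every drive in the cluster reading at once.
total_mb_per_second = DRIVES_PER_NODE * MB_PER_SECOND_PER_DRIVE * NODES
total_tb_per_second = total_mb_per_second / MB_PER_TB

# Time to read 1PB, taken as 1000 TB as on the slide.
minutes_per_petabyte = 1000 / total_tb_per_second / 60
```

The total is 3,600,000 MB/second, or roughly 3.4 TB/second in binary units, and reading 1000 TB at that rate takes a little under 5 minutes.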
  
Catching 'Osama'
Embarrassingly Parallel

•  You have a few billion images of faces with geo-tags
       •  Tremendous storage problem
       •  Tremendous processing problem
              •  Bandwidth
              •  Coordination
  
Catching 'Osama'
Embarrassingly Parallel

•  Store the images in Hadoop
•  When processing, Hadoop will read the images from local disk, thousands of local disks spread throughout the cluster
•  Use a map-only job to compare input images against the 'needle' image
  
Catching 'Osama'
Embarrassingly Parallel

[Diagram: Map Task 0 and Map Task 1 read stored images and write matches]

•  Store images in Sequence Files
•  Tasks have copy of 'needle'
•  Output faces 'matching' needle
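
A map-only job of this shape can be sketched in plain Python. Everything specific here is an assumption for illustration: images are reduced to short numeric feature vectors, `distance` is a toy stand-in for a real face-matching algorithm, and the sequential loop in `run_map_only_job` stands in for tasks that Hadoop would run in parallel.

```python
def distance(a, b):
    """Sum of absolute differences between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def map_task(records, needle, threshold):
    """Emit (image_id, distance) for every record close enough to the needle.

    Each task has its own copy of the needle and reads only its own records,
    so tasks never coordinate and no reduce phase is needed.
    """
    for image_id, features in records:
        d = distance(features, needle)
        if d <= threshold:
            yield image_id, d


def run_map_only_job(splits, needle, threshold):
    """Run one map task per split and concatenate the outputs."""
    output = []
    for records in splits:
        output.extend(map_task(records, needle, threshold))
    return output
```

The absence of any shared state between `map_task` calls is what makes the problem embarrassingly parallel: adding machines adds throughput without adding coordination.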
  
Extract Transform Load (ETL)
Everyone is doing it

•  One of the most common use cases I see is replacing ETL processes
•  Hadoop is a huge sink of cheap storage and processing
•  Aggregates are built in Hadoop and exported
•  Apache Hive provides SQL-like querying on raw data
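
Building an aggregate from raw rows follows the map, group, reduce pattern. The sketch below runs in memory and only illustrates the shape of such a job; the row fields (`page`, `bytes`) and all function names are hypothetical, and the export step to the analytical database is left out.

```python
from collections import defaultdict


def map_row(row):
    """Emit a (key, value) pair from one raw row: here (page, bytes sent)."""
    yield row["page"], row["bytes"]


def reduce_values(key, values):
    """Collapse all values for one key into a single aggregate row."""
    return {"page": key, "hits": len(values), "total_bytes": sum(values)}


def build_aggregates(rows):
    """Group mapped values by key, then reduce each group."""
    groups = defaultdict(list)
    for row in rows:
        for key, value in map_row(row):
            groups[key].append(value)  # stands in for the shuffle phase
    return [reduce_values(key, groups[key]) for key in sorted(groups)]
```

The output is far smaller than the raw input, which is why it is the aggregates, rather than the raw data, that get exported.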
  
Extract Transform Load (ETL)
Everyone is doing it

[Diagram: 'Real' Time System (Website) with Online DB -> ETL -> Data Warehouse with Analytical DB and Business Intelligence Applications]

Much blood shed, here
  
Extract Transform Load (ETL)
Everyone is doing it

[Diagram: 'Real' Time System (Website) with Online DB -> Import -> Hadoop -> Export -> Data Warehouse with Analytical DB and Business Intelligence Applications]
  
Extract Transform Load (ETL)
Everyone is doing it

[Diagram: 'Real' Time System (Website) with Online DB -> Apache Sqoop -> Hadoop -> Apache Sqoop -> Data Warehouse with Analytical DB and Business Intelligence Applications]
  
Analytics in HBase
Scaling writes

•  Analytics is often simply counting things
•  Facebook chose HBase to store its massive counter infrastructure (more later)
•  How might one implement a counter infrastructure in HBase?
  
Analytics in HBase
Scaling writes

'Like' button IMG request sends HTTP request to Facebook servers which increments several counters

User & Content Type Counters

User            Content     Counter
brock@me.com    NEWS        5431
brock@me.com    TECH        79310
brock@me.com    SHOPPING    59
tom@him.com     SPORTS      94214

Individual Page Counters

URL                         Counter
com.cloudera/blog/…         154
com.cloudera/downloads/…    923621
com.cloudera/resources/…    2138
  
Analytics in HBase
Scaling writes

Individual Page Counters

URL                         Counter
com.cloudera/blog/…         154
com.cloudera/downloads/…    923621
com.cloudera/resources/…    2138

Host is reversed in URL as part of the key

•  Data is physically stored in sorted order
•  Scanning all 'com.cloudera' counters results in sequential I/O
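
Why reversing the host helps can be shown with a sorted list standing in for the table. This is a sketch under stated assumptions: keys are built as reversed host plus path, a Python list plays the role of HBase's sorted storage, and the names `row_key` and `prefix_scan` are hypothetical.

```python
import bisect
from urllib.parse import urlsplit


def row_key(url):
    """Build a key such as 'com.cloudera/blog/post' from a full URL."""
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    return reversed_host + parts.path


def prefix_scan(sorted_keys, prefix):
    """Return the contiguous run of keys that start with prefix."""
    start = bisect.bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]
```

With the host reversed, all pages of one site sort next to each other, so `prefix_scan(keys, "com.cloudera/")` touches one contiguous range instead of rows scattered across the key space.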
  
Facebook Analytics
Scaling writes

•  Real-time counters of URLs shared, links "liked", impressions generated
•  20 billion events/day (200K events/sec)
•  ~30 second latency from click to count
•  Heavy use of incrementColumnValue API for consistent counters
•  Tried MySQL, Cassandra, settled on HBase

   http://tiny.cloudera.com/hbase-„-analytics
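
The property that matters for consistent counters is that an increment is applied atomically and returns the new value. The class below is an in-memory stand-in that mimics this behavior with a lock; it is not the HBase client, and the names `CounterTable`, `increment` and `get` are hypothetical.

```python
import threading
from collections import defaultdict


class CounterTable:
    """In-memory stand-in for a table of counters keyed by (row, column)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cells = defaultdict(int)

    def increment(self, row, column, amount=1):
        """Atomically add amount to one cell and return the new value."""
        with self._lock:
            self._cells[(row, column)] += amount
            return self._cells[(row, column)]

    def get(self, row, column):
        """Read a cell; cells that were never incremented read as 0."""
        with self._lock:
            return self._cells.get((row, column), 0)
```

Because the read, add and write happen under one lock, concurrent callers cannot lose updates, which a separate get followed by a put would allow.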
  
Machine Learning
Apache Mahout

Text Clustering on Google News

Machine Learning
Apache Mahout

Collaborative Filtering on Amazon

Machine Learning
Apache Mahout

Classification in GMail
  
Machine Learning
Apache Mahout

•  Apache Mahout implements
       •  Collaborative Filtering
       •  Classification
       •  Clustering
       •  Frequent itemset
•  More coming with the integration of MapReduce.Next
  
Final Thoughts
Use the right tool

•  Other use cases
       •  OpenTSDB, an open, distributed, scalable Time Series Database (TSDB)
       •  Building Search Indexes (canonical use case)
       •  Facebook Messaging
       •  Cheap and Deep Storage, e.g. archiving emails for SOX compliance
       •  Audit Logging
•  Non-Use Cases
       •  Data processing is handled by one beefy server
       •  Data requires transactions
  
About the Presenter

•  Brock Noland
•  brock@cloudera.com
•  http://twitter.com/brocknoland
•  TC-HUG http://tch.ug
  

 
Cloudera's Flume
Cloudera's FlumeCloudera's Flume
Cloudera's Flume
 
Big Data in Real-Time at Twitter
Big Data in Real-Time at TwitterBig Data in Real-Time at Twitter
Big Data in Real-Time at Twitter
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyond
 
Scaling Instagram
Scaling InstagramScaling Instagram
Scaling Instagram
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
HDFS Design Principles
HDFS Design PrinciplesHDFS Design Principles
HDFS Design Principles
 
Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for Hadoop
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Semelhante a Common and unique use cases for Apache Hadoop

Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaMark Kerzner
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Hortonworks
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Cloudera, Inc.
 
Data Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataData Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataCloudera, Inc.
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computingJoey Echeverria
 
Apache Hadoop Now Next and Beyond
Apache Hadoop Now Next and BeyondApache Hadoop Now Next and Beyond
Apache Hadoop Now Next and BeyondDataWorks Summit
 
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...yaevents
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsRichard McDougall
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephantsOvidiu Dimulescu
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopAmir Shaikh
 
Introduction to Microsoft HDInsight and BI Tools
Introduction to Microsoft HDInsight and BI ToolsIntroduction to Microsoft HDInsight and BI Tools
Introduction to Microsoft HDInsight and BI ToolsDataWorks Summit
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Dataconomy Media
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft PlatformJesus Rodriguez
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandRichard McDougall
 
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopBusiness Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopCloudera, Inc.
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...Amr Awadallah
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101EMC
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Cloudera, Inc.
 

Semelhante a Common and unique use cases for Apache Hadoop (20)

Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
 
Data Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataData Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big Data
 
Cloud computing era
Cloud computing eraCloud computing era
Cloud computing era
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computing
 
Apache Hadoop Now Next and Beyond
Apache Hadoop Now Next and BeyondApache Hadoop Now Next and Beyond
Apache Hadoop Now Next and Beyond
 
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Introduction to Microsoft HDInsight and BI Tools
Introduction to Microsoft HDInsight and BI ToolsIntroduction to Microsoft HDInsight and BI Tools
Introduction to Microsoft HDInsight and BI Tools
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
 
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopBusiness Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
 

Último

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 

Último (20)

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 

Common and unique use cases for Apache Hadoop

  • 1. Common and Unique Use Cases for Apache Hadoop (August 30, 2011)
  • 2. Agenda
      • What is Apache Hadoop?
      • Log Processing
      • Catching 'Osama'
      • Extract Transform Load (ETL)
      • Analytics in HBase
      • Machine Learning
      • Final Thoughts
      Copyright 2011 Cloudera Inc. All rights reserved
  • 3. Exploding Data Volumes
      • Online
          • Web-ready devices
          • Social media
          • Digital content
          • Smart grids
      • Enterprise
          • Transactions
          • R&D data
          • Operational (control) data
      • Chart labels: Complex, Unstructured; Relational
      • 2,500 exabytes of new information in 2012 with Internet as primary driver
      • Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 "zettabytes" this year
      • Source: An IDC White Paper, sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009
  • 4. Origin of Hadoop: How does an elephant sneak up on you?
      • Timeline axis: 2002 to 2010
      • Events shown on the timeline:
          • Open Source, Web Crawler project created by Doug Cutting
          • Publishes MapReduce, GFS Paper
          • Open Source, MapReduce & HDFS project created by Doug Cutting
          • Runs 4,000 Node Hadoop Cluster
          • Hadoop wins Terabyte sort benchmark
          • Launches SQL Support for Hadoop
          • Releases CDH3 and Cloudera Enterprise
  • 5. What is Apache Hadoop? Open Source Storage and Processing Engine
      • Consolidates Everything
          • Move complex and relational data into a single repository
      • Stores Inexpensively
          • Keep raw data always available
          • Use commodity hardware
      • Processes at the Source
          • Eliminate ETL bottlenecks
          • Mine data first, govern later
      • Diagram labels: MapReduce; Hadoop Distributed File System (HDFS)
  • 6. What is Apache Hadoop? The Standard Way Big Data Gets Done
      • Hadoop is Flexible:
          • Structured, unstructured
          • Schema, no schema
          • High volume, merely terabytes
          • All kinds of analytic applications
      • Hadoop is Open: 100% Apache-licensed open source
      • Hadoop is Scalable: Proven at petabyte scale
      • Benefits:
          • Controls costs by storing data more affordably per terabyte than any other platform
          • Drives revenue by extracting value from data that was previously out of reach
  • 7. What is Apache Hadoop? The Importance of Being Open
      • No Lock-In: Investments in skills, services & hardware are preserved regardless of vendor choice
      • Community Development: Hadoop & related projects are expanding at a rapid pace
      • Rich Ecosystem: Dozens of complementary software, hardware and services firms
  • 8. Agenda
      • What is Apache Hadoop?
      • Log Processing
      • Catching 'Osama'
      • Extract Transform Load (ETL)
      • Analytics in HBase
      • Machine Learning
      • Final Thoughts
  • 9. Log Processing: A Perfect Fit
      • Common uses of logs
          • Find or count events (grep)
              grep "ERROR" file
              grep -c "ERROR" file
          • Calculate metrics (performance or user behavior analysis)
              awk '{sums[$1]+=$2; counts[$1]+=1} END {for(k in counts) {print sums[k]/counts[k]}}'
          • Investigate user sessions
              grep "USER" files … | sort | less
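The awk one-liner on this slide computes a per-key average. A minimal Python sketch of the same calculation follows, assuming each log line starts with a key followed by a numeric value; the sample lines and the function name `average_by_key` are made up for illustration.

```python
from collections import defaultdict


def average_by_key(lines):
    """Return {key: mean value} for whitespace-separated 'key value' lines."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        key, value = fields[0], float(fields[1])
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in counts}


# Hypothetical sample: request path followed by response time in ms.
sample = ["/index.html 120", "/index.html 80", "/about.html 50"]
averages = average_by_key(sample)
print(averages)
```

Unlike the awk version, this keeps the key next to each average, which is usually what a metrics report needs.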
  • 10. Log Processing: A Perfect Fit
      • Shoot… too much data
          • Homegrown parallel processing is often done on a per-file basis because it is easy
          • No parallelism on a single large file
      • Diagram: Task 0, Task 1, and Task 2 each process one whole access_log file
  • 11. Log Processing: A Perfect Fit
      • MapReduce to the rescue!
          • Processing is done per unit of data
      • Diagram: one access_log divided among four tasks
          • Task 0: 0-64MB
          • Task 1: 64-128MB
          • Task 2: 128-192MB
          • Task 3: 192-256MB
      • Each task is responsible for a unit of data
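The per-unit division on this slide can be sketched as a small function that cuts a file into fixed-size byte ranges, one per task. This is an illustration only: the function name `compute_splits` is hypothetical, the 64 MB unit is taken from the diagram, and a real input format also has to deal with records that straddle a boundary, which this sketch ignores.

```python
MB = 1024 * 1024


def compute_splits(file_size, split_size=64 * MB):
    """Return (start, end) byte ranges covering file_size, one range per task."""
    splits = []
    start = 0
    while start < file_size:
        end = min(start + split_size, file_size)
        splits.append((start, end))
        start = end
    return splits


# A 256 MB access_log, as in the diagram.
splits = compute_splits(256 * MB)
for task_id, (start, end) in enumerate(splits):
    print(f"Task {task_id}: {start // MB}-{end // MB}MB")
```

Because the ranges are independent, a single large file no longer limits parallelism to one task.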
  • 12. Log Processing: A Perfect Fit
      • Network or disk are bottlenecks
      • Reading 100GB of data
          • 14 minutes with 1GbE network connection
          • 22 minutes on standard disk drive
      • Diagram: grep reading access_log; bandwidth is limited
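The two read times can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes decimal units, a 1 GbE link delivering its full 125 MB/s with no protocol overhead, and a disk sustaining about 75 MB/s (the per-drive figure quoted later in this deck); all of these are simplifying assumptions.

```python
GB = 10**9
MB = 10**6


def read_minutes(size_bytes, bytes_per_second):
    """Time in minutes to stream size_bytes at a constant rate."""
    return size_bytes / bytes_per_second / 60


data = 100 * GB
network_rate = 1_000_000_000 / 8  # 1 Gbit/s expressed in bytes/s (125 MB/s)
disk_rate = 75 * MB               # assumed sustained throughput of one drive

network_minutes = read_minutes(data, network_rate)
disk_minutes = read_minutes(data, disk_rate)
print(f"network: {network_minutes:.1f} min, disk: {disk_minutes:.1f} min")
```

The network figure comes out slightly under the slide's 14 minutes, which is consistent with the slide allowing for some overhead.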
  • 13. Log Processing: A Perfect Fit
      • Hadoop to the rescue!
          • Eliminates network bottleneck, data is on local disk
          • Data is read from many, many disks in parallel
      • Diagram: physical machines, each running one task on a local unit of data
          • NodeA: Task 0, 0-64MB
          • NodeX: Task 1, 64-128MB
          • NodeY: Task 2, 128-192MB
          • NodeZ: Task 3, 192-256MB
  • 14. Log Processing: A Perfect Fit
      • Hadoop currently scales to 4,000 nodes
          • Goal for next release is 10,000 nodes
      • Nodes typically have 12 hard drives
      • A single hard drive has throughput of about 75MB/second
      • 12 Hard Drives * 75 MB/second * 4000 Nodes = 3.4 TB/second
          • That's bytes, not bits
      • That's enough bandwidth to read 1PB (1000 TB) in 5 minutes
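The aggregate figure can be checked directly. The sketch below uses only the numbers on the slide; the one assumption is that the 3.4 TB/second result uses binary units (1 TB = 1024 * 1024 MB), since decimal units would give 3.6 TB/second.

```python
drives_per_node = 12
mb_per_second_per_drive = 75
nodes = 4000

# Aggregate sequential read throughput across the whole cluster.
total_mb_per_second = drives_per_node * mb_per_second_per_drive * nodes

total_tb_binary = total_mb_per_second / 1024 / 1024  # assumed binary units
total_tb_decimal = total_mb_per_second / 1_000_000   # decimal units, for comparison

# Time to read 1 PB, taken as 1000 TB as on the slide.
minutes_for_1pb = 1000 / total_tb_binary / 60

print(f"{total_tb_binary:.2f} TB/s (binary), {total_tb_decimal:.1f} TB/s (decimal)")
print(f"1 PB in about {minutes_for_1pb:.1f} minutes")
```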
  • 15. Agenda
      • What is Apache Hadoop?
      • Log Processing
      • Catching 'Osama'
      • Extract Transform Load (ETL)
      • Analytics in HBase
      • Machine Learning
      • Final Thoughts
  • 16. Catching 'Osama': Embarrassingly Parallel
      • You have a few billion images of faces with geo-tags
      • Tremendous storage problem
      • Tremendous processing problem
          • Bandwidth
          • Coordination
  • 17. Catching 'Osama': Embarrassingly Parallel
      • Store the images in Hadoop
      • When processing, Hadoop will read the images from local disk, thousands of local disks spread throughout the cluster
      • Use a map-only job to compare input images against the 'needle' image
  • 18. Catching 'Osama': Embarrassingly Parallel
      • Diagram: Map Task 0 and Map Task 1 each read stored images
          • Tasks have a copy of the 'needle'
          • Store images in Sequence Files
          • Output faces 'matching' the needle
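The map-only pattern above can be sketched in plain Python, with no Hadoop involved. Everything here is a stand-in: images are reduced to short made-up feature vectors, `similarity` is a toy cosine measure rather than a real face matcher, and `map_task`, the threshold, and the sample records are hypothetical.

```python
def similarity(a, b):
    """Toy cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def map_task(records, needle, threshold=0.95):
    """Emit (image_id, geo_tag) for each record that resembles the needle.

    Each task holds its own copy of the needle and only sees its own split,
    so no reduce step or cross-task coordination is needed.
    """
    for image_id, (features, geo_tag) in records:
        if similarity(features, needle) >= threshold:
            yield image_id, geo_tag


needle = [0.9, 0.1, 0.4]
split_0 = [
    ("img-001", ([0.9, 0.1, 0.4], "44.98,-93.27")),
    ("img-002", ([0.1, 0.9, 0.2], "51.50,-0.12")),
]
split_1 = [("img-003", ([0.88, 0.12, 0.41], "40.71,-74.00"))]

# Each split would run as a separate map task; here they run one after another.
matches = [hit for split in (split_0, split_1) for hit in map_task(split, needle)]
print(matches)
```

The output of each task is simply concatenated, which is what makes the problem embarrassingly parallel.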
  • 19. Agenda
      • What is Apache Hadoop?
      • Log Processing
      • Catching 'Osama'
      • Extract Transform Load (ETL)
      • Analytics in HBase
      • Machine Learning
      • Final Thoughts
  • 20. Extract Transform Load (ETL): Everyone is doing it
      • One of the most common use cases I see is replacing ETL processes
      • Hadoop is a huge sink of cheap storage and processing
      • Aggregates built in Hadoop and exported
      • Apache Hive provides SQL-like querying on raw data
  • 21. Extract Transform Load (ETL): Everyone is doing it
      • Diagram:
          • 'Real' Time System (Website): Online DB
          • ETL (annotation: much bloodshed here)
          • Data Warehouse: Analytical DB
          • Business Intelligence Applications
  • 22. Extract Transform Load (ETL): Everyone is doing it
      • Diagram:
          • 'Real' Time System (Website): Online DB
          • Import into Hadoop, export from Hadoop
          • Data Warehouse: Analytical DB
          • Business Intelligence Applications
  • 23. Extract Transform Load (ETL): Everyone is doing it
      • Diagram:
          • 'Real' Time System (Website): Online DB
          • Apache Sqoop into Hadoop, Apache Sqoop out of Hadoop
          • Data Warehouse: Analytical DB
          • Business Intelligence Applications
  • 24. Agenda
      • What is Apache Hadoop?
      • Log Processing
      • Catching 'Osama'
      • Extract Transform Load (ETL)
      • Analytics in HBase
      • Machine Learning
      • Final Thoughts
  • 25. Analytics in HBase: Scaling writes
      • Analytics is often simply counting things
      • Facebook chose HBase to store its massive counter infrastructure (more later)
      • How might one implement a counter infrastructure in HBase?
  • 26. Analytics in HBase: Scaling writes
      • 'Like' button IMG request sends HTTP request to Facebook servers, which increments several counters
      • User & Content Type Counters
            User            Content     Counter
            brock@me.com    NEWS        5431
            brock@me.com    TECH        79310
            brock@me.com    SHOPPING    59
            tom@him.com     SPORTS      94214
      • Individual Page Counters
            URL                          Counter
            com.cloudera/blog/…          154
            com.cloudera/downloads/…     923621
            com.cloudera/resources/…     2138
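The fan-out from one event to several counters can be sketched with in-memory counters. This is not HBase code: plain `collections.Counter` objects stand in for the two tables, the function name `record_like` is hypothetical, and the page key `com.cloudera/blog/example-post` is invented because the slide elides the real paths.

```python
from collections import Counter

# Stand-ins for the two counter tables shown on the slide.
user_content_counters = Counter()  # keyed by (user, content type)
page_counters = Counter()          # keyed by reversed-host page key


def record_like(user, content_type, page_key):
    """Apply one 'Like' event by incrementing every counter it affects."""
    user_content_counters[(user, content_type)] += 1
    page_counters[page_key] += 1


# Hypothetical events; the page key is made up for illustration.
record_like("brock@me.com", "NEWS", "com.cloudera/blog/example-post")
record_like("brock@me.com", "NEWS", "com.cloudera/blog/example-post")
record_like("tom@him.com", "SPORTS", "com.cloudera/blog/example-post")

print(user_content_counters)
print(page_counters)
```

In a store that serves many writers, each `+= 1` would need to be an atomic increment on the server side so that concurrent events do not overwrite each other.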
  • 27. Analytics in HBase: Scaling writes
      • Individual Page Counters
            URL                          Counter
            com.cloudera/blog/…          154
            com.cloudera/downloads/…     923621
            com.cloudera/resources/…     2138
      • Host is reversed in URL as part of the key
      • Data is physically stored in sorted order
      • Scanning all 'com.cloudera' counters results in sequential I/O
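The effect of reversing the host can be seen by building keys for a few URLs and sorting them. The sketch below is an assumption about how such a key might be derived; the function name `page_row_key` and the sample URLs are hypothetical, and only the `com.cloudera/...` key shape comes from the slide.

```python
from urllib.parse import urlsplit


def page_row_key(url):
    """Build a row key with the host labels reversed, followed by the path."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + parts.path


# Hypothetical URLs from two different sites, in arbitrary order.
urls = [
    "http://cloudera.com/resources/a",
    "http://example.org/x",
    "http://cloudera.com/blog/b",
    "http://cloudera.com/downloads/c",
]
keys = sorted(page_row_key(u) for u in urls)
for key in keys:
    print(key)
```

After sorting, every key for one site shares a prefix and sits in one contiguous run, so a prefix scan reads neighbouring rows instead of jumping around the table.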
  • 28. Facebook Analytics: Scaling writes
      • Real-time counters of URLs shared, links "liked", impressions generated
      • 20 billion events/day (200K events/sec)
      • ~30 second latency from click to count
      • Heavy use of incrementColumnValue API for consistent counters
      • Tried MySQL, Cassandra, settled on HBase
      • http://tiny.cloudera.com/hbase-„-analytics
  • 29. Agenda
      • What is Apache Hadoop?
      • Log Processing
      • Catching 'Osama'
      • Extract Transform Load (ETL)
      • Analytics in HBase
      • Machine Learning
      • Final Thoughts
  • 30. Machine Learning: Apache Mahout
      • Text Clustering on Google News
  • 31. Machine Learning: Apache Mahout
      • Collaborative Filtering on Amazon
  • 32. Machine Learning: Apache Mahout
      • Classification in GMail
  • 33. Machine Learning: Apache Mahout
      • Apache Mahout implements
          • Collaborative Filtering
          • Classification
          • Clustering
          • Frequent itemset
      • More coming with the integration of MapReduce.Next
  • 34. Agenda
      • What is Apache Hadoop?
      • Log Processing
      • Catching 'Osama'
      • Extract Transform Load (ETL)
      • Analytics in HBase
      • Machine Learning
      • Final Thoughts
  • 35. Final Thoughts: Use the right tool
      • Other use cases
          • OpenTSDB, an open, distributed, scalable Time Series Database (TSDB)
          • Building Search Indexes (canonical use case)
          • Facebook Messaging
          • Cheap and Deep Storage, e.g. archiving emails for SOX compliance
          • Audit Logging
      • Non-Use Cases
          • Data processing is handled by one beefy server
          • Data requires transactions
  • 36. About the Presenter
      • Brock Noland
      • brock@cloudera.com
      • http://twitter.com/brocknoland
      • TC-HUG http://tch.ug