Hadoop Solutions

   By Zenyk Matchyshyn
  Staff Engineer @ Lohika
Agenda
   •    Why?
   •    Data in / Data out
   •    Data Formats
   •    Tools
   •    Providers
   •    Future
   •    Q/A




1/14/2013                    2
Why?
   •    Smart meter analysis
   •    Genome processing
   •    Sentiment & social media analysis
   •    Network capacity trending & management
   •    Ad targeting
   •    Fraud detection




DATA IN / DATA OUT


Flume

   •    Apache Flume is a distributed system for
        collecting streaming data.
   •    Developed by Cloudera, now an Apache project
   •    Popular & supported
   •    Features:
            •   Centralized config
            •   Failover
            •   Reliability

Flume - Responsibilities
•   Node – path from source to sink
•   Agent – collects data from the local host and forwards it
    to the Collector
•   Collector – collects the data and writes it into
    HDFS
•   Master – manages configuration and supports
    data flow




Data in / Data out - other solutions


   •    Scribe https://github.com/facebook/scribe –
        similar to Flume
   •    Chukwa http://incubator.apache.org/chukwa/
        – similar to Flume
   •    Oozie http://oozie.apache.org/ - workflow
        scheduler




Sqoop

   •    Apache project, originally from Cloudera
        http://sqoop.apache.org/
   •    Uses metadata to describe structure in HDFS
   •    Transports bulk data into and out of relational
        databases
   •    Reading and writing the database directly from
        Map/Reduce is the alternative



DATA FORMATS


Formats

   •    Input and Output matter
   •    Data in files is split
   •    XML and JSON are supported
   •    Put one document per line or suffer the
        consequences ;)
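
Why one document per line matters can be shown with a minimal sketch. It assumes a file of newline-delimited JSON and a hypothetical helper `read_split` that is handed an arbitrary byte range; it illustrates the usual convention for line-oriented splits and is not Hadoop's actual record reader.

```python
import json

def read_split(data: bytes, start: int, end: int):
    """Yield JSON documents whose first byte lies in [start, end).

    A reader that does not start at byte 0 skips the partial line it
    lands in; every reader finishes the last line it started, even if
    that line runs past `end`. Each document is therefore read once.
    """
    pos = start
    if start > 0:
        # Look from start - 1 so a line beginning exactly at `start` is kept.
        nl = data.find(b"\n", start - 1)
        if nl == -1:
            return
        pos = nl + 1
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            nl = len(data)
        line = data[pos:nl]
        if line.strip():
            yield json.loads(line)
        pos = nl + 1

# Hypothetical sample data: five small documents, one per line.
docs = [{"id": i, "text": "x" * i} for i in range(5)]
data = "".join(json.dumps(d) + "\n" for d in docs).encode()

# Cut the file at an arbitrary byte and read the two halves independently.
cut = len(data) // 2
first_half = list(read_split(data, 0, cut))
second_half = list(read_split(data, cut, len(data)))
```

A pretty-printed JSON or XML document that spans many lines has no such safe resynchronization point, which is the "consequences" part of the bullet.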




Serialization frameworks
   •    Binary in nature, makes things a bit more
        complicated
   •    Thrift & Protobuf vs SequenceFile & Avro
   •    Native formats support splittability and
        compression
   •    Avro supports code generation and
        versioning, just like Thrift & Protobuf
   •    Out-of-the-box support in Hadoop


Compression

   •    Deflate (zlib)
   •    Gzip
   •    Bzip2 – splittable with additional work, slow
   •    LZO – block based
   •    LZOP – splittable with additional work
   •    Snappy – from Google, fast, but no splittability
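
What "splittable" means here can be sketched with Python's standard `zlib` module standing in for the codecs above. The assumption is only that a continuous compressed stream cannot be decoded from the middle, while independently compressed blocks can; the function names and the 64 KB block size are illustrative, and none of this models the LZO or bzip2 file formats.

```python
import zlib
from typing import List

def compress_whole(data: bytes) -> bytes:
    """One continuous stream: compact, but decodable only from byte 0."""
    return zlib.compress(data)

def compress_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Compress fixed-size chunks independently so each decodes on its own."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def decodes_from(stream: bytes, offset: int) -> bool:
    """Report whether decompression can start at `offset` within `stream`."""
    try:
        zlib.decompress(stream[offset:])
        return True
    except zlib.error:
        return False

# Hypothetical input: 20,000 short text records.
records = b"".join(b"record-%06d\n" % i for i in range(20000))
block_size = 64 * 1024

whole = compress_whole(records)
blocks = compress_blocks(records, block_size)

# A worker handed the middle of the single stream cannot make progress...
middle_ok = decodes_from(whole, len(whole) // 2)
# ...but a worker handed block 2 can decode it without seeing blocks 0 and 1.
third_block = zlib.decompress(blocks[2])
```

The trade-off is that block-wise compression needs some record of where each block starts, which is the "additional work" the bullets mention.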



Testing
   •    MRUnit – unit testing for Map/Reduce jobs
        http://mrunit.apache.org/
   •    Data sampling for testing
   •    Data spikes detection
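
One common way to draw a fixed-size test sample from a dataset too large to hold in memory is reservoir sampling. The sketch below is that general technique, not something the slides prescribe; the function name and the seed parameter are assumptions made for illustration.

```python
import random

def reservoir_sample(records, k, seed=0):
    """Return k records chosen uniformly at random in a single pass.

    Works on any iterable, so the input can be streamed; only k
    records are ever held in memory.
    """
    rng = random.Random(seed)  # fixed seed keeps test samples reproducible
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample

# Hypothetical usage: 100 records out of a stream of 100,000.
sample = reservoir_sample(range(100000), 100)
```

A reproducible sample like this can then feed unit tests of a mapper or reducer without touching the full dataset.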




Small files

   •    Small files (far below the block size) are problematic:
        each one costs NameNode memory and usually its own map task
   •    Can pack them into bigger Avro files
   •    Can move to HBase
   •    Hadoop Archives (HAR) files
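
The packing idea can be sketched as a container of length-prefixed (name, content) records. This is a toy layout invented for illustration, assuming nothing about the real Avro, SequenceFile, or HAR formats; `pack` and `unpack` are hypothetical names.

```python
import io
import struct

def pack(files: dict) -> bytes:
    """Concatenate (name, content) pairs into one length-prefixed blob."""
    out = io.BytesIO()
    for name, content in files.items():
        key = name.encode("utf-8")
        # Two big-endian 32-bit lengths, then the key bytes, then the value bytes.
        out.write(struct.pack(">II", len(key), len(content)))
        out.write(key)
        out.write(content)
    return out.getvalue()

def unpack(blob: bytes):
    """Yield (name, content) pairs back out of a packed blob."""
    pos = 0
    while pos < len(blob):
        key_len, val_len = struct.unpack_from(">II", blob, pos)
        pos += 8
        name = blob[pos:pos + key_len].decode("utf-8")
        pos += key_len
        yield name, blob[pos:pos + val_len]
        pos += val_len

# Hypothetical small files keyed by their original path.
small_files = {"a.log": b"hello", "b.log": b"", "c/d.json": b'{"k": 1}'}
container = pack(small_files)
restored = dict(unpack(container))
```

Many small inputs become one large file, so the file count stays low while the original names survive as record keys.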




TOOLS


Pig
    •    High level language for data analysis
    •    Uses Pig Latin to describe data flows
         (translates into MapReduce)
    •    Filters, Joins, Projections, Groupings, Counts,
         etc.
    •    Example:
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(John)
(Mary)
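
For readers new to Pig, the same data flow can be written out step by step in plain Python. This is only an analogy: the `load` and `foreach_generate` helpers are hypothetical, the ages and GPAs are made-up sample values, and the one fact relied on is that `PigStorage()` splits fields on tabs by default.

```python
def load(lines, schema):
    """Rough analogue of LOAD ... USING PigStorage() AS (...)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")  # PigStorage's default delimiter
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}

def foreach_generate(relation, *names):
    """Rough analogue of FOREACH ... GENERATE: project the named fields."""
    for row in relation:
        yield tuple(row[n] for n in names)

schema = [("name", str), ("age", int), ("gpa", float)]
# Hypothetical contents of the 'student' file, one tab-separated record per line.
student = ["John\t18\t4.0\n", "Mary\t19\t3.8\n"]

A = load(student, schema)
B = foreach_generate(A, "name")
result = list(B)  # one single-field tuple per student
```

Pig compiles such a script into MapReduce jobs instead of running it in one process, but the relation-in, relation-out shape of each statement is the same.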


Hive


   •    SQL-like interface - HiveQL
   •    Has its own structure
   •    Not a pipeline like Pig
   •    Basically a distributed data warehouse
   •    Has execution optimization




HBase


•    Distributed, column oriented store
•    Independent of Hadoop
•    No translation into Map/Reduce
•    Stores data in MapFiles (indexed SequenceFiles)
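
The logical data model behind "column oriented" can be sketched as a sorted map from row key to column to timestamped versions. `ToyTable` below is an in-memory illustration with hypothetical method names; it is not the HBase client API and says nothing about how HBase lays data out on disk.

```python
from collections import defaultdict

class ToyTable:
    """Row key -> 'family:qualifier' -> list of (timestamp, value)."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts):
        versions = self._rows[row][column]
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # newest first

    def get(self, row, column):
        """Return the newest value of a cell, or None if it does not exist."""
        versions = self._rows.get(row, {}).get(column, [])
        return versions[0][1] if versions else None

    def scan(self, start_row, stop_row):
        """Yield rows in key order, start inclusive and stop exclusive."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                yield row, {c: v[0][1] for c, v in self._rows[row].items()}

# Hypothetical rows and columns.
table = ToyTable()
table.put("user1", "info:name", "Ann", ts=1)
table.put("user1", "info:name", "Anna", ts=2)  # newer version of the same cell
table.put("user2", "info:name", "Bob", ts=1)
table.put("user3", "stats:visits", 7, ts=1)
scanned = list(table.scan("user1", "user3"))
```

Lookups by row key and range scans over sorted keys are the access paths this model makes cheap, which is why row-key design matters so much in practice.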




PROVIDERS


Apache


   •    Umbrella for Hadoop projects
   •    No commercial support
   •    Active community
   •    Most recent builds




Cloudera

   •    Has its own tuned build – CDH
   •    Commercial support
   •    Certification & Training
   •    Has products on top of Hadoop (like Cloudera
        Manager etc.)
   •    Very high visibility




Amazon Elastic MapReduce (EMR)
   •    Custom build tailored for AWS environment
   •    Very easy
   •    Uses S3 for storage
   •    Uses SimpleDB for job flow state information
   •    Supports HBase




Hortonworks


   •    Own platform on top of Hadoop
   •    Big backers like Microsoft and Yahoo
   •    Offers training & certification




FUTURE


Future

 •    Percolator for incremental indexing and
      analysis of frequently changing datasets
 •    Dremel for ad hoc analytics
 •    Pregel for analyzing graph data
 •    ZooKeeper & Hadoop de-coupling with new
      execution engines to the rescue!




Q/A


            ?