Hadoop Solutions

   By Zenyk Matchyshyn
  Staff Engineer @ Lohika
Agenda
   •    Why?
   •    Data in / Data out
   •    Data Formats
   •    Tools
   •    Providers
   •    Future
   •    Q/A




1/14/2013                    2
Why?
   •    Smart meter analysis
   •    Genome processing
   •    Sentiment & social media analysis
   •    Network capacity trending & management
   •    Ad targeting
   •    Fraud detection




DATA IN / DATA OUT


Flume

   •    Apache Flume is a distributed system for
        collecting streaming data.
   •    Developed by Cloudera, now an Apache project
   •    Popular & supported
   •    Features:
            •   Centralized config
            •   Failover
            •   Reliability

Flume - Responsibilities
•   Node – path from source to sink
•   Agent – collects data from the local host and forwards
    it to a Collector
•   Collector – collects the data and writes it into
    HDFS
•   Master – manages configuration and supports
    data flow




Data in / Data out - other solutions


   •    Scribe https://github.com/facebook/scribe –
        similar to Flume
   •    Chukwa http://incubator.apache.org/chukwa/
        – similar to Flume
   •    Oozie http://oozie.apache.org/ - workflow
        scheduler




Sqoop

   •    Apache project, originally from Cloudera
        http://sqoop.apache.org/
   •    Uses metadata to describe structure in HDFS
   •    Transports bulk data into and out of relational
        databases
   •    Reading & writing directly from Map/Reduce is
        an alternative
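
The bulk import idea can be sketched in Python. This is only an illustration, not Sqoop's implementation: sqlite3 stands in for the relational database, and the `students` table, the `import_table` helper and the comma delimiter are all hypothetical. It assumes a non-empty table with an integer key. Column names are read from the database metadata, and the key range is cut into slices, much as parallel import tasks would each take one slice.

```python
import sqlite3


def import_table(conn, table, split_column, num_splits):
    """Read a table in key-range slices and return (column names, delimited text parts)."""
    cur = conn.cursor()
    # Boundary query: find the key range that will be divided between the slices.
    lo, hi = cur.execute(
        f"SELECT MIN({split_column}), MAX({split_column}) FROM {table}"
    ).fetchone()
    # Column names come from the database metadata, not from the caller.
    columns = [d[0] for d in cur.execute(f"SELECT * FROM {table} LIMIT 0").description]
    step = (hi - lo) // num_splits + 1
    parts = []
    for start in range(lo, hi + 1, step):
        rows = cur.execute(
            f"SELECT * FROM {table} WHERE {split_column} >= ? AND {split_column} < ? "
            f"ORDER BY {split_column}",
            (start, start + step),
        ).fetchall()
        # Each slice becomes one delimited text part (one file per task in a real job).
        parts.append("\n".join(",".join(str(v) for v in row) for row in rows))
    return columns, parts


# Hypothetical demo data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)", [(i, f"student{i}") for i in range(1, 11)]
)
columns, parts = import_table(conn, "students", "id", num_splits=3)
print(columns)
for part in parts:
    print(part, end="\n---\n")
```

Splitting on an evenly distributed key keeps the slices balanced; a skewed key would leave some slices nearly empty.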



DATA FORMATS


Formats

   •    Input and Output matter
   •    Data in files is split
   •    XML and JSON are supported
   •    Put one document per line or suffer the
        consequences ;)
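
Why one document per line matters can be shown with a small Python sketch. It is a simplified model, not Hadoop's actual input format logic: the `split_lines` helper and the 40-byte split size are hypothetical. Because every JSON document ends with a newline, a split can always be extended to the next newline and each chunk can then be parsed on its own.

```python
import json


def split_lines(data: bytes, split_size: int):
    """Cut data into chunks of about split_size bytes, each ending on a newline."""
    splits = []
    start = 0
    while start < len(data):
        end = min(start + split_size, len(data))
        # Extend the chunk to the next newline so that no record is cut in half.
        newline = data.find(b"\n", end - 1)
        end = len(data) if newline == -1 else newline + 1
        splits.append(data[start:end])
        start = end
    return splits


# Hypothetical input: one JSON document per line.
records = [{"id": i, "text": f"event {i}"} for i in range(5)]
data = "".join(json.dumps(r) + "\n" for r in records).encode("utf-8")

splits = split_lines(data, split_size=40)
# Every chunk holds only whole lines, so it can be parsed independently.
parsed = [json.loads(line) for chunk in splits for line in chunk.splitlines()]
print(len(splits), "splits,", len(parsed), "records")
```

A pretty-printed JSON or XML document that spans many lines has no such safe cut points, so a reader would need custom logic to find record boundaries.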




Serialization frameworks
   •    Binary in nature, makes things a bit more
        complicated
   •    Thrift & Protobuf vs SequenceFile & Avro
   •    Native formats support splittability and
        compression
   •    Avro supports code generation and
        versioning, just like Thrift & Protobuf
   •    Out-of-the-box support in Hadoop


Compression

   •    Deflate (zlib)
   •    Gzip
   •    Bzip2 – splittable with additional work, slow
   •    LZO – block based
   •    LZOP – splittable with additional work
   •    Snappy – from Google, fast, but no splittability
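
A rough feel for the codecs above can be had with the Python standard library, which ships Deflate (zlib), Gzip and Bzip2. LZO and Snappy need third-party packages and are left out here. The sample log line is made up, and the sizes depend heavily on the data, so treat this as a sketch rather than a benchmark.

```python
import bz2
import gzip
import zlib

# Hypothetical, highly repetitive log data.
sample = b"2013-01-14 12:00:00 INFO request served in 12 ms\n" * 2000

codecs = {
    "deflate (zlib)": (zlib.compress, zlib.decompress),
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}

sizes = {}
for name, (compress, decompress) in codecs.items():
    packed = compress(sample)
    # Round trip to confirm that the codec is lossless.
    assert decompress(packed) == sample
    sizes[name] = len(packed)
    print(f"{name:15s} {len(sample)} -> {len(packed)} bytes")
```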



Testing
   •    MRUnit – unit testing for Map/Reduce jobs
        http://mrunit.apache.org/
   •    Data sampling for testing
   •    Data spikes detection
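
One possible way to sample data for tests is reservoir sampling, sketched below in Python. The slide does not say which sampling method is meant, so this technique is an assumption, and the `reservoir_sample` helper is hypothetical. It picks k records uniformly from a stream of unknown length in a single pass, which suits inputs too large to hold in memory.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return k records chosen uniformly at random from an iterable, in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample


# Hypothetical input: 10,000 log lines, of which 10 are kept for a test fixture.
lines = (f"line {n}" for n in range(10_000))
fixture = reservoir_sample(lines, k=10, seed=42)
print(fixture)
```

A fixed seed keeps the test fixture reproducible between runs.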




Small files

   •    Small files are problematic: each one is far
        smaller than an HDFS block, yet still costs
        NameNode memory and usually its own map task
   •    Can pack them into bigger Avro files
   •    Can move to HBase
   •    Hadoop Archives (HAR) files
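
The packing idea can be sketched in Python as a container of (file name, contents) records. This is not the Avro, SequenceFile or HAR format, only a hypothetical length-prefixed layout; the `pack` and `unpack` helpers and the file names are made up for illustration.

```python
import io
import struct


def pack(files: dict) -> bytes:
    """Pack {name: payload} into one blob of length-prefixed key/value records."""
    out = io.BytesIO()
    for name, payload in files.items():
        key = name.encode("utf-8")
        # Header: two big-endian unsigned 32-bit lengths (key, value).
        out.write(struct.pack(">II", len(key), len(payload)))
        out.write(key)
        out.write(payload)
    return out.getvalue()


def unpack(blob: bytes):
    """Yield (name, payload) pairs back out of a packed blob."""
    offset = 0
    while offset < len(blob):
        key_len, val_len = struct.unpack_from(">II", blob, offset)
        offset += 8
        key = blob[offset:offset + key_len].decode("utf-8")
        offset += key_len
        yield key, blob[offset:offset + val_len]
        offset += val_len


# Hypothetical small files.
small_files = {f"sensor-{n}.txt": f"reading {n}\n".encode("utf-8") for n in range(1000)}
blob = pack(small_files)
restored = dict(unpack(blob))
print(len(small_files), "files ->", len(blob), "bytes in one container")
```

One large container means one file entry for the NameNode instead of a thousand, while the original file names survive as record keys.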




TOOLS


Pig
    •    High level language for data analysis
    •    Uses Pig Latin to describe data flows
         (translates into MapReduce)
    •    Filters, Joins, Projections, Groupings, Counts,
         etc.
    •    Example:
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(John)
(Mary)
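
For readers new to Pig Latin, roughly the same data flow can be written in plain Python. The file contents are hypothetical: only the names John and Mary come from the example output, while the ages and GPAs are invented. The tab delimiter matches the PigStorage default.

```python
import csv
import io

# Hypothetical contents of the 'student' file (tab-delimited).
student = "John\t18\t4.0\nMary\t19\t3.8\n"

# A = LOAD 'student' ... AS (name:chararray, age:int, gpa:float);
rows = csv.reader(io.StringIO(student), delimiter="\t")
A = [(name, int(age), float(gpa)) for name, age, gpa in rows]

# B = FOREACH A GENERATE name;
B = [(name,) for name, _age, _gpa in A]

# DUMP B;
for record in B:
    print(f"({record[0]})")
```

The difference is that Pig compiles the same three steps into MapReduce jobs, so they run across the cluster instead of in one process.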


Hive


   •    SQL-like interface - HiveQL
   •    Has its own structure
   •    Not a pipeline like Pig
   •    Basically a distributed data warehouse
   •    Has execution optimization




HBase


•    Distributed, column oriented store
•    Independent of Hadoop
•    No translation into Map/Reduce
•    Stores data in MapFiles (indexed SequenceFiles)




PROVIDERS


Apache


   •    Umbrella for Hadoop projects
   •    No commercial support
   •    Active community
   •    Most recent builds




Cloudera

   •    Has its own tuned build – CDH
   •    Commercial support
   •    Certification & Training
   •    Has products on top of Hadoop (like Cloudera
        Manager etc.)
   •    Very high visibility




Amazon Elastic MapReduce (EMR)
   •    Custom build tailored for AWS environment
   •    Very easy
   •    Uses S3 for storage
   •    Uses SimpleDB for job flow state information
   •    Supports HBase




Hortonworks


   •    Own platform on top of Hadoop
   •    Big backers like Microsoft and Yahoo
   •    Offers training & certification




FUTURE


Future

 •    Percolator for incremental indexing and
      analysis of frequently changing datasets
 •    Dremel for ad hoc analytics
 •    Pregel for analyzing graph data
 •    ZooKeeper & Hadoop de-coupling with new
      execution engines to the rescue!




Q/A


            ?

Dev Dives: Streamline document processing with UiPath Studio Web
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 

Hadoop Solutions

  • 1. Hadoop Solutions
      By Zenyk Matchyshyn, Staff Engineer @ Lohika
  • 2. Agenda
      • Why?
      • Data in / Data out
      • Data Formats
      • Tools
      • Providers
      • Future
      • Q/A
      1/14/2013
  • 3. Why?
      • Smart meter analysis
      • Genome processing
      • Sentiment & social media analysis
      • Network capacity trending & management
      • Ad targeting
      • Fraud detection
  • 4. DATA IN / DATA OUT
  • 5. Flume
      • Apache Flume is a distributed system for collecting streaming data.
      • Developed by Cloudera, now an Apache project
      • Popular & supported
      • Features:
          • Centralized config
          • Failover
          • Reliability
  • 6. Flume - Responsibilities
      • Node – path from source to sink
      • Agent – collects data from the local host and forwards it to a Collector
      • Collector – collects the data and writes it into HDFS
      • Master – manages configuration and supports data flow
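The Agent and Collector roles above can be pictured with a small toy model. This is a minimal sketch in plain Python, not the Flume API: the class names mirror the slide, while the host names, the log lines, and the in-memory dictionary standing in for HDFS are all assumptions made for illustration.

```python
from collections import defaultdict


class Collector:
    """Stands in for a Flume Collector: gathers events and writes them to a sink."""

    def __init__(self):
        # Hypothetical stand-in for HDFS: host name -> list of stored lines.
        self.sink = defaultdict(list)

    def receive(self, host, event):
        self.sink[host].append(event)


class Agent:
    """Stands in for a Flume Agent: reads local events and forwards them."""

    def __init__(self, host, collector):
        self.host = host
        self.collector = collector

    def forward(self, lines):
        for line in lines:
            self.collector.receive(self.host, line.rstrip("\n"))


collector = Collector()
Agent("web-01", collector).forward(["GET /index\n", "GET /about\n"])
Agent("web-02", collector).forward(["POST /login\n"])

stored = dict(collector.sink)
print(stored)
```

Many agents feed one collector here, which is the fan-in shape the slide describes; the Master role (configuration) is left out of the sketch.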
  • 7. Data in / Data out - other solutions
      • Scribe https://github.com/facebook/scribe – similar to Flume
      • Chukwa http://incubator.apache.org/chukwa/ – similar to Flume
      • Oozie http://oozie.apache.org/ – workflow scheduler
  • 8. Sqoop
      • Apache project, originally from Cloudera http://sqoop.apache.org/
      • Uses metadata to describe structure in HDFS
      • Transports bulk data into and out of relational databases
      • Alternative: read from and write to the database directly in Map/Reduce jobs
  • 10. Formats
      • Input and Output matter
      • Data in files is split
      • XML and JSON are supported
      • Use one document per line or suffer the consequences ;)
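The "one document per line" advice is easier to see with a small experiment. The sketch below is plain Python, not Hadoop's input-format code; the sample records, the `read_split` helper, and the cut offset are assumptions chosen for illustration. It shows that when each JSON document sits on its own line, a file can be cut at an arbitrary offset and each side can still find whole records.

```python
import json

# Three JSON documents, one per line (newline-delimited JSON).
data = (
    '{"user": "john", "clicks": 3}\n'
    '{"user": "mary", "clicks": 5}\n'
    '{"user": "omar", "clicks": 2}\n'
)


def read_split(text, start, end):
    """Parse the records that begin inside [start, end) of the text.

    A reader that lands mid-record skips ahead to the next newline, and a
    reader finishes the record it has started even if it crosses `end`.
    """
    pos = start
    if start > 0 and text[start - 1] != "\n":
        # We landed in the middle of a line; the previous split owns it.
        pos = text.index("\n", start) + 1
    records = []
    while pos < end and pos < len(text):
        line_end = text.index("\n", pos)
        records.append(json.loads(text[pos:line_end]))
        pos = line_end + 1
    return records


# Cut the input at an arbitrary offset, as a block boundary would.
cut = 40
first = read_split(data, 0, cut)
second = read_split(data, cut, len(data))
users = [r["user"] for r in first + second]
print(users)
```

A pretty-printed JSON or XML document that spans many lines has no such cheap record boundary, which is the "consequences" part of the slide.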
  • 11. Serialization frameworks
      • Binary in nature, which makes things a bit more complicated
      • Thrift & Protobuf vs SequenceFile & Avro
      • Native formats support splittability and compression
      • Avro supports code generation and versioning, just like Thrift & Protobuf
      • Out-of-the-box support in Hadoop
  • 12. Compression
      • Deflate (zlib)
      • Gzip
      • Bzip2 – splittable with additional work, slow
      • LZO – block based
      • LZOP – splittable with additional work
      • Snappy – from Google, fast, but no splittability
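Why splittability matters can be shown with the standard-library `zlib` module (the Deflate codec from the list above). This is a minimal sketch, not how any Hadoop codec is implemented; the sample records and the `BLOCK_LINES` value are assumptions for illustration. A single compressed stream cannot be decoded starting from the middle, while independently compressed blocks can each be handed to a different worker.

```python
import zlib

lines = [f"record {i}: some repetitive log text\n".encode() for i in range(1000)]
payload = b"".join(lines)

# One continuous stream: a reader must start from the first byte.
whole = zlib.compress(payload)
try:
    zlib.decompress(whole[len(whole) // 2:])
    mid_stream_ok = True
except zlib.error:
    mid_stream_ok = False

# Block-based: compress fixed-size groups of lines independently and keep
# them as separate blocks, so any block can be decoded without the others.
BLOCK_LINES = 100  # illustrative block size, not a Hadoop default
blocks = [
    zlib.compress(b"".join(lines[i:i + BLOCK_LINES]))
    for i in range(0, len(lines), BLOCK_LINES)
]
fifth_block = zlib.decompress(blocks[4])

print(mid_stream_ok)
print(fifth_block.splitlines()[0])
```

The trade-off is that smaller independent blocks usually compress a little worse than one long stream, because each block starts without any history.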
  • 13. Testing
      • MRUnit – unit testing for Map/Reduce jobs http://mrunit.apache.org/
      • Data sampling for testing
      • Data spike detection
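MRUnit itself is a Java library, so the following is only a sketch of the same idea in plain Python, not the MRUnit API: keep the map and reduce logic in small functions and check each one against a hand-written sample. The word-count job, the `run_job` local driver, and the sample lines are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)


def run_job(lines):
    """Tiny local driver: map, shuffle/sort by key, then reduce."""
    pairs = sorted(pair for line in lines for pair in mapper(line))
    return [
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    ]


# Unit-test each stage with a small hand-written sample.
assert list(mapper("Hadoop hadoop Pig")) == [("hadoop", 1), ("hadoop", 1), ("pig", 1)]
assert reducer("hadoop", [1, 1]) == ("hadoop", 2)

result = run_job(["Hadoop hadoop Pig", "Hive pig"])
print(result)
```

The same small samples can come from the "data sampling" bullet: a few representative input lines are usually enough to exercise the mapper and reducer before running on a cluster.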
  • 14. Small files
      • Small files are problematic because of the big block size
      • Can pack them into bigger Avro files
      • Can move them to HBase
      • Can use Hadoop Archive (HAR) files
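The packing options above share one idea: store many small files in one large file plus an index. The sketch below shows that idea in plain Python. It is not the Avro or HAR on-disk format; the file names, contents, and the `pack` and `read` helpers are hypothetical.

```python
import io

# Hypothetical small files: name -> contents.
small_files = {
    "meter-001.csv": b"2013-01-14,12.5\n",
    "meter-002.csv": b"2013-01-14,7.25\n",
    "meter-003.csv": b"2013-01-14,9.0\n",
}


def pack(files):
    """Concatenate files into one blob and record (offset, length) per name."""
    blob = io.BytesIO()
    index = {}
    for name, content in sorted(files.items()):
        index[name] = (blob.tell(), len(content))
        blob.write(content)
    return blob.getvalue(), index


def read(blob, index, name):
    """Fetch one original file back out of the packed blob."""
    offset, length = index[name]
    return blob[offset:offset + length]


blob, index = pack(small_files)
print(index)
print(read(blob, index, "meter-002.csv"))
```

The file system then tracks one large file instead of thousands of tiny ones, at the cost of an extra index lookup per read.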
  • 16. Pig
      • High level language for data analysis
      • Uses Pig Latin to describe data flows (translates into MapReduce)
      • Filters, Joins, Projections, Groupings, Counts, etc.
      • Example:
          A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
          B = FOREACH A GENERATE name;
          DUMP B;
          (John)
          (Mary)
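For readers who do not know Pig Latin, the data flow in the example can be traced step by step in plain Python. This is only a local analogy, not how Pig executes the script; the contents of the `student` input (ages and GPAs) are made up so that the output matches the slide. `PigStorage()` with no argument reads tab-delimited fields.

```python
import csv
import io

# Hypothetical contents of the 'student' file; PigStorage() defaults to
# tab-delimited fields.
student = "John\t18\t4.0\nMary\t19\t3.8\n"

# A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
A = [
    (name, int(age), float(gpa))
    for name, age, gpa in csv.reader(io.StringIO(student), delimiter="\t")
]

# B = FOREACH A GENERATE name;
B = [(name,) for name, _age, _gpa in A]

# DUMP B;
for (name,) in B:
    print(f"({name})")
```

Each Pig relation (`A`, `B`) is a collection of tuples, which is why the dumped rows appear in parentheses.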
  • 17. Hive
      • SQL-like interface - HiveQL
      • Has its own structure
      • Not a pipeline like Pig
      • Basically a distributed data warehouse
      • Has execution optimization
  • 18. HBase
      • Distributed, column-oriented store
      • Independent of Hadoop
      • No translation into Map/Reduce
      • Stores data in MapFiles (indexed SequenceFiles)
  • 20. Apache
      • Umbrella for Hadoop projects
      • No commercial support
      • Active community
      • Most recent builds
  • 21. Cloudera
      • Has its own tuned build – CDH
      • Commercial support
      • Certification & Training
      • Has products on top of Hadoop (such as Cloudera Manager)
      • Very high visibility
  • 22. Amazon Elastic MapReduce (EMR)
      • Custom build tailored for the AWS environment
      • Very easy to use
      • Uses S3 for storage
      • Uses SimpleDB for job flow state information
      • Supports HBase
  • 23. HortonWorks
      • Own platform on top of Hadoop
      • Big backers like Microsoft and Yahoo
      • Offers training & certification
  • 25. Future
      • Percolator for incremental indexing and analysis of frequently changing datasets
      • Dremel for ad hoc analytics
      • Pregel for analyzing graph data
      • ZooKeeper & Hadoop de-coupling with new execution engines to the rescue!
  • 26. Q/A?