Hadoop and
MapReduce
                               Friso van Vollenhoven
                               fvanvollenhoven@xebia.com


The workings of the elephant
Data everywhere

‣ Global data volume grows exponentially
‣ Information retrieval is BIG business these days
‣ Need means of economically storing and processing large data sets
Opportunity

‣ Commodity hardware is ultra cheap
‣ CPU and storage even cheaper
Traditional solution

‣ Store data in a (relational) database
‣ Run batch jobs for processing
Problems with existing solutions

‣ Databases are seek heavy; B-tree gives log(n) random accesses per update
‣ Seeks are wasted time, nothing of value happens during seeks
‣ Databases do not play well with commoditized hardware (SANs and 16 CPU
    machines are not in the price sweet spot of performance / $)
‣   Databases were not built with horizontal scaling in mind
Solution: sort/merge vs. updating the B-tree

‣   Eliminate the seeks, only sequential reading / writing
‣   Work with batches for efficiency
‣   Parallelize work load
‣   Distribute processing and storage
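A minimal Python sketch of the sort/merge idea, not the actual Lucene or Hadoop implementation: instead of updating an index record by record with random accesses, a batch of updates is sorted and merged with the already sorted index in one sequential pass. The name `merge_batch` and the in-memory lists standing in for on-disk files are assumptions made for illustration.

```python
import heapq
from operator import itemgetter


def merge_batch(index, updates):
    """Merge a batch of updates into a sorted index in one sequential pass.

    Both arguments are lists of (key, value) pairs. `index` must already be
    sorted by key; `updates` is sorted here. For a key present in both, the
    update replaces the indexed value.
    """
    batch = sorted(updates, key=itemgetter(0))
    # Tag records with their source so that, for equal keys, the index
    # record (0) is seen before the update record (1).
    tagged_index = ((key, 0, value) for key, value in index)
    tagged_batch = ((key, 1, value) for key, value in batch)
    merged = []
    for key, _source, value in heapq.merge(
        tagged_index, tagged_batch, key=lambda record: (record[0], record[1])
    ):
        if merged and merged[-1][0] == key:
            merged[-1] = (key, value)  # later record for the same key wins
        else:
            merged.append((key, value))
    return merged


index = [("a", 1), ("c", 3)]
updates = [("d", 4), ("b", 2), ("c", 30)]
result = merge_batch(index, updates)
```

Both inputs are read front to back exactly once, which is the access pattern that avoids seeks when the lists are files on disk.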
History

‣ 2000: Apache Lucene: batch index updates and sort/merge with on disk index
‣ 2002: Apache Nutch: distributed, scalable open source web crawler; sort/merge
    optimization applies
‣   2003-2004: Google publishes the GFS (2003) and MapReduce (2004) papers
‣   2006: Apache Hadoop: open source Java implementation of GFS and MR to solve
    Nutch’s problem; later becomes a standalone project
‣   2011: We’re here learning about it!
Hadoop foundations

‣   Commodity hardware (3K - 7K $ machines)
‣   Only sequential reads / writes
‣   Distribution of data and processing across cluster
‣   Built in reliability / fault tolerance / redundancy
‣   Disk based, does not require data or indexes to fit in RAM
‣   Apache licensed, Open Source Software
The US government builds its fingerprint search index using Hadoop.
The content of the People You May Know feature is created by a chain of many
MapReduce jobs that run daily. The jobs are reportedly a combination of graph
traversal, clustering and assisted machine learning.
Amazon’s Frequently Bought Together and Customers Who Bought This Item Also
Bought features are brought to you by MapReduce jobs. Recommendation
based on large sales transaction datasets is a common use case.
Top Charts generated daily based on millions of users’ listening behavior.
Top searches used for auto-completion are re-generated daily by a
MapReduce job using all searches for the past couple of days.
Popularity for search terms can be based on counts, but also trending
and correlation with other datasets (e.g. trending on social media,
news, charts in case of music and movies, best seller lists, etc.)
What is Hadoop
Hadoop
Filesystem


HDFS
HDFS overview

‣   Distributed filesystem
‣   Consists of a single master node and multiple (many) data nodes
‣   Files are split up into blocks (typically 64 MB)
‣   Blocks are spread across data nodes in the cluster
‣   Each block is replicated multiple times to different data nodes in the cluster
    (typically 3 times)
‣   Master node keeps track of which blocks belong to a file
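The overview above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not HDFS code: the function names are hypothetical, and the round-robin placement is a simplification (real HDFS placement also takes rack topology into account).

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical block size named above
REPLICATION = 3                # the typical replication factor named above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the (offset, length) of each block of a file of `file_size` bytes."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct data nodes (round-robin)."""
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the replication factor")
    return {
        block: [data_nodes[(block + r) % len(data_nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

The mapping from file to block list and from block to data nodes is the kind of bookkeeping the master node holds.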
HDFS interaction

‣   Accessible through Java API
‣   FUSE (filesystem in user space) driver available to mount as regular FS
‣   C API available
‣   Basic command line tools in Hadoop distribution
‣   Web interface
HDFS interaction

‣ File creation, directory listing and other meta data actions go through the master
    node (e.g. ls, du, fsck, create file)
‣   Data goes directly to and from data nodes (read, write, append)
‣   Local read path optimization: clients located on same machine as data node will
    always access local replica when possible
Hadoop FileSystem (HDFS)
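[Diagram: an HDFS client creates files through the Name Node, which tracks paths such as /some/file and /foo/bar. The client writes data to and reads data from the Data Nodes directly. Each Data Node stores blocks on its local disks and replicates them to other Data Nodes. A node-local HDFS client reads data from the Data Node on its own machine.]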
HDFS daemons: NameNode

‣   Filesystem master node
‣   Keeps track of directories, files and block locations
‣   Assigns blocks to data nodes
‣   Keeps track of live nodes (through heartbeats)
‣   Initiates re-replication in case of data node loss

‣ Block meta data is held in memory
  • Will run out of memory when too many files exist
‣ Is a SINGLE POINT OF FAILURE in the system
  • Some solutions exist
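A rough Python sketch of the bookkeeping described above: heartbeats decide which data nodes are alive, and blocks with too few live replicas become candidates for re-replication. The class name, the method names and the 30 second timeout are assumptions for illustration and do not reflect the real NameNode API or its defaults.

```python
class NameNodeSketch:
    """Toy model of liveness tracking and under-replication detection."""

    def __init__(self, replication=3, timeout=30.0):
        self.replication = replication
        self.timeout = timeout        # assumed: seconds without heartbeat before a node counts as lost
        self.last_heartbeat = {}      # data node -> time of last heartbeat
        self.block_locations = {}     # block id -> set of data nodes holding it

    def heartbeat(self, node, now):
        self.last_heartbeat[node] = now

    def report_block(self, node, block_id):
        self.block_locations.setdefault(block_id, set()).add(node)

    def live_nodes(self, now):
        return {n for n, t in self.last_heartbeat.items() if now - t <= self.timeout}

    def under_replicated(self, now):
        """Return {block id: number of missing replicas} for the current time."""
        live = self.live_nodes(now)
        missing = {}
        for block_id, nodes in self.block_locations.items():
            alive = len(nodes & live)
            if alive < self.replication:
                missing[block_id] = self.replication - alive
        return missing


nn = NameNodeSketch()
for node in ("dn1", "dn2", "dn3", "dn4"):
    nn.heartbeat(node, now=0.0)
for node in ("dn1", "dn2", "dn3"):
    nn.report_block(node, "b1")
for node in ("dn1", "dn2", "dn4"):  # dn3 stops sending heartbeats
    nn.heartbeat(node, now=10.0)
to_fix = nn.under_replicated(now=40.0)
```

Everything in this model lives in dictionaries in memory, which mirrors why the number of files and blocks is bounded by the master node's RAM.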
HDFS daemons: DataNode

‣ Filesystem worker node / “Block server”
‣ Uses underlying regular FS for storage (e.g. ext3)
  • Takes care of distribution of blocks across disks
  • Don’t use RAID
  • More disks means more IO throughput
‣ Sends heartbeats to NameNode
‣ Reports blocks to NameNode (on startup)
‣ Does not know about the rest of the cluster (shared nothing)
Things to know about HDFS

‣ HDFS is write once, read many
  • But has append support in newer versions
‣ Has built in compression at the block level
‣ Does end-to-end checksumming on all data
‣ Has tools for parallelized copying of large amounts of data to other HDFS
    clusters (distcp)
‣   Provides a convenient file format to gather lots of small files into a single large
    one
    • Remember the NameNode running out of memory with too many files?
‣ HDFS is best used for large, unstructured volumes of raw data in BIG files used
    for batch operations
    • Optimized for sequential reads, not random access
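To illustrate the checksumming point, here is a small Python sketch of per-chunk checksums that let a reader detect which part of a block is corrupt. The 512 byte chunk size and the function names are assumptions for this example; it is not the HDFS implementation.

```python
import zlib

CHUNK_SIZE = 512  # assumed chunk size for this sketch


def checksums(data, chunk_size=CHUNK_SIZE):
    """Compute a CRC32 checksum for every chunk of `data`."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def find_corrupt_chunks(data, expected, chunk_size=CHUNK_SIZE):
    """Return the indexes of chunks whose checksum differs from `expected`."""
    actual = checksums(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


original = bytes(range(256)) * 8          # 2048 bytes, four chunks
stored_sums = checksums(original)         # computed when the data is written

damaged = bytearray(original)
damaged[600] ^= 0xFF                      # flip the bits of one byte in the second chunk
corrupt = find_corrupt_chunks(bytes(damaged), stored_sums)
```

A reader that finds a corrupt chunk can fall back to another replica of the same block.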
Hadoop Sequence Files

‣   Special type of file to store Key-Value pairs
‣   Stores keys and values as byte arrays
‣   Uses length encoded bytes as format
‣   Often used as input or output format for MapReduce jobs
‣   Has built in compression on values
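The idea of length encoded key-value records can be sketched as follows. This is a simplified illustration in Python and does not reproduce the real SequenceFile layout (which also has a header, sync markers and optional compression); the 4 byte big-endian length fields and the function names are assumptions.

```python
import io
import struct


def write_record(stream, key, value):
    """Write one record: key length, value length, key bytes, value bytes."""
    stream.write(struct.pack(">II", len(key), len(value)))
    stream.write(key)
    stream.write(value)


def read_records(stream):
    """Yield (key, value) byte pairs until the stream is exhausted."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of file
        if len(header) < 8:
            raise ValueError("truncated record header")
        key_len, value_len = struct.unpack(">II", header)
        key = stream.read(key_len)
        value = stream.read(value_len)
        if len(key) != key_len or len(value) != value_len:
            raise ValueError("truncated record body")
        yield key, value


buffer = io.BytesIO()
write_record(buffer, b"user-1", b'{"tweets": 12}')
write_record(buffer, b"user-2", b"")
buffer.seek(0)
records = list(read_records(buffer))
```

Because every record carries its own lengths, many small items can be packed into one large file and still be read back sequentially.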
Example: command line directory listing



friso@fvv:~/java$ hadoop fs -ls /
Found 3 items
drwxr-xr-x    - friso supergroup    0 2011-03-31 17:06 /Users
drwxr-xr-x    - friso supergroup    0 2011-03-16 14:16 /hbase
drwxr-xr-x    - friso supergroup    0 2011-04-18 11:33 /user
friso@fvv:~/java$
Example: NameNode web interface
Example: copy local file to HDFS




friso@fvv:~/Downloads$ hadoop fs -put ./some-tweets.json tweets-data.json
MapReduce


Massively parallelizable
computing
MapReduce, the algorithm
[Figure: input data shown next to the required output]
Map: extract something useful from each record
void map(recordNumber, record) {
  // The colorful shape in the record becomes the key, its gray shapes the value.
  key = record.findColorfulShape();
  value = record.findGrayShapes();
  emit(key, value);
}
Framework sorts all KeyValue pairs by Key
Reduce: process values for each key
void reduce(key, values) {
  // Collect every value emitted for this key into one list.
  allGrayShapes = [];
  foreach (value in values) {
    allGrayShapes.push(value);
  }
  emit(key, allGrayShapes);
}
MapReduce, the algorithm

[Diagram: map tasks emit key/value pairs, the framework sorts the pairs by key, and reduce tasks turn the values for each key into the output key/value pairs.]
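The three phases (map, sort by key, reduce) can be simulated in memory with a few lines of Python. This is a single-process sketch for understanding the data flow, not how Hadoop runs jobs; the records are hypothetical (colorful shape, gray shape) tuples and all function names are assumptions.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, sort and reduce over `records` in a single process."""
    # Map phase: every record may emit any number of (key, value) pairs.
    pairs = []
    for record_number, record in enumerate(records):
        pairs.extend(map_fn(record_number, record))
    # Sort phase: the framework orders all pairs by key.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: reduce_fn is called once per unique key.
    output = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        output.extend(reduce_fn(key, (value for _key, value in group)))
    return output


def map_shapes(record_number, record):
    colorful_shape, gray_shape = record  # hypothetical record layout
    yield colorful_shape, gray_shape


def reduce_shapes(key, values):
    all_gray_shapes = []
    for value in values:
        all_gray_shapes.append(value)
    yield key, all_gray_shapes


records = [("red", "circle"), ("blue", "square"), ("red", "triangle")]
result = run_mapreduce(records, map_shapes, reduce_shapes)
```

The map and reduce functions never see each other; only the sorted key/value pairs connect them, which is what makes both phases easy to parallelize.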
Hadoop MapReduce: parallelized on top of HDFS

‣ Job input comes from files on HDFS
  • Typically sequence files
  • Other formats are possible; requires specialized InputFormat implementation
  • Built in support for text files (convenient for logs, csv, etc.)
  • Files must be splittable for parallelization to work
    - Not all compression formats have this property (e.g. gzip)
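What "splittable" means for a text file can be sketched as below: each mapper gets a byte range and must decide on its own which lines belong to it. The rule used here (a line belongs to the split in which it starts) is one common convention and an assumption of this sketch, as are the function names; it is not a copy of Hadoop's record reader.

```python
def compute_splits(file_size, split_size):
    """Return [start, end) byte ranges covering the file."""
    return [
        (start, min(start + split_size, file_size))
        for start in range(0, file_size, split_size)
    ]


def read_split_lines(data, start, end):
    """Return the lines that start inside the byte range [start, end)."""
    pos = start
    if start != 0:
        # Skip forward to the first line that starts at or after `start`;
        # a line already in progress belongs to the previous split.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1
    lines = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        lines.append(data[pos:stop])  # may read past `end` to finish the line
        pos = stop + 1
    return lines


data = b"alpha\nbravo\ncharlie\ndelta\n"
splits = compute_splits(len(data), 10)
per_split = [read_split_lines(data, start, end) for start, end in splits]
```

This only works because a reader can start at an arbitrary byte offset and find the next record boundary. A gzip stream cannot be decompressed starting from the middle, which is why a gzip file has to be processed by a single mapper.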
MapReduce daemons: JobTracker

‣   MapReduce master node
‣   Takes care of scheduling and job submission
‣   Splits jobs into tasks (Mappers and Reducers)
‣   Assigns tasks to worker nodes
‣   Reassigns tasks in case of failure
‣   Keeps track of job progress
‣   Keeps track of worker nodes through heartbeats
MapReduce daemons: TaskTracker

‣   MapReduce worker process
‣   Starts Mappers and Reducers assigned by JobTracker
‣   Sends heartbeats to the JobTracker
‣   Sends task progress to the JobTracker
‣   Does not know about the rest of the cluster (shared nothing)
Hadoop MapReduce: Mapper side

‣ Each mapper processes a piece of the total input
  • Typically blocks that reside on the same machine as the mapper (local
      datanode)
‣   Mappers sort output by key and store it on the local disk
    • If the mapper output does not fit in RAM, on disk merge sort happens
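A sketch of the mapper-side sort, assuming a buffer that holds only a fixed number of pairs: full buffers are sorted and set aside as runs, and the runs are merged at the end. In this Python illustration the runs are in-memory lists standing in for spill files on local disk; the function name and the `buffer_size` parameter are hypothetical.

```python
import heapq
from operator import itemgetter


def sort_with_spills(pairs, buffer_size):
    """Sort (key, value) pairs by key using bounded buffers and a final merge.

    Returns the sorted pairs and the number of sorted runs that were created.
    """
    runs = []    # each run stands in for one sorted spill file
    buffer = []
    for pair in pairs:
        buffer.append(pair)
        if len(buffer) >= buffer_size:
            runs.append(sorted(buffer, key=itemgetter(0)))
            buffer = []
    if buffer:
        runs.append(sorted(buffer, key=itemgetter(0)))
    # Merge step: every run is read sequentially, front to back.
    merged = list(heapq.merge(*runs, key=itemgetter(0)))
    return merged, len(runs)


pairs = [("d", 1), ("a", 2), ("c", 3), ("b", 4), ("e", 5)]
sorted_pairs, run_count = sort_with_spills(pairs, buffer_size=2)
```

Only one buffer of pairs has to fit in memory at a time; the merge needs just the current head of each run.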
Hadoop MapReduce: Reducer side

‣ Reducers collect sorted input KeyValue pairs over the network from Mappers
  • Reducer performs (on disk) merge on inputs from different mappers
‣ Reducer calls the reduce method for each unique key
  • List of values for each key is read from local disk (the result of the merge)
  • Values do not need to fit in RAM
    - Reduce methods that need a global view need enough RAM to fit all values
       for a key

‣ Reducer writes output KeyValue pairs to HDFS
  • Typically blocks go to local data node
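The reducer side can be sketched in Python as a streaming merge followed by grouping on key. This is an in-memory illustration under assumptions: the sorted lists stand in for sorted mapper outputs fetched over the network, and `reduce_side` and `sum_values` are hypothetical names.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def reduce_side(mapper_outputs, reduce_fn):
    """Merge sorted mapper outputs and call reduce_fn once per unique key."""
    merged = heapq.merge(*mapper_outputs, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        # `values` is a lazy iterator: the values are streamed, not collected.
        values = (value for _key, value in group)
        yield from reduce_fn(key, values)


def sum_values(key, values):
    # Needs no global view of the values, so memory use stays constant.
    total = 0
    for value in values:
        total += value
    yield key, total


mapper_outputs = [
    [("a", 1), ("c", 1)],  # sorted output of one mapper
    [("a", 2), ("b", 5)],  # sorted output of another mapper
]
result = list(reduce_side(mapper_outputs, sum_values))
```

A reduce function such as `sum_values` consumes the values one at a time. A reduce function that first collects all values into a list is the case that needs enough RAM to hold every value for a key.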
<PLUG>
                           Summer Classes
   Big data crunching using Hadoop and other NoSQL tools
   •   Write Hadoop MapReduce jobs in Java
   •   Run on an actual cluster pre-loaded with several datasets
   •   Create a simple application or visualization with the result
   •   Learn about Hadoop without the hassle of building a production cluster first
   •   Have lots of fun!

                 Dates: July 12, August 10
             Only € 295,= for a full day course
                   http://www.xebia.com/summerclasses/bigdata

                                                                            </PLUG>

Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

Hadoop, HDFS and MapReduce

  • 1. Hadoop and MapReduce: The workings of the elephant. Friso van Vollenhoven, fvanvollenhoven@xebia.com
  • 2. Data everywhere ‣ Global data volume grows exponentially ‣ Information retrieval is BIG business these days ‣ Need means of economically storing and processing large data sets
  • 3. Opportunity ‣ Commodity hardware is ultra cheap ‣ CPU and storage even cheaper
  • 4. Traditional solution ‣ Store data in a (relational) database ‣ Run batch jobs for processing
  • 5. Problems with existing solutions ‣ Databases are seek heavy; B-tree gives log(n) random accesses per update ‣ Seeks are wasted time, nothing of value happens during seeks ‣ Databases do not play well with commoditized hardware (SANs and 16 CPU machines are not in the price sweet spot of performance / $) ‣ Databases were not built with horizontal scaling in mind
  • 6. Solution: sort/merge vs. updating the B-tree ‣ Eliminate the seeks, only sequential reading / writing ‣ Work with batches for efficiency ‣ Parallelize workload ‣ Distribute processing and storage
  • 7. History ‣ 2000: Apache Lucene: batch index updates and sort/merge with on-disk index ‣ 2002: Apache Nutch: distributed, scalable open source web crawler; sort/merge optimization applies ‣ 2004: Google publishes GFS and MapReduce papers ‣ 2006: Apache Hadoop: open source Java implementation of GFS and MR to solve Nutch’s problem; later becomes standalone project ‣ 2011: We’re here learning about it!
  • 8. Hadoop foundations ‣ Commodity hardware ($3K to $7K machines) ‣ Only sequential reads / writes ‣ Distribution of data and processing across cluster ‣ Built-in reliability / fault tolerance / redundancy ‣ Disk based, does not require data or indexes to fit in RAM ‣ Apache licensed, Open Source Software
  • 10. The US government builds its fingerprint search index using Hadoop.
  • 12. The content for the People You May Know feature is created by a chain of many MapReduce jobs that run daily. The jobs are reportedly a combination of graph traversal, clustering and assisted machine learning.
  • 14. Amazon’s Frequently Bought Together and Customers Who Bought This Item Also Bought features are brought to you by MapReduce jobs. Recommendation based on large sales transaction datasets is a common use case.
  • 16. Top Charts generated daily based on millions of users’ listening behavior.
  • 18. Top searches used for auto-completion are re-generated daily by a MapReduce job using all searches for the past couple of days. Popularity for search terms can be based on counts, but also trending and correlation with other datasets (e.g. trending on social media, news, charts in case of music and movies, best seller lists, etc.)
  • 20. HDFS: Hadoop Filesystem
  • 21. HDFS overview ‣ Distributed filesystem ‣ Consists of a single master node and multiple (many) data nodes ‣ Files are split up into blocks (typically 64MB) ‣ Blocks are spread across data nodes in the cluster ‣ Each block is replicated multiple times to different data nodes in the cluster (typically 3 times) ‣ Master node keeps track of which blocks belong to a file
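As a rough illustration of the overview above, the following sketch splits a file of a given size into 64 MB blocks and assigns each block to three distinct data nodes. The class name, method names and the round-robin placement are assumptions made for illustration only; this is not how HDFS actually chooses replica locations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: split a file into fixed-size blocks and pick
// replica locations round-robin. Not HDFS's real placement policy.
class BlockPlacementSketch {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB, as on the slide
    static final int REPLICATION = 3;                 // typical replication factor

    // Number of blocks needed for a file; the last block may be partially filled.
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // For each block, choose REPLICATION distinct data nodes, round-robin.
    static List<List<String>> placeBlocks(long fileSizeBytes, List<String> dataNodes) {
        if (dataNodes.size() < REPLICATION) {
            throw new IllegalArgumentException("need at least " + REPLICATION + " data nodes");
        }
        List<List<String>> placement = new ArrayList<>();
        long blocks = blockCount(fileSizeBytes);
        for (long b = 0; b < blocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add(dataNodes.get((int) ((b + r) % dataNodes.size())));
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("dn1", "dn2", "dn3", "dn4");
        long size = 200L * 1024 * 1024; // a 200 MB file needs 4 blocks
        List<List<String>> placement = placeBlocks(size, nodes);
        for (int i = 0; i < placement.size(); i++) {
            System.out.println("block " + i + " -> " + placement.get(i));
        }
    }
}
```

The list returned by placeBlocks plays the role of the master node's bookkeeping: it records which blocks make up the file and where each replica lives.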
  • 22. HDFS interaction ‣ Accessible through Java API ‣ FUSE (filesystem in user space) driver available to mount as regular FS ‣ C API available ‣ Basic command line tools in Hadoop distribution ‣ Web interface
  • 23. HDFS interaction ‣ File creation, directory listing and other meta data actions go through the master node (e.g. ls, du, fsck, create file) ‣ Data goes directly to and from data nodes (read, write, append) ‣ Local read path optimization: clients located on same machine as data node will always access local replica when possible
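  • 24. Hadoop FileSystem (HDFS) [Diagram: an HDFS client creates files through the Name Node, which holds paths such as /some/file and /foo/bar, and reads and writes data directly on the Data Nodes; each Data Node has several disks and blocks are replicated between Data Nodes; a node-local HDFS client reads data from the Data Node on its own machine]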
  • 25. HDFS daemons: NameNode ‣ Filesystem master node ‣ Keeps track of directories, files and block locations ‣ Assigns blocks to data nodes ‣ Keeps track of live nodes (through heartbeats) ‣ Initiates re-replication in case of data node loss ‣ Block meta data is held in memory • Will run out of memory when too many files exist ‣ Is a SINGLE POINT OF FAILURE in the system • Some solutions exist
  • 26. HDFS daemons: DataNode ‣ Filesystem worker node / “Block server” ‣ Uses underlying regular FS for storage (e.g. ext3) • Takes care of distribution of blocks across disks • Don’t use RAID • More disks means more IO throughput ‣ Sends heartbeats to NameNode ‣ Reports blocks to NameNode (on startup) ‣ Does not know about the rest of the cluster (shared nothing)
  • 27. Things to know about HDFS ‣ HDFS is write once, read many • But has append support in newer versions ‣ Has built in compression at the block level ‣ Does end-to-end checksumming on all data ‣ Has tools for parallelized copying of large amounts of data to other HDFS clusters (distcp) ‣ Provides a convenient file format to gather lots of small files into a single large one • Remember the NameNode running out of memory with too many files? ‣ HDFS is best used for large, unstructured volumes of raw data in BIG files used for batch operations • Optimized for sequential reads, not random access
  • 28. Hadoop Sequence Files ‣ Special type of file to store Key-Value pairs ‣ Stores keys and values as byte arrays ‣ Uses length encoded bytes as format ‣ Often used as input or output format for MapReduce jobs ‣ Has built in compression on values
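To make "length encoded bytes" concrete, here is a minimal sketch that writes key-value pairs as byte arrays, each preceded by its length, and reads them back. It only illustrates the idea of length prefixing; the class and method names are hypothetical and this is not the actual SequenceFile layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of length-prefixed key/value records.
// Illustrates the idea only; this is NOT the real SequenceFile format.
class LengthPrefixedRecords {
    // Encode each {key, value} pair as: key length, key bytes, value length, value bytes.
    static byte[] encode(List<byte[][]> records) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            for (byte[][] record : records) {
                out.writeInt(record[0].length);
                out.write(record[0]);
                out.writeInt(record[1].length);
                out.write(record[1]);
            }
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decode the byte stream back into {key, value} pairs, reading sequentially.
    static List<byte[][]> decode(byte[] data) {
        try {
            List<byte[][]> records = new ArrayList<>();
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            while (in.available() > 0) {
                byte[] key = new byte[in.readInt()];
                in.readFully(key);
                byte[] value = new byte[in.readInt()];
                in.readFully(value);
                records.add(new byte[][] {key, value});
            }
            return records;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<byte[][]> records = new ArrayList<>();
        records.add(new byte[][] {
            "user-1".getBytes(StandardCharsets.UTF_8),
            "hello".getBytes(StandardCharsets.UTF_8)});
        for (byte[][] record : decode(encode(records))) {
            System.out.println(new String(record[0], StandardCharsets.UTF_8)
                    + " = " + new String(record[1], StandardCharsets.UTF_8));
        }
    }
}
```

Because every field carries its own length, a reader can walk the file strictly front to back without seeking, which fits the sequential access pattern HDFS is optimized for.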
  • 29. Example: command directory listing
      friso@fvv:~/java$ hadoop fs -ls /
      Found 3 items
      drwxr-xr-x   - friso supergroup          0 2011-03-31 17:06 /Users
      drwxr-xr-x   - friso supergroup          0 2011-03-16 14:16 /hbase
      drwxr-xr-x   - friso supergroup          0 2011-04-18 11:33 /user
      friso@fvv:~/java$
  • 31. Example: copy local file to HDFS
      friso@fvv:~/Downloads$ hadoop fs -put ./some-tweets.json tweets-data.json
  • 32. MapReduce: Massively parallelizable computing
  • 33. MapReduce, the algorithm [Figure: the input data and the required output]
  • 34. Map: extract something useful from each record
      void map(recordNumber, record) {
        key = record.findColorfulShape();
        value = record.findGrayShapes();
        emit(key, value);
      }
      [Figure: map tasks turning input records into KEYS / VALUES pairs]
  • 35. Framework sorts all KeyValue pairs by Key [Figure: KEYS / VALUES pairs before and after sorting]
  • 36. Reduce: process values for each key
      void reduce(key, values) {
        allGrayShapes = [];
        foreach (value in values) {
          allGrayShapes.push(value);
        }
        emit(key, allGrayShapes);
      }
      [Figure: reduce tasks turning sorted KEYS / VALUES pairs into output KEYS / VALUES pairs]
  • 37. MapReduce, the algorithm [Figure: map tasks emit KEYS / VALUES pairs, the pairs are sorted by key, and reduce tasks produce the output KEYS / VALUES pairs]
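The map, sort and reduce steps from the preceding slides can be sketched as a single-process Java program. This is a toy illustration under stated assumptions: the MiniMapReduce class and the representation of a record as a {colorfulShape, grayShape} pair are hypothetical, and nothing here uses the Hadoop API or runs in parallel.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process sketch of map -> sort by key -> reduce.
class MiniMapReduce {
    // Map: each record is {colorfulShape, grayShape}; emit it as (key, value).
    static String[] map(String[] record) {
        String key = record[0];
        String value = record[1];
        return new String[] {key, value};
    }

    // Reduce: gather all values seen for one key into a single list.
    static List<String> reduce(String key, List<String> values) {
        return new ArrayList<>(values);
    }

    static Map<String, List<String>> run(List<String[]> records) {
        // Stand-in for the framework's sort: a TreeMap keeps keys ordered
        // and groups the values that share a key.
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (String[] record : records) {
            String[] kv = map(record);
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        // Call reduce once per unique key, in key order.
        Map<String, List<String>> output = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            output.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
            new String[] {"star", "circle"},
            new String[] {"heart", "square"},
            new String[] {"star", "triangle"});
        System.out.println(run(records)); // {heart=[square], star=[circle, triangle]}
    }
}
```

In a real cluster the grouping step is the expensive part, because pairs with the same key are produced on different machines and have to be brought together before reduce can run.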
  • 38. Hadoop MapReduce: parallelized on top of HDFS ‣ Job input comes from files on HDFS • Typically sequence files • Other formats are possible but require a specialized InputFormat implementation • Built-in support for text files (convenient for logs, CSV, etc.) • Files must be splittable for parallelization to work - Not all compression formats have this property (e.g. gzip files cannot be split)
  • 39. MapReduce daemons: JobTracker ‣ MapReduce master node ‣ Takes care of scheduling and job submission ‣ Splits jobs into tasks (Mappers and Reducers) ‣ Assigns tasks to worker nodes ‣ Reassigns tasks in case of failure ‣ Keeps track of job progress ‣ Keeps track of worker nodes through heartbeats
  • 40. MapReduce daemons: TaskTracker ‣ MapReduce worker process ‣ Starts Mappers and Reducers assigned by JobTracker ‣ Sends heartbeats to the JobTracker ‣ Sends task progress to the JobTracker ‣ Does not know about the rest of the cluster (shared nothing)
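The heartbeat mechanism described for the JobTracker and TaskTracker can be sketched as a table of last-seen timestamps on the master. The HeartbeatMonitor class and the timeout value are assumptions for illustration; Hadoop's own bookkeeping is more involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of how a master could track live workers via heartbeats.
class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a worker reports in; records the time of the report.
    void heartbeat(String worker, long nowMillis) {
        lastSeen.put(worker, nowMillis);
    }

    // Workers whose last heartbeat is older than the timeout; their tasks
    // would be candidates for reassignment to other workers.
    List<String> deadWorkers(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(1000); // assumed 1 s timeout
        monitor.heartbeat("tasktracker-1", 0);
        monitor.heartbeat("tasktracker-2", 0);
        monitor.heartbeat("tasktracker-1", 900); // only tasktracker-1 keeps reporting
        System.out.println(monitor.deadWorkers(1500));
    }
}
```

Only the master holds this table, which matches the shared-nothing design: workers report upward and never need to know about each other.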
  • 42. Hadoop MapReduce: Mapper side ‣ Each mapper processes a piece of the total input • Typically blocks that reside on the same machine as the mapper (local datanode) ‣ Mappers sort output by key and store it on the local disk • If the mapper output does not fit in RAM, on disk merge sort happens
  • 43. Hadoop MapReduce: Reducer side ‣ Reducers collect sorted input KeyValue pairs over the network from Mappers • Reducer performs (on disk) merge on inputs from different mappers ‣ Reducer calls the reduce method for each unique key • List of values for each key is read from local disk (the result of the merge) • Values do not need to fit in RAM - Reduce methods that need a global view, need enough RAM to fit all values for a key ‣ Reducer writes output KeyValue pairs to HDFS • Typically blocks go to local data node
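The reducer-side merge of sorted mapper outputs is a k-way merge. The sketch below shows the technique in memory with a priority queue; the SortedRunMerger class is hypothetical, and Hadoop performs this merge on disk over KeyValue pairs rather than over in-memory lists of keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical in-memory sketch of a k-way merge over sorted runs.
class SortedRunMerger {
    // One heap entry: the current key of a run plus the remainder of that run.
    private static final class Head {
        final String key;
        final Iterator<String> rest;

        Head(String key, Iterator<String> rest) {
            this.key = key;
            this.rest = rest;
        }
    }

    // Merge several individually sorted key lists into one sorted list.
    static List<String> merge(List<List<String>> sortedRuns) {
        PriorityQueue<Head> heap = new PriorityQueue<>((a, b) -> a.key.compareTo(b.key));
        for (List<String> run : sortedRuns) {
            Iterator<String> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            // Take the smallest current key, then advance the run it came from.
            Head head = heap.poll();
            merged.add(head.key);
            if (head.rest.hasNext()) {
                heap.add(new Head(head.rest.next(), head.rest));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> mapperOutputs = Arrays.asList(
            Arrays.asList("a", "c", "e"),
            Arrays.asList("b", "c", "d"));
        System.out.println(merge(mapperOutputs)); // [a, b, c, c, d, e]
    }
}
```

Each run is only ever read front to back, so the same approach works when the runs are files on disk: equal keys come out next to each other, and the values for one key can be streamed to the reduce method without holding everything in RAM.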
  • 45. <PLUG> Summer Classes: Big data crunching using Hadoop and other NoSQL tools • Write Hadoop MapReduce jobs in Java • Run on an actual cluster pre-loaded with several datasets • Create a simple application or visualization with the result • Learn about Hadoop without the hassle of building a production cluster first • Have lots of fun! Dates: July 12, August 10 Only € 295,= for a full day course http://www.xebia.com/summerclasses/bigdata </PLUG>