Nordic Perl Workshop 2013

Playing with Hadoop
Søren Lund (slu)
slu@369.dk
DISCLAIMER








I have no experience with Hadoop in a real-world project
The installation notes I present are not
necessarily suitable for production
The example scripts have not been used on
real (big) data
Hence the title Playing with Hadoop
About Hadoop (and Big Data)
The Problem (it's not new)









We have (access to)
more and more data
Processing this data
takes longer and
longer
Not enough memory
Running out of disk
space
Our trusty old server
can't keep up

!!!!!
Scaling up








Upgrade hardware:
bigger and faster
Redundancy: power
supply, RAID, hotswap
Expensive to keep
scaling up
Our software will run
without modifications
Scaling out








Add more
(commodity) servers
Redundancy is
replaced by
replication
You can keep on
scaling out, it's cheap
How do we enable
our software to run
across multiple
servers?
Google solved this


Google published two papers


Google File System (GFS), 2003
http://research.google.com/archive/gfs.html






MapReduce, 2004
http://research.google.com/archive/mapreduce.html

GFS and MapReduce provided a platform for
processing huge amounts of data in an efficient
way
Hadoop was born





Doug Cutting read the Google papers
Based on those, he created Hadoop
(named after his son's toy elephant)
It is an implementation of GFS/MapReduce
(Open Source / Apache License)



Written in Java and deployed on Linux



Originally part of Lucene, now an Apache project in its own right



https://hadoop.apache.org/
Hadoop Components


Hadoop Common – utilities to control the rest



HDFS – Hadoop Distributed File System



YARN – Yet Another Resource Negotiator



MapReduce – YARN-based parallel processing



This enables us to write software that can
handle Big Data by scaling out
Big Data isn't just big


Huge amounts of data (volume)



Unstructured data (form)



Highly dynamic data (burst/change rate)



Big Data is actually hard-to-handle (with
traditional tools/methods) data
Examples of Big Data


Log files, e.g.





web server access logs
application logs

Internet feeds





Twitter, Facebook, etc.
RSS

Images (face recognition, tagging)
Installing Hadoop
Needed to run Hadoop


You need the following to run Hadoop



Java JDK





Linux server
Hadoop tarball

I'm using the following



JDK 1.6.24 64 bit





Ubuntu 12.04 LTS 64 bit
Hadoop 1.0.4

Could not get JDK7 + Hadoop 2.2 to work
Install Java
Setup Java home and path
Add hadoop user
Create SSH key for hadoop user
Accept SSH key
Install Hadoop and add to path
Disable IPv6
Reboot and check installation
Running an example job
Calculate Pi
Estimated value of Pi
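
To make the Pi job less of a black box, here is a minimal sketch of the idea in plain Python. It is not the code from the Hadoop examples jar (which is documented as using a quasi-Monte Carlo method); it uses ordinary pseudo-random sampling. The names map_task, reduce_task and estimate_pi are made up for this illustration: each "map" counts random points that land inside a quarter circle, and the "reduce" sums the partial counts.

```python
import random


def map_task(seed, samples):
    """One 'map': count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside, samples


def reduce_task(partials):
    """The 'reduce': sum the partial counts and turn the ratio into an estimate of Pi."""
    inside = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return 4.0 * inside / total


def estimate_pi(num_maps=4, samples_per_map=100_000):
    # Each map gets its own seed, much like each map task gets its own slice of work.
    partials = [map_task(seed, samples_per_map) for seed in range(num_maps)]
    return reduce_task(partials)


if __name__ == "__main__":
    print("Estimated value of Pi is", estimate_pi())
```

More maps or more samples per map should give a better estimate, at the cost of a longer run.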
Three modes of operation


Pi was calculated in Local standalone mode





it is the default mode (i.e. no configuration needed)
all components of Hadoop run in a single JVM

Pseudo-distributed mode



components communicate using sockets





a separate JVM is spawned for each component
it is a mini-cluster on a single host

Fully distributed mode


components are spread across multiple machines
Create base directory for HDFS
Set JAVA_HOME
Edit core-site.xml
Edit hdfs-site.xml
Edit mapred-site.xml
Log out and log on as hadoop
Format HDFS
Start HDFS
Start MapReduce
Create home directory & test data
Running Word Count
First let's try the example jar
Inspect the result
Compile and run our own jar
https://gist.github.com/soren/7213273
Inspect result
Run improved version
https://gist.github.com/soren/7213453
Inspect (improved) result
Hadoop MapReduce





A reducer will get all values associated with a
given key
Precursor job can be used to normalize data
Combiners can be used to perform early (map-side)
aggregation of map output before it is sent to the reducer
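
The points above can be sketched in a few lines of plain Python, with no Hadoop involved. This is only a toy model under stated assumptions: every input line stands in for one map task, the shuffle is a dictionary, and the function names (map_words, combine, shuffle, reduce_counts, word_count) are invented for this example. It shows a combiner that pre-sums counts on the map side, and a reducer that receives all values for a given key.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-sum the counts from a single map task before anything is sent on."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle(all_pairs):
    """Shuffle: group values by key, so one reduce call sees every value for its key."""
    groups = defaultdict(list)
    for key, value in all_pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: receives all values for one key and sums them."""
    return key, sum(values)


def word_count(lines):
    combined = []
    for line in lines:  # each line plays the role of one map task's input
        combined.extend(combine(map_words(line)))
    groups = shuffle(combined)
    return dict(reduce_counts(key, groups[key]) for key in sorted(groups))


if __name__ == "__main__":
    print(word_count(["the quick fox", "the lazy dog and the fox"]))
```

Dropping the call to combine() gives the same result; the combiner only reduces how many pairs have to travel from the map side to the reducers.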
Perl MapReduce
Playing with MapReduce



We don't need Hadoop to play with MapReduce
Instead we can emulate Hadoop using two
scripts



wc_mapper.pl – a Word Count Mapper



wc_reducer.pl – a Word Count Reducer



We connect them using a pipe (|)



Very Unix-like!
Run MapReduce without Hadoop
https://gist.github.com/soren/7596270
https://gist.github.com/soren/7596285
Hadoop's Streaming interface


Enables you to write jobs in any programming
language, e.g. Perl



Input from STDIN



Output to STDOUT



Key/Value pairs separated by TAB



Reducers will get values one-by-one



Not to be confused with Hadoop Pipes, which
provides a native C++ interface to Hadoop
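
The Streaming conventions listed above (lines in, lines out, TAB between key and value, values arriving one by one) can be sketched as follows. This is an illustration in Python, not the Perl scripts from the gists, and the function names mapper, reducer and run are assumptions made for this example. Because the reducer is handed one line at a time rather than a ready-made group, it has to notice for itself when the key changes.

```python
def mapper(lines):
    """Read raw text lines, write one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Read 'key<TAB>value' lines one by one; the input must be sorted by key.

    Values are not grouped for us, so we watch for the key to change.
    """
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:  # flush the last key
        yield f"{current}\t{total}"


def run(lines):
    """Emulate `mapper | sort | reducer` in one process."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for out in run(["to be or not to be"]):
        print(out)
```

The sorted() call stands in for the sort that sits between the map and reduce steps; without it, equal keys are not adjacent and the reducer would emit the same word more than once.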
Run Perl Word Count
https://gist.github.com/soren/7596270

https://gist.github.com/soren/7596285
Inspect result
Hadoop::Streaming


Perl interface to Hadoop's Streaming interface



Implemented in Moose



You can now implement your MapReduce as


a class with a map() and reduce() method



a mapper script



a reducer script
Installing Hadoop::Streaming


Btw, Perl was already installed on the server ;-)



But we want to install Hadoop::Streaming



I also had to install local::lib to make it work



All you have to do is
sudo cpan local::lib Hadoop::Streaming



Nice and easy
Run Hadoop::Streaming job
https://gist.github.com/soren/7596451
https://gist.github.com/soren/7600134
https://gist.github.com/soren/7600144
Inspect result
Some final notes and loose ends
The Web User Interface


HDFS
http://localhost:8070/

MapReduce
http://localhost:8030/

File Browser
http://localhost:8075/browseDirectory.jsp?namenodeInfo

Note: this is with port forwarding in VirtualBox
50030 → 8030, 50070 → 8070, 50075 → 8075
Joins in Hadoop


It's possible to implement joins in MapReduce





Reduce-joins – simple
Map-joins – less data to transfer

Do you need joins?


Maybe your data has structure → SQL?



Try Hive (HiveQL)



Or Pig (Pig Latin)
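
For the curious, a reduce-join can be sketched in plain Python along these lines. It is a toy model, not Hadoop code: the shuffle is a dictionary, the records and field names in the usage example are hypothetical, and the helper names (map_tagged, reduce_join, reduce_side_join) are invented here. The map step tags each record with the data set it came from, and the reduce step pairs up the records that share a key.

```python
from collections import defaultdict


def map_tagged(records, tag, key_field):
    """Map: emit (join key, (source tag, record)) so the reducer can tell the sources apart."""
    return [(rec[key_field], (tag, rec)) for rec in records]


def reduce_join(values):
    """Reduce: all records sharing a key arrive together; pair every left with every right."""
    left = [rec for tag, rec in values if tag == "L"]
    right = [rec for tag, rec in values if tag == "R"]
    return [{**lrec, **rrec} for lrec in left for rrec in right]


def reduce_side_join(left_records, right_records, key_field):
    tagged = map_tagged(left_records, "L", key_field)
    tagged += map_tagged(right_records, "R", key_field)
    groups = defaultdict(list)  # stands in for the shuffle phase
    for key, value in tagged:
        groups[key].append(value)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(groups[key]))
    return joined


if __name__ == "__main__":
    # Hypothetical sample data, only for illustration.
    users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
    visits = [{"id": 1, "url": "/a"}, {"id": 1, "url": "/b"}, {"id": 3, "url": "/c"}]
    for row in reduce_side_join(users, visits, "id"):
        print(row)
```

Every record from both data sets passes through the shuffle here, which is the cost a map-join tries to avoid.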
Hadoop in the Cloud


Elastic MapReduce (EMR)
http://aws.amazon.com/elasticmapreduce/



Essentially Hadoop in the Cloud



Built on EC2 and S3



You can upload JARs or scripts
There's more


Distributions








Cloudera Distribution for Hadoop (CDH)
http://www.cloudera.com/
Hortonworks Data Platform (HDP)
http://hortonworks.com/

HBase, Hive, Pig and other related projects
https://hadoop.apache.org/
But, a basic Hadoop setup is a good start


and a nice place to just play with Hadoop
I like big data and I can not lie
Oh, my God, Becky, look at the data, it's so big
It looks like one of those Hadoop guys' setups
Who understands those Hadoop guys
They only map/reduce it because it is on a
distributed file system
I mean the data, it's just so big
I can't believe it's so huge
It's just out there, I mean, it's gross
Look, it's just so blah
The End

Questions?

Slides will be available at http://www.slideshare.net/slu/
Find me on Twitter https://twitter.com/slu
