SlideShare a Scribd company logo
1 of 13
Implementing
Hadoop on a Single
Cluster
- S A L IL NAVG IR E
Basic Setup
1.

Install Ubuntu

2.

Install Java, Python and update

3.

Add group ‘hadoop’ and ‘hduser’ as user (for security and
backup)

4.

Configure SSH
a)
b)

Configure it by editing file ssh_config and save a backup

c)

Generate ssh key for hduser

d)

Enable ssh access to your local machine with the newly created RSA
key

e)

5.

Install OpenSSH Server

hduser@Ubuntu:~$ ssh localhost

Disable IPv6 in sysctl.conf file in editor
Installing Hadoop
1. Download hadoop from the collection of Apache
Download Mirrors
• salil@ubuntu:/usr/local$ sudo tar xzf hadoop-2.0.6-alphasrc.tar.gz

2. Make sure to change the owner to hduser in
hadoop group
• $ sudo chown -R hduser:hadoop hadoop (change the
permissions)

3. Update $HOME/.bashrc – hadoop related
environment variables
Configuration
1. Edit environment variables in conf/hadoop-env.sh
2. Change settings in conf/*site.xml
3. Make directory and set the required ownerships and
permissions
• Now we create the directory and set the required ownerships
and permissions:
• $ sudo mkdir -p /app/hadoop/tmp
• $ sudo chown hduser:hadoop /app/hadoop/tmp
• $ sudo chmod 750 /app/hadoop/tmp

4. Add configurations snippets between <configuration>
... </configuration> tags in core-site.xml, mapredsite.xml and hdfs-site.xml
Starting your single node cluster
• First format the namenode
•

hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop
namenode -format

• Start your single node cluster
• Running a MapReduce job
• Download data and copy from local file to hdfs
• hduser@ubuntu:~$ hadoop dfs -copyFromLocal
/home/hduser/project.txt /user/new
• hduser@ubuntu:~$ hadoop dfs -copyFromLocal
/home/hduser/hadoop/project.txt /user/lol
• hduser@ubuntu:~$ hadoop dfs -ls /user/lol
Found 2 items
drwxr-xr-x - hduser supergroup
0 2013-10-10
06:30 /user/lol/output
-rw-r--r-- 1 hduser supergroup 969039 2013-1005 20:20 /user/lol/project.txt
• hduser@ubuntu:~$ hadoop jar
/home/hduser/hadoop/hadoop-examples-1.0.3.jar
wordcount /user/lol/project.txt /user/lol/output/
• Hadoop Web interfaces
• http://localhost:50070/ – web UI of the NameNode daemon
• http://localhost:50030/ – web UI of the JobTracker daemon
• http://localhost:50060/ – web UI of the TaskTracker daemon
• The NameNode
Web interface gives
us a cluster
summary about
total /remaining,
capacity, live and
dead nodes.
• Aditionally we can
browse the HDFS to
view contents of
files and log
• The Jobtracker
Web interface
provides general
job statistics
about Hadoop
cluster,
running/complet
ed/failed jobs
and a job history
log file
• Tasktracker
provides info
about running
and non-running
tasks
Writing MapReduce programs
• Hadoop framework is written in java, which is
complicated to code for Non-CS guys
• Can be written in Python and converted to .jar file using
Jython to run on a Hadoop cluster

• But Jython has incomplete standard library because
some Python features not provided in Jython
• Alternative is to use Hadoop Streaming

• Hadoop streaming is the utility that comes with Hadoop
distribution; able to run any executable script as a
mapper and reducer
• Write mapper.py and reducer.py in python
• Download and copy data to HDFS

• Run same as previous java implementation
• There are other third party solutions of Python
Mapreduce which are similar to Streaming/Jython
but can be easily used as a library in Python
Python implementation stratagies
• Streaming
• mrjob
• dumbo
• Hadoopy

• Non-Hadoop
• disco

• Prefer Hadoop streaming if possible because it is
easy and has the lowest overhead
• Prefer mrjob where you need higher abstraction
and integration with AWS
Future Work….
• Python implementation in Hadoop
• Running Hadoop in Multi node cluster
• Pig and its implementation on linux
• Apache Mahout, Hive, Solr

More Related Content

What's hot

8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker8a. How To Setup HBase with Docker
8a. How To Setup HBase with DockerFabio Fumarola
 
An example Hadoop Install
An example Hadoop InstallAn example Hadoop Install
An example Hadoop InstallMike Frampton
 
Apache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentApache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentFarzad Nozarian
 
Hadoop 2.0 cluster setup on ubuntu 14.04 (64 bit)
Hadoop 2.0 cluster setup on ubuntu 14.04 (64 bit)Hadoop 2.0 cluster setup on ubuntu 14.04 (64 bit)
Hadoop 2.0 cluster setup on ubuntu 14.04 (64 bit)Nag Arvind Gudiseva
 
Web scraping with nutch solr part 2
Web scraping with nutch solr part 2Web scraping with nutch solr part 2
Web scraping with nutch solr part 2Mike Frampton
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache HiveAvkash Chauhan
 
Install hadoop in a cluster
Install hadoop in a clusterInstall hadoop in a cluster
Install hadoop in a clusterXuhong Zhang
 
Introduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Apache HBase - Lab Assignment
Apache HBase - Lab AssignmentApache HBase - Lab Assignment
Apache HBase - Lab AssignmentFarzad Nozarian
 
Boulder dev ops-meetup-11-2012-rundeck
Boulder dev ops-meetup-11-2012-rundeckBoulder dev ops-meetup-11-2012-rundeck
Boulder dev ops-meetup-11-2012-rundeckWill Sterling
 
Beeswax Hive editor in Hue
Beeswax Hive editor in HueBeeswax Hive editor in Hue
Beeswax Hive editor in HueRomain Rigaux
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friendslucenerevolution
 
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Configuringahadoop
ConfiguringahadoopConfiguringahadoop
Configuringahadoopmensb
 

What's hot (17)

8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker
 
An example Hadoop Install
An example Hadoop InstallAn example Hadoop Install
An example Hadoop Install
 
Apache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentApache HDFS - Lab Assignment
Apache HDFS - Lab Assignment
 
Hadoop 2.0 cluster setup on ubuntu 14.04 (64 bit)
Hadoop 2.0 cluster setup on ubuntu 14.04 (64 bit)Hadoop 2.0 cluster setup on ubuntu 14.04 (64 bit)
Hadoop 2.0 cluster setup on ubuntu 14.04 (64 bit)
 
Shark - Lab Assignment
Shark - Lab AssignmentShark - Lab Assignment
Shark - Lab Assignment
 
Web scraping with nutch solr part 2
Web scraping with nutch solr part 2Web scraping with nutch solr part 2
Web scraping with nutch solr part 2
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 
Install hadoop in a cluster
Install hadoop in a clusterInstall hadoop in a cluster
Install hadoop in a cluster
 
Introduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLab
 
Apache HBase - Lab Assignment
Apache HBase - Lab AssignmentApache HBase - Lab Assignment
Apache HBase - Lab Assignment
 
Boulder dev ops-meetup-11-2012-rundeck
Boulder dev ops-meetup-11-2012-rundeckBoulder dev ops-meetup-11-2012-rundeck
Boulder dev ops-meetup-11-2012-rundeck
 
Beeswax Hive editor in Hue
Beeswax Hive editor in HueBeeswax Hive editor in Hue
Beeswax Hive editor in Hue
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
 
Configuringahadoop
ConfiguringahadoopConfiguringahadoop
Configuringahadoop
 
Pig
PigPig
Pig
 
Run wordcount job (hadoop)
Run wordcount job (hadoop)Run wordcount job (hadoop)
Run wordcount job (hadoop)
 

Viewers also liked

Salil presentation 11.07
Salil presentation 11.07Salil presentation 11.07
Salil presentation 11.07Salil Navgire
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and HadoopSalil Navgire
 
Challenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on HadoopChallenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on HadoopDataWorks Summit
 
Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)
Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)
Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)Michael Arnold
 
Data Mining and Recommendation Systems
Data Mining and Recommendation SystemsData Mining and Recommendation Systems
Data Mining and Recommendation SystemsSalil Navgire
 
Data-Ed Webinar: A Framework for Implementing NoSQL, Hadoop
Data-Ed Webinar: A Framework for Implementing NoSQL, HadoopData-Ed Webinar: A Framework for Implementing NoSQL, Hadoop
Data-Ed Webinar: A Framework for Implementing NoSQL, HadoopDATAVERSITY
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Milind Bhandarkar
 
Monitor PowerKVM using Ganglia, Nagios
Monitor PowerKVM using Ganglia, NagiosMonitor PowerKVM using Ganglia, Nagios
Monitor PowerKVM using Ganglia, NagiosPradeep Kumar
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programPraveen Kumar Donta
 
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...Hortonworks
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHumza Naseer
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 DataWorks Summit
 
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsCloudera, Inc.
 
Implementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceImplementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceHortonworks
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseDataWorks Summit
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Hortonworks
 
Hadoop and Your Data Warehouse
Hadoop and Your Data WarehouseHadoop and Your Data Warehouse
Hadoop and Your Data WarehouseCaserta
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with HadoopOReillyStrata
 

Viewers also liked (20)

Salil presentation 11.07
Salil presentation 11.07Salil presentation 11.07
Salil presentation 11.07
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and Hadoop
 
Anomaly Detection
Anomaly DetectionAnomaly Detection
Anomaly Detection
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Challenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on HadoopChallenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on Hadoop
 
Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)
Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)
Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)
 
Data Mining and Recommendation Systems
Data Mining and Recommendation SystemsData Mining and Recommendation Systems
Data Mining and Recommendation Systems
 
Data-Ed Webinar: A Framework for Implementing NoSQL, Hadoop
Data-Ed Webinar: A Framework for Implementing NoSQL, HadoopData-Ed Webinar: A Framework for Implementing NoSQL, Hadoop
Data-Ed Webinar: A Framework for Implementing NoSQL, Hadoop
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
Monitor PowerKVM using Ganglia, Nagios
Monitor PowerKVM using Ganglia, NagiosMonitor PowerKVM using Ganglia, Nagios
Monitor PowerKVM using Ganglia, Nagios
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce program
 
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing Architectures
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
 
Implementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceImplementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data Governance
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
 
Hadoop and Your Data Warehouse
Hadoop and Your Data WarehouseHadoop and Your Data Warehouse
Hadoop and Your Data Warehouse
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
 

Similar to Implementing Hadoop on a single cluster

02 Hadoop deployment and configuration
02 Hadoop deployment and configuration02 Hadoop deployment and configuration
02 Hadoop deployment and configurationSubhas Kumar Ghosh
 
Big data with hadoop Setup on Ubuntu 12.04
Big data with hadoop Setup on Ubuntu 12.04Big data with hadoop Setup on Ubuntu 12.04
Big data with hadoop Setup on Ubuntu 12.04Mandakini Kumari
 
Single node hadoop cluster installation
Single node hadoop cluster installation Single node hadoop cluster installation
Single node hadoop cluster installation Mahantesh Angadi
 
Hadoop single node setup
Hadoop single node setupHadoop single node setup
Hadoop single node setupMohammad_Tariq
 
DC HUG Hadoop for Windows
DC HUG Hadoop for WindowsDC HUG Hadoop for Windows
DC HUG Hadoop for WindowsTerry Padgett
 
Cloudera hadoop installation
Cloudera hadoop installationCloudera hadoop installation
Cloudera hadoop installationSumitra Pundlik
 
Playing with Hadoop (NPW2013)
Playing with Hadoop (NPW2013)Playing with Hadoop (NPW2013)
Playing with Hadoop (NPW2013)Søren Lund
 
Big data using Hadoop, Hive, Sqoop with Installation
Big data using Hadoop, Hive, Sqoop with InstallationBig data using Hadoop, Hive, Sqoop with Installation
Big data using Hadoop, Hive, Sqoop with Installationmellempudilavanya999
 
Big data processing using hadoop poster presentation
Big data processing using hadoop poster presentationBig data processing using hadoop poster presentation
Big data processing using hadoop poster presentationAmrut Patil
 
Hadoop cluster 安裝
Hadoop cluster 安裝Hadoop cluster 安裝
Hadoop cluster 安裝recast203
 
Distro-independent Hadoop cluster management
Distro-independent Hadoop cluster managementDistro-independent Hadoop cluster management
Distro-independent Hadoop cluster managementDataWorks Summit
 
Hadoop installation with an example
Hadoop installation with an exampleHadoop installation with an example
Hadoop installation with an exampleNikita Kesharwani
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configurationprabakaranbrick
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop OverviewBrian Enochson
 
Yahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopYahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopHortonworks
 
Configure h base hadoop and hbase client
Configure h base hadoop and hbase clientConfigure h base hadoop and hbase client
Configure h base hadoop and hbase clientShashwat Shriparv
 
LuisRodriguezLocalDevEnvironmentsDrupalOpenDays
LuisRodriguezLocalDevEnvironmentsDrupalOpenDaysLuisRodriguezLocalDevEnvironmentsDrupalOpenDays
LuisRodriguezLocalDevEnvironmentsDrupalOpenDaysLuis Rodríguez Castromil
 
Hadoop 20111117
Hadoop 20111117Hadoop 20111117
Hadoop 20111117exsuns
 

Similar to Implementing Hadoop on a single cluster (20)

02 Hadoop deployment and configuration
02 Hadoop deployment and configuration02 Hadoop deployment and configuration
02 Hadoop deployment and configuration
 
Big data with hadoop Setup on Ubuntu 12.04
Big data with hadoop Setup on Ubuntu 12.04Big data with hadoop Setup on Ubuntu 12.04
Big data with hadoop Setup on Ubuntu 12.04
 
Single node hadoop cluster installation
Single node hadoop cluster installation Single node hadoop cluster installation
Single node hadoop cluster installation
 
Hadoop single node setup
Hadoop single node setupHadoop single node setup
Hadoop single node setup
 
DC HUG Hadoop for Windows
DC HUG Hadoop for WindowsDC HUG Hadoop for Windows
DC HUG Hadoop for Windows
 
Cloudera hadoop installation
Cloudera hadoop installationCloudera hadoop installation
Cloudera hadoop installation
 
Playing with Hadoop (NPW2013)
Playing with Hadoop (NPW2013)Playing with Hadoop (NPW2013)
Playing with Hadoop (NPW2013)
 
Big data using Hadoop, Hive, Sqoop with Installation
Big data using Hadoop, Hive, Sqoop with InstallationBig data using Hadoop, Hive, Sqoop with Installation
Big data using Hadoop, Hive, Sqoop with Installation
 
Big data processing using hadoop poster presentation
Big data processing using hadoop poster presentationBig data processing using hadoop poster presentation
Big data processing using hadoop poster presentation
 
Exp-3.pptx
Exp-3.pptxExp-3.pptx
Exp-3.pptx
 
Hadoop cluster 安裝
Hadoop cluster 安裝Hadoop cluster 安裝
Hadoop cluster 安裝
 
Distro-independent Hadoop cluster management
Distro-independent Hadoop cluster managementDistro-independent Hadoop cluster management
Distro-independent Hadoop cluster management
 
Hadoop installation with an example
Hadoop installation with an exampleHadoop installation with an example
Hadoop installation with an example
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
 
Yahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopYahoo! Hack Europe Workshop
Yahoo! Hack Europe Workshop
 
Configure h base hadoop and hbase client
Configure h base hadoop and hbase clientConfigure h base hadoop and hbase client
Configure h base hadoop and hbase client
 
LuisRodriguezLocalDevEnvironmentsDrupalOpenDays
LuisRodriguezLocalDevEnvironmentsDrupalOpenDaysLuisRodriguezLocalDevEnvironmentsDrupalOpenDays
LuisRodriguezLocalDevEnvironmentsDrupalOpenDays
 
Hadoop 20111117
Hadoop 20111117Hadoop 20111117
Hadoop 20111117
 
#WeSpeakLinux Session
#WeSpeakLinux Session#WeSpeakLinux Session
#WeSpeakLinux Session
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 

Recently uploaded (20)

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 

Implementing Hadoop on a single cluster

  • 1. Implementing Hadoop on a Single Cluster - S A L IL NAVG IR E
  • 2. Basic Setup 1. Install Ubuntu 2. Install Java, Python and update 3. Add group ‘hadoop’ and ‘hduser’ as user (for security and backup) 4. Configure SSH a) b) Configure it by editing file ssh_config and save a backup c) Generate ssh key for hduser d) Enable ssh access to your local machine with the newly created RSA key e) 5. Install OpenSSH Server hduser@Ubuntu:~$ ssh localhost Disable IPv6 in sysctl.conf file in editor
  • 3. Installing Hadoop 1. Download hadoop from the collection of Apache Download Mirrors • salil@ubuntu:/usr/local$ sudo tar xzf hadoop-2.0.6-alphasrc.tar.gz 2. Make sure to change the owner to hduser in hadoop group • $ sudo chown -R hduser:hadoop hadoop (change the permissions) 3. Update $HOME/.bashrc – hadoop related environment variables
  • 4. Configuration 1. Edit environment variables in conf/hadoop-env.sh 2. Change settings in conf/*site.xml 3. Make directory and set the required ownerships and permissions • Now we create the directory and set the required ownerships and permissions: • $ sudo mkdir -p /app/hadoop/tmp • $ sudo chown hduser:hadoop /app/hadoop/tmp • $ sudo chmod 750 /app/hadoop/tmp 4. Add configurations snippets between <configuration> ... </configuration> tags in core-site.xml, mapredsite.xml and hdfs-site.xml
  • 5. Starting your single node cluster • First format the namenode • hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format • Start your single node cluster
  • 6. • Running a MapReduce job • Download data and copy from local file to hdfs • hduser@ubuntu:~$ hadoop dfs -copyFromLocal /home/hduser/project.txt /user/new • hduser@ubuntu:~$ hadoop dfs -copyFromLocal /home/hduser/hadoop/project.txt /user/lol
  • 7. • hduser@ubuntu:~$ hadoop dfs -ls /user/lol Found 2 items drwxr-xr-x - hduser supergroup 0 2013-10-10 06:30 /user/lol/output -rw-r--r-- 1 hduser supergroup 969039 2013-1005 20:20 /user/lol/project.txt • hduser@ubuntu:~$ hadoop jar /home/hduser/hadoop/hadoop-examples-1.0.3.jar wordcount /user/lol/project.txt /user/lol/output/ • Hadoop Web interfaces • http://localhost:50070/ – web UI of the NameNode daemon • http://localhost:50030/ – web UI of the JobTracker daemon • http://localhost:50060/ – web UI of the TaskTracker daemon
  • 8. • The NameNode Web interface gives us a cluster summary about total /remaining, capacity, live and dead nodes. • Aditionally we can browse the HDFS to view contents of files and log
  • 9. • The Jobtracker Web interface provides general job statistics about Hadoop cluster, running/complet ed/failed jobs and a job history log file • Tasktracker provides info about running and non-running tasks
  • 10. Writing MapReduce programs • Hadoop framework is written in java, which is complicated to code for Non-CS guys • Can be written in Python and converted to .jar file using Jython to run on a Hadoop cluster • But Jython has incomplete standard library because some Python features not provided in Jython • Alternative is to use Hadoop Streaming • Hadoop streaming is the utility that comes with Hadoop distribution; able to run any executable script as a mapper and reducer
  • 11. • Write mapper.py and reducer.py in python • Download and copy data to HDFS • Run same as previous java implementation • There are other third party solutions of Python Mapreduce which are similar to Streaming/Jython but can be easily used as a library in Python
  • 12. Python implementation stratagies • Streaming • mrjob • dumbo • Hadoopy • Non-Hadoop • disco • Prefer Hadoop streaming if possible because it is easy and has the lowest overhead • Prefer mrjob where you need higher abstraction and integration with AWS
  • 13. Future Work…. • Python implementation in Hadoop • Running Hadoop in Multi node cluster • Pig and its implementation on linux • Apache Mahout, Hive, Solr