Implementing Hadoop on a single cluster

Salil Navgire

Basic Setup
1. Install Ubuntu
2. Install Java and Python, and update the system
3. Add the group ‘hadoop’ and the user ‘hduser’ (for security and backup)
4. Configure SSH
a) Install OpenSSH Server
b) Configure it by editing the ssh_config file, saving a backup first
c) Generate an SSH key for hduser
d) Enable SSH access to your local machine with the newly created RSA key
e) Test the connection: hduser@Ubuntu:~$ ssh localhost
5. Disable IPv6 by editing the sysctl.conf file
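
The IPv6 step can be sketched in Python as a small text edit. This is only an illustration: it assumes the three standard Linux `net.ipv6.conf.*.disable_ipv6` keys are the ones wanted, the helper name `disable_ipv6` is hypothetical, and the function works on a string, so writing the result back to /etc/sysctl.conf (as root) is left to the reader.

```python
# Settings assumed for this sketch: the standard Linux sysctl keys
# that switch IPv6 off for all, default and loopback interfaces.
IPV6_SETTINGS = (
    "net.ipv6.conf.all.disable_ipv6 = 1",
    "net.ipv6.conf.default.disable_ipv6 = 1",
    "net.ipv6.conf.lo.disable_ipv6 = 1",
)


def disable_ipv6(sysctl_text):
    """Return sysctl.conf text with the IPv6-disabling lines appended if missing."""
    existing = {line.strip() for line in sysctl_text.splitlines()}
    missing = [setting for setting in IPV6_SETTINGS if setting not in existing]
    if not missing:
        return sysctl_text
    # Make sure the original text ends with a newline before appending.
    if sysctl_text and not sysctl_text.endswith("\n"):
        sysctl_text += "\n"
    return sysctl_text + "\n".join(missing) + "\n"


if __name__ == "__main__":
    # Demo on an in-memory string; nothing on disk is touched.
    print(disable_ipv6("kernel.sysrq = 0\n"), end="")
```

The function is idempotent, so running it twice on the same text adds nothing the second time. After the real file is changed, the settings usually take effect on reboot or after reloading sysctl.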

Installing Hadoop
1. Download Hadoop from one of the Apache Download Mirrors and unpack it
• salil@ubuntu:/usr/local$ sudo tar xzf hadoop-2.0.6-alpha-src.tar.gz

2. Make sure to change the owner to hduser in the hadoop group
• $ sudo chown -R hduser:hadoop hadoop (changes the ownership)

3. Update $HOME/.bashrc with the Hadoop-related environment variables

Configuration
1. Edit the environment variables in conf/hadoop-env.sh
2. Change the settings in conf/*-site.xml
3. Create the directory and set the required ownership and permissions
• $ sudo mkdir -p /app/hadoop/tmp
• $ sudo chown hduser:hadoop /app/hadoop/tmp
• $ sudo chmod 750 /app/hadoop/tmp

4. Add configuration snippets between the <configuration> ... </configuration> tags in core-site.xml, mapred-site.xml and hdfs-site.xml
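
The deck does not show the snippets themselves, so the following Python sketch generates plausible ones for a single-node setup. Only the /app/hadoop/tmp directory comes from the steps above; the property names are the Hadoop 1.x ones, the ports 54310 and 54311 are assumed values, and `render_site_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Property values are assumptions for a single-node setup; only the
# /app/hadoop/tmp directory comes from the steps above.
SITE_FILES = {
    "core-site.xml": {
        "hadoop.tmp.dir": "/app/hadoop/tmp",
        "fs.default.name": "hdfs://localhost:54310",  # assumed port
    },
    "mapred-site.xml": {
        "mapred.job.tracker": "localhost:54311",  # assumed port
    },
    "hdfs-site.xml": {
        "dfs.replication": "1",  # one copy of each block on a single node
    },
}


def render_site_xml(properties):
    """Build a <configuration> document from a name -> value mapping."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    for filename, properties in SITE_FILES.items():
        print(f"<!-- {filename} -->")
        print(render_site_xml(properties))
```

Each printed document corresponds to one file in the conf directory. A replication factor of 1 is the usual choice here because a single node cannot hold more than one useful copy of a block.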

Starting your single node cluster
• First format the namenode
• hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format

• Start your single node cluster
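
The start command is not shown in the deck. On Hadoop 1.x the daemons are typically started with bin/start-all.sh and then checked with the JDK's `jps` tool. The sketch below assumes that workflow; the expected daemon list matches the JobTracker/TaskTracker setup this deck refers to, and the helper names `missing_daemons` and `start_and_check` are hypothetical.

```python
import subprocess

# Daemons expected on a Hadoop 1.x single-node cluster (an assumption;
# Hadoop 2.x replaces JobTracker/TaskTracker with YARN daemons).
EXPECTED_DAEMONS = (
    "NameNode",
    "DataNode",
    "SecondaryNameNode",
    "JobTracker",
    "TaskTracker",
)


def missing_daemons(jps_output):
    """Return the expected daemons that do not appear in `jps` output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Each jps line looks like "<pid> <main class name>".
        if len(parts) >= 2:
            running.add(parts[1])
    return [name for name in EXPECTED_DAEMONS if name not in running]


def start_and_check(hadoop_home="/usr/local/hadoop"):
    """Start all daemons, then list the ones that failed to come up."""
    subprocess.run([f"{hadoop_home}/bin/start-all.sh"], check=True)
    jps = subprocess.run(["jps"], capture_output=True, text=True, check=True)
    return missing_daemons(jps.stdout)


if __name__ == "__main__":
    # Illustrative jps output in which the NameNode did not start.
    sample = (
        "2287 TaskTracker\n2149 JobTracker\n1938 DataNode\n"
        "2085 SecondaryNameNode\n2349 Jps\n"
    )
    print(missing_daemons(sample))  # prints ['NameNode']
```

An empty result from `start_and_check()` suggests that all five daemons are up; a missing NameNode often points back to the format step or to the permissions on the temporary directory.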

• Running a MapReduce job
• Download data and copy from local file to HDFS
• hduser@ubuntu:~$ hadoop dfs -copyFromLocal /home/hduser/project.txt /user/new
• hduser@ubuntu:~$ hadoop dfs -copyFromLocal /home/hduser/hadoop/project.txt /user/lol

• hduser@ubuntu:~$ hadoop dfs -ls /user/lol
Found 2 items
drwxr-xr-x - hduser supergroup 0 2013-10-10 06:30 /user/lol/output
-rw-r--r-- 1 hduser supergroup 969039 2013-10-05 20:20 /user/lol/project.txt
• hduser@ubuntu:~$ hadoop jar /home/hduser/hadoop/hadoop-examples-1.0.3.jar wordcount /user/lol/project.txt /user/lol/output/
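
Once the wordcount job finishes, its results can be inspected from Python. The sketch below assumes the job wrote one word and its count per line, separated by a tab, into part files under the output directory; the part-file name in the comment and the helper name `top_words` are assumptions, not something the deck states.

```python
from collections import Counter


def top_words(lines, n=10):
    """Rank words from wordcount output lines of the form 'word<TAB>count'."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, _, count = line.rpartition("\t")
        # Add rather than assign, in case several part files are concatenated.
        counts[word] += int(count)
    return counts.most_common(n)


if __name__ == "__main__":
    # Illustrative input; real input would come from something like
    #   hadoop dfs -cat /user/lol/output/part-r-00000   (file name assumed)
    sample = ["hadoop\t3\n", "the\t12\n", "cluster\t5\n"]
    for word, count in top_words(sample, n=2):
        print(word, count)
```

Piping the output of `hadoop dfs -cat` into a script built around this function gives a quick sanity check that the job counted what was expected.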
• Hadoop Web interfaces
• http://localhost:50070/ – web UI of the NameNode daemon
• http://localhost:50030/ – web UI of the JobTracker daemon
• http://localhost:50060/ – web UI of the TaskTracker daemon

• The NameNode web interface gives a cluster summary: total and remaining capacity, and live and dead nodes.
• Additionally, we can browse HDFS to view the contents of files and logs.

• The JobTracker web interface provides general job statistics for the Hadoop cluster, running/completed/failed jobs and a job history log file.
• The TaskTracker web interface provides information about running and non-running tasks.

Writing MapReduce programs
• The Hadoop framework is written in Java, which can be hard to code in for people without a CS background
• MapReduce programs can be written in Python and converted to a .jar file using Jython to run on a Hadoop cluster

• But Jython's standard library is incomplete, because some Python features are not provided in Jython
• An alternative is to use Hadoop Streaming

• Hadoop Streaming is a utility that comes with the Hadoop distribution; it can run any executable script as a mapper and reducer

• Write mapper.py and reducer.py in Python
• Download the data and copy it to HDFS

• Run the job in the same way as the previous Java implementation
• There are other third-party Python MapReduce solutions that are similar to Streaming/Jython but can easily be used as a library in Python
Python implementation strategies
• Streaming
• mrjob
• dumbo
• Hadoopy

• Non-Hadoop
• disco

• Prefer Hadoop Streaming if possible because it is easy and has the lowest overhead
• Prefer mrjob where you need a higher level of abstraction and integration with AWS

Future Work
• Python implementation in Hadoop
• Running Hadoop in a multi-node cluster
• Pig and its implementation on Linux
• Apache Mahout, Hive, Solr
