getFamiliarWithHadoop

AmirReza Mohammadi, CEO at Vira Afzar Infrastructure Development Co.
lug.getFamiliarWithHadoop();
PRESENTED BY A.R.MOHAMMADI
AMIRMHD.IR
INTRODUCTION
• Big data is a term used to describe the voluminous amount of unstructured and semi-structured data a
company creates.
• Data that would take too much time and cost too much money to load into a relational database for
analysis.
• Big data doesn't refer to any specific quantity; the term is often used when speaking about petabytes and exabytes of data.
DATA GENERATION
• The New York Stock Exchange generates about one terabyte of new trade data per day.
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
• Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
• The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per
month.
• The Large Hadron Collider near Geneva, Switzerland, produces about 15 petabytes of data per year.
WHAT CAUSED THE PROBLEM?
• Standard hard drive size: 1,370 MB in 1990 vs. 1,000,000 MB (1 TB) in 2010
• Data transfer rate: 4.4 MB/s in 1990 vs. 100 MB/s in 2010
• Storage capacity grew roughly 700-fold over those 20 years, while transfer speed grew only about 20-fold.
SO WHAT IS THE PROBLEM?
• The transfer speed is around 100 MB/s
• A standard disk is 1 terabyte
• Time to read the entire disk = 10,000 seconds, or almost 3 hours!
• Increase in processing rate may not be as helpful because
• Network bandwidth is now more of a limiting factor
• Physical limits of processor chips have been reached
getFamiliarWithHadoop
SO WHAT DO WE DO?
• The obvious solution is to use multiple processors on the same problem by splitting it into pieces.
• Imagine if we had 100 drives, each holding one
hundredth of the data. Working in parallel, we could
read the data in under two minutes.
DISTRIBUTED COMPUTING VS PARALLELIZATION
• Parallelization: multiple processors or CPUs in a single machine
• Distributed computing: multiple computers connected via a network
DISTRIBUTED COMPUTING
The key issues involved in this solution:
• Hardware failure
• Combining the data after analysis
• Network-associated problems
WHAT CAN WE DO WITH A DISTRIBUTED COMPUTER
SYSTEM?
• IBM Deep Blue
• Index the Web (Google)
• Simulating an internet size network for network experiments
• Analysing Complex Networks
• ...
PROBLEMS IN DISTRIBUTED COMPUTING
• Hardware Failure:
As soon as we start using many pieces of hardware, the chance
that one will fail is fairly high.
• Combine the data after analysis:
Most analysis tasks need to be able to combine the data in some
way; data read from one disk may need to be combined with the
data from any of the other 99 disks.
HADOOP
• Apache Hadoop is an open-source software framework that supports data-intensive distributed
applications, licensed under the Apache v2 license.
• A common way of avoiding data loss is through replication: redundant copies of the data are kept by the
system so that in the event of failure, there is another copy available. The Hadoop Distributed Filesystem
(HDFS), takes care of this problem.
• The second problem is solved by a simple programming model: MapReduce. Hadoop is the popular open
source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of
very large data sets.
DEVELOPER
Doug Cutting
2005: Doug Cutting and Michael J. Cafarella developed
Hadoop to support distribution for the Nutch search
engine project.
The project was funded by Yahoo.
2006: Yahoo gave the project to Apache
Software Foundation.
WHAT ELSE IS HADOOP?
A reliable shared storage and analysis system.
There are other subprojects of Hadoop that provide complementary services, or build on the core to add
higher-level abstractions. The various subprojects of Hadoop include:
1. Core
2. Avro
3. Pig
4. HBase
5. Zookeeper
6. Hive
7. Chukwa
HADOOP APPROACH TO DISTRIBUTED COMPUTING
The theoretical 1000-CPU machine
would cost a very large amount of
money, far more than 1,000 single-CPU machines.
Hadoop will tie these smaller and more
reasonably priced machines together
into a single cost-effective compute
cluster.
2008 - Hadoop Wins Terabyte Sort Benchmark
(sorted 1 terabyte of data in 209 seconds, compared
to previous record of 297 seconds)
MAP-REDUCE
• Hadoop limits the amount of communication which can be performed by the processes, as each
individual record is processed by a task in isolation from the others
• By restricting the communication between nodes, Hadoop makes the distributed system much more
reliable. Individual node failures can be worked around by restarting tasks on other machines.
• The other workers continue to operate as though nothing went wrong, leaving the challenging aspects of
partially restarting the program to the underlying Hadoop layer.
Map: (in_key, in_value) -> (out_key, intermediate_value) list
Reduce: (out_key, intermediate_value list) -> out_value list
WHAT IS MAPREDUCE?
• MapReduce is a programming model
• Programs written in this functional style are automatically parallelized and executed on a large
cluster of commodity machines
• MapReduce is an associated implementation for processing and generating large data sets.
MapReduce
MAP
a map function that processes a key/value pair to generate a set of intermediate key/value pairs
map(in_key, in_value) -> (out_key, intermediate_value) list
REDUCE
a reduce function that merges all intermediate values associated with the same intermediate key
reduce(out_key, intermediate_value list) -> out_value list
EXAMPLE
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
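For comparison, a minimal sketch of the same word count written against Hadoop's Java MapReduce API (the org.apache.hadoop.mapreduce classes). This is illustrative rather than the exact example bundled with Hadoop; it assumes a Hadoop 2.x client API, and the class names TokenizerMapper and IntSumReducer are just local names.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map: (offset, line of text) -> (word, 1) for every word in the line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);        // EmitIntermediate(w, 1)
      }
    }
  }

  // reduce: (word, [1, 1, ...]) -> (word, total count)
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);        // Emit(word, sum)
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, this would be submitted with bin/hadoop jar, just like the bundled wordcount example run against the Gutenberg texts later in this deck.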
DATA LOCALITY OPTIMIZATION
• The compute nodes and the storage nodes are the same. The Map-Reduce framework and the
Distributed File System run on the same set of nodes. This configuration allows the framework to
effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate
bandwidth across the cluster.
• If this is not possible, the computation is done by another processor on the same rack.
MAPREDUCE DATA FLOW WITH A SINGLE REDUCE TASK
MAPREDUCE DATA FLOW WITH MULTIPLE REDUCE TASKS
HADOOP STREAMING:
• Hadoop provides an API to MapReduce that allows you to write your map and reduce
functions in languages other than Java.
• Hadoop Streaming uses Unix standard streams as the interface between Hadoop and
your program, so you can use any language that can read standard input and write to
standard output to write your MapReduce program.
HADOOP PIPES:
• Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce.
• Unlike Streaming, which uses standard input and output to communicate with the map and
reduce code, Pipes uses sockets as the channel over which the tasktracker communicates
with the process running the C++ map or reduce function.
HADOOP DISTRIBUTED FILESYSTEM (HDFS)
• Filesystems that manage the storage across a network of machines are called distributed
filesystems.
• HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very
large amounts of data (terabytes or even petabytes), and provide high-throughput access to this
information.
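To make the client side concrete, here is a minimal sketch that reads a file from HDFS through the org.apache.hadoop.fs.FileSystem Java API. It assumes core-site.xml (with fs.default.name, e.g. hdfs://localhost:54310 as configured later in this deck) is on the classpath; the default path reuses the Gutenberg example file and is purely illustrative.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    // Reads fs.default.name from core-site.xml to find the NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path(args.length > 0 ? args[0] : "/user/hduser/gutenberg/pg20417.txt");
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);    // print each line of the HDFS file to stdout
      }
    }
  }
}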
PROBLEMS IN DISTRIBUTED FILE SYSTEMS
Distributed filesystems are more complex than regular disk filesystems because the data is spread over
multiple nodes, so all the complications of network programming kick in.
 Hardware Failure
An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact
that there are a huge number of components and that each component has a non-trivial probability of failure means that some
component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core
architectural goal of HDFS.
 Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to
support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should
support tens of millions of files in a single instance.
DESIGN OF HDFS
• Very large files
Files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running
today that store petabytes of data.
• Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-
times pattern.
A dataset is typically generated or copied from source, then various analyses are performed on that
dataset over time. Each analysis will involve a large proportion of the dataset, so the time to read the
whole dataset is more important than the latency in reading the first record.
NAMENODES AND DATANODES
• An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the master)
and a number of datanodes (workers).
• The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata
for all the files and directories in the tree.
• Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to
(by clients or the namenode), and they report back to the namenode periodically with lists of blocks
that they are storing.
Without the namenode, the filesystem cannot be used. In fact, if the
machine running the namenode were obliterated, all the files on the
filesystem would be lost since there would be no way of knowing
how to reconstruct the files from the blocks on the datanodes.
DATA REPLICATION
• The blocks of a file are replicated for fault tolerance.
• The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat
and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the
DataNode is functioning properly.
• A Blockreport contains a list of all blocks on a DataNode.
• When the replication factor is three, HDFS’s placement policy is to put one replica on one node in the
local rack, another on a different node in the local rack, and the last on a different node in a different
rack.
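The replication factor can also be changed per file from client code. A minimal sketch, assuming the standard org.apache.hadoop.fs.FileSystem API; the path is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Ask the NameNode to keep 3 copies of every block of this file;
    // the actual re-replication happens asynchronously in the background.
    Path file = new Path("/user/hduser/gutenberg/pg20417.txt");
    boolean scheduled = fs.setReplication(file, (short) 3);
    System.out.println("Replication change scheduled: " + scheduled);
  }
}

Whether the cluster can actually hold three replicas depends on how many datanodes exist; on the single-node setup configured later in this deck, dfs.replication is set to 1.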
BIBLIOGRAPHY
1. Hadoop: The Definitive Guide, Tom White, O'Reilly Media / Yahoo! Press, 2009
2. MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
3. Ranking and Semi-supervised Classification on Large Scale Graphs Using Map-Reduce, Delip Rao, David
Yarowsky, Dept. of Computer Science, Johns Hopkins University
4. Improving MapReduce Performance in Heterogeneous Environments, Matei Zaharia, Andy Konwinski,
Anthony D. Joseph, Randy Katz, Ion Stoica, University of California, Berkeley
5. MapReduce in a Week By Hannah Tang, Albert Wong, Aaron Kimball, Winter 2007
getFamiliarWithHadoop
INSTALLING JAVA
 Update the source list
 sudo apt-get update
 The OpenJDK project is the default version of Java that is provided from a supported Ubuntu repository.
 sudo apt-get install default-jdk
 Or install OpenJDK 8 explicitly
 sudo apt-get install openjdk-8-jdk
 After installation, make a quick check whether the JDK is correctly set up:
 java -version
ADDING A DEDICATED HADOOP SYSTEM USER
 sudo addgroup hadoop
 sudo adduser --ingroup hadoop hduser
CONFIGURING SSH
 user@ubuntu:~$ su - hduser
 hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
[...snipp...]
 From a sudo-capable account, add hduser to the sudo group:
 user@ubuntu:~$ sudo usermod -aG sudo hduser
ENABLE SSH ACCESS TO YOUR LOCAL MACHINE
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
TEST
 hduser@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
 hduser@ubuntu:~$
DISABLING IPV6
To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of
your choice and add the following lines to the end of the file:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
 You can check whether IPv6 is enabled on your machine with the following command:
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
INSTALLING HADOOP
cd /usr/local
sudo tar xzf hadoop-*.tar.gz
sudo mv hadoop-* hadoop
sudo chown -R hduser:hadoop hadoop
SETUP CONFIGURATION FILES
1. ~/.bashrc
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
3. /usr/local/hadoop/etc/hadoop/core-site.xml
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
UPDATE $HOME/.BASHRC
 Append the following to the end of ~/.bashrc, then reload it with source ~/.bashrc (or open a new shell)
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # path to your JDK install
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
SET JAVA_HOME BY MODIFYING HADOOP-ENV.SH FILE
 vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
 export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
CORE-SITE.XML
 sudo mkdir -p /app/hadoop/tmp
 sudo chown hduser:hadoop /app/hadoop/tmp
 vi /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
MAPRED-SITE.XML
 cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
/usr/local/hadoop/etc/hadoop/mapred-site.xml
 vi /usr/local/hadoop/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
CONF/HDFS-SITE.XML
 vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
FORMATTING THE HDFS FILESYSTEM VIA THE NAMENODE
 /usr/local/hadoop/bin/hadoop namenode -format
10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri
Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
STARTING YOUR SINGLE-NODE CLUSTER
 /usr/local/hadoop/bin/start-all.sh
 jps (should list the NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker daemons if startup succeeded)
COPY LOCAL EXAMPLE DATA TO HDFS
 hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg
/user/hduser/gutenberg
 hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser
Found 1 items
drwxr-xr-x - hduser supergroup 0 2010-05-08 17:40 /user/hduser/gutenberg
 hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg
Found 3 items
-rw-r--r-- 3 hduser supergroup 674566 2011-03-10 11:38 /user/hduser/gutenberg/pg20417.txt
-rw-r--r-- 3 hduser supergroup 1573112 2011-03-10 11:38 /user/hduser/gutenberg/pg4300.txt
-rw-r--r-- 3 hduser supergroup 1423801 2011-03-10 11:38 /user/hduser/gutenberg/pg5000.txt
• hduser@ubuntu:/usr/local/hadoop$
RUN THE MAPREDUCE JOB
• hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount
/user/hduser/gutenberg /user/hduser/gutenberg-output
CHECK IF THE RESULT IS SUCCESSFULLY STORED IN
HDFS DIRECTORY
 hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser
• Found 2 items
• drwxr-xr-x - hduser supergroup 0 2010-05-08 17:40 /user/hduser/gutenberg
• drwxr-xr-x - hduser supergroup 0 2010-05-08 17:43 /user/hduser/gutenberg-output
 hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg-output
• Found 2 items
• drwxr-xr-x - hduser supergroup 0 2010-05-08 17:43 /user/hduser/gutenberg-output/_logs
• -rw-r--r-- 1 hduser supergroup 880802 2010-05-08 17:43 /user/hduser/gutenberg-output/part-r-00000
• hduser@ubuntu:/usr/local/hadoop$
RETRIEVE THE JOB RESULT FROM HDFS
• hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000
COPY THE RESULTS TO THE LOCAL FILE SYSTEM
 hduser@ubuntu:/usr/local/hadoop$ mkdir /tmp/gutenberg-output
 hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
 hduser@ubuntu:/usr/local/hadoop$ head /tmp/gutenberg-output/gutenberg-output
"(Lo)cra" 1
"1490 1
"1498," 1
"35" 1
"40," 1
"A 2
"AS-IS". 1
"A_ 1
"Absoluti 1
"Alack! 1
• hduser@ubuntu:/usr/local/hadoop$
HADOOP WEB INTERFACES
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at
these locations:
 http://localhost:50070/ – web UI of the NameNode daemon
 http://localhost:50030/ – web UI of the JobTracker daemon
 http://localhost:50060/ – web UI of the TaskTracker daemon