1. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
Copyright for all other & referenced work is retained by their respective owners.
Introducing Hadoop
Mastering Hadoop Map-reduce for Data Analysis
Shashank Tiwari
blog: shanky.org | twitter: @tshanky
st@treasuryofideas.com
What is Hadoop
HDFS Architecture
Namenode/Datanode, JobTracker/TaskTracker
MapReduce
ZK Namespace
Essential HBase Schema
Multi-dimensional View
A Map/Hash View
{
  "row_key_1" : {
    "name" : { "first_name" : "Jolly", "last_name" : "Goodfellow" },
    "location" : { "zip" : "94301" }
  }
}
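The map/hash view above can be sketched as a nested dictionary: row key, then column family, then column qualifier, then value. This is only an illustration of the shape of the data, not the HBase client API; the helper name `get_cell` is hypothetical, and cell versions (timestamps) are left out for brevity.

```python
# Nested-map sketch of the row shown on the slide:
# row key -> column family -> column qualifier -> value.
table = {
    "row_key_1": {
        "name": {"first_name": "Jolly", "last_name": "Goodfellow"},
        "location": {"zip": "94301"},
    }
}


def get_cell(table, row_key, family, qualifier, default=None):
    """Look up one cell; return `default` if any level is missing.

    Hypothetical helper for illustration only, not an HBase call.
    """
    return table.get(row_key, {}).get(family, {}).get(qualifier, default)


if __name__ == "__main__":
    print(get_cell(table, "row_key_1", "name", "first_name"))
    print(get_cell(table, "row_key_1", "location", "zip"))
```

Missing rows, families, or qualifiers simply return the default, which mirrors the sparse nature of the model: a row only carries the columns that were actually written.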
Architectural View (HBase)
The Persistence Mechanism
The underlying file format
Installing & Setting up Hadoop
• Required software: Java 1.6.x, ssh + sshd
• Download
• Install
• Configure
• single-node
• pseudo-distributed
• cluster
Download
• Source: http://hadoop.apache.org/
• Version:
• 0.20.203.x -- current stable
• 0.20.x -- previous stable
• Includes
• Hadoop Common -- common utilities, HDFS, MapReduce
Install
• Extract: tar zxvf hadoop-0.20.203.0rc1.tar.gz
• Move & Create Symbolic Link
• ln -s hadoop-0.20.203.0 hadoop
• On Windows
• http://developer.yahoo.com/hadoop/tutorial/module3.html
Configure -- single-node
• Edit: conf/hadoop-env.sh
• Set JAVA_HOME
• Default configuration is single-node
• Run bin/hadoop with no arguments to list the command options
• Reference: http://hadoop.apache.org/common/docs/r0.20.203.0/single_node_setup.html
Configure -- pseudo-distributed
• Edit: conf/core-site.xml (set fs.default.name, the HDFS NameNode URI)
• Edit: conf/hdfs-site.xml (set dfs.replication, the HDFS replication factor)
• Edit: conf/mapred-site.xml (set mapred.job.tracker, the MapReduce JobTracker address)
• Enable ssh to localhost (without passphrase)
• Reference: http://hadoop.apache.org/common/docs/r0.20.203.0/single_node_setup.html
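As a rough sketch of what the three edits amount to, the script below generates minimal versions of the three files. The property values (hdfs://localhost:9000, replication 1, localhost:9001) are the ones given in the referenced single-node setup guide, not on the slide itself, so treat them as assumptions to adjust for your machine. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# One property per file, as in the referenced 0.20.203.0 setup guide.
# Host names and ports are assumptions; change them to suit your setup.
SETTINGS = {
    "core-site.xml": {"fs.default.name": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": "1"},
    "mapred-site.xml": {"mapred.job.tracker": "localhost:9001"},
}


def build_config(properties):
    """Return a Hadoop-style <configuration> document as a string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def write_configs(conf_dir, settings=SETTINGS):
    """Write each config file into conf_dir, creating the directory if needed."""
    conf_dir = Path(conf_dir)
    conf_dir.mkdir(parents=True, exist_ok=True)
    for filename, properties in settings.items():
        text = '<?xml version="1.0"?>\n' + build_config(properties) + "\n"
        (conf_dir / filename).write_text(text)


if __name__ == "__main__":
    # Writes to a scratch directory rather than a real Hadoop conf/ directory.
    write_configs("conf-sketch")
```

Writing to a scratch directory first makes it easy to review the generated files before copying them over the ones in conf/.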
Start Hadoop
• Format HDFS: bin/hadoop namenode -format
• Start all daemons: bin/start-all.sh
• Verify logs
• Browse the web interface:
• Namenode: http://localhost:50070/
• JobTracker: http://localhost:50030/
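Before opening a browser, a quick reachability check on the two web interface ports can confirm that the daemons came up. This is a minimal sketch that only tests whether the TCP ports accept connections; it says nothing about daemon health beyond that. The function names are hypothetical; the ports are the ones listed above.

```python
import socket

# Web UI ports from the slide: NameNode on 50070, JobTracker on 50030.
WEB_UIS = {"NameNode": 50070, "JobTracker": 50030}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False


def check_web_uis(host="localhost", ports=WEB_UIS):
    """Map each daemon name to whether its web UI port accepts connections."""
    return {name: port_is_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, up in check_web_uis().items():
        print(f"{name}: {'reachable' if up else 'not reachable'}")
```

If a port is not reachable, the daemon logs under the Hadoop logs directory are the next place to look.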
Take Hadoop for a test-drive
• Run examples (hadoop-examples-0.20.203.0.jar)
• Grep using regular expressions
• Copy files to HDFS: bin/hadoop fs -put bin input
• Grep the input files for text beginning with ‘start’
• Verify output on HDFS: bin/hadoop fs -cat output/*
• Copy output to local filesystem & verify: bin/hadoop fs -get output output && cat output/*
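To see what the grep example computes without a cluster, the sketch below imitates it in plain Python: the map step emits (matched text, 1) for each regular-expression match, and the reduce step sums the counts per matched string. This is a local simulation under that assumption about the job's behaviour, not the code in the examples jar; the function names and the sample pattern are hypothetical.

```python
import re
from collections import defaultdict


def map_phase(lines, pattern):
    """Map: emit (matched_text, 1) for every regex match in every line."""
    regex = re.compile(pattern)
    for line in lines:
        for match in regex.finditer(line):
            yield match.group(0), 1


def reduce_phase(pairs):
    """Reduce: sum the emitted counts for each distinct matched string."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def local_grep(lines, pattern):
    """Run map then reduce; return (match, count) sorted by count, then text."""
    counts = reduce_phase(map_phase(lines, pattern))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    sample = ["start-all.sh", "starts here", "start"]
    for text, count in local_grep(sample, r"start[a-z]*"):
        print(count, text)
```

On a real cluster the map calls run in parallel across input splits and the framework groups the pairs by key before the reduce step; here a single dictionary stands in for that grouping.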
Configure -- cluster
• References:
• http://hadoop.apache.org/common/docs/r0.20.203.0/cluster_setup.html (official documentation)
• http://developer.yahoo.com/hadoop/tutorial/module7.html (Managing a Hadoop Cluster. Source: YDN)
• http://wiki.datameer.com/display/DAS1/Hadoop+Cluster+Configuration+Tips
Questions?
• blog: shanky.org | twitter: @tshanky
• st@treasuryofideas.com