Hadoop

陳柏翰
CS13 http://about.me/sihalon
Computer System Administration 2011

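Only the sky is above;
no other mountain stands as high.
Raise your head and the red sun is near;
look back and the white clouds lie low.

Kou Zhun, Song dynasty (Mount Hua)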

Outline
 Existing cloud services
 The concepts behind Hadoop
 Hadoop single-node installation
 A simple example

What is the cloud?
 Gmail
 YouTube
 Google Docs
…

Simply put:

any application service you can use over the Internet

Existing cloud computing services
• Windows
• Google
• Amazon
• Yahoo
• Plurk
• ……

What is behind them?

Hadoop
Hadoop is a software platform that lets one easily write and run
applications that process vast amounts of data

What is Hadoop?

 An open-source cloud platform (framework)
 A solution for computing over massive data
 Stable and scalable

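Yahoo: Hadoop
 An Apache project, funded, developed, and used by Yahoo
 2006: Yahoo began contributing to Hadoop.
 2008: 2,000 servers.
– Running more than 10,000 Hadoop virtual machines
– 5 petabytes of web content
– Analyzing 1 trillion web links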

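Features
• Massive
– Able to store and process large amounts of data

• Economical
– Runs on clusters built from ordinary PCs

• Efficient
– Processes distributed files in parallel for fast response

• Reliable
– When a node fails, the system automatically and promptly
retrieves backup data and redeploys computing resources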

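Architecture
 HDFS
- The file system of the Hadoop project

 MapReduce
- Parallel processing of data sets at petabyte scale and above

 HBase
- A database system for massive data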

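Divide and Conquer
 Algorithm: Divide and Conquer
 Split a problem into smaller parts, solve each part, and
combine the results

 As a software architecture for programs, it is well suited to
computation over large-scale data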
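
As a small illustration of the idea, here is a minimal Python sketch that sums a list by divide and conquer. The function name `dc_sum` is made up for this example and is not part of Hadoop; it only shows the split, solve, and combine steps.

```python
def dc_sum(values):
    """Sum a list by divide and conquer (illustrative only)."""
    # Base case: a list of zero or one element is solved directly.
    if len(values) <= 1:
        return values[0] if values else 0
    # Divide: split the problem into two halves.
    mid = len(values) // 2
    # Conquer: solve each half independently. On a cluster these
    # two calls could run on different machines.
    left = dc_sum(values[:mid])
    right = dc_sum(values[mid:])
    # Combine: merge the partial results.
    return left + right


if __name__ == "__main__":
    print(dc_sum(list(range(1, 101))))
```

Because the two halves do not depend on each other, the same pattern scales out when each half is handed to a separate worker.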

Divide and Conquer

Example 1: finding an area with the grid method
Example 2: covering a board with L-shaped tiles

Divide and Conquer
Input: I am a tiger, you are also a tiger

map 1: I,1  am,1  a,1
map 2: tiger,1  you,1  are,1
map 3: also,1  a,1  tiger,1

reduce 1: a,1  a,1  also,1  am,1  are,1  ->  a,2  also,1  am,1  are,1
reduce 2: I,1  tiger,1  tiger,1  you,1  ->  I,1  tiger,2  you,1

Output: a,2  also,1  am,1  are,1  I,1  tiger,2  you,1
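
The word-count flow on this slide can be imitated in a few lines of plain Python. This is only a single-process sketch of the map, shuffle, and reduce steps, not Hadoop code; the function names and the three-way split of the sentence are assumptions made for the illustration.

```python
import re
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word, 1) for word in re.findall(r"[A-Za-z]+", chunk)]


def shuffle(mapped_pairs):
    """Shuffle: group the emitted counts by word."""
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(chunks):
    mapped = []
    for chunk in chunks:  # each chunk stands in for one map task
        mapped.extend(map_phase(chunk))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical split of the slide's sentence into three map inputs.
    chunks = ["I am a", "tiger, you are", "also a tiger"]
    counts = word_count(chunks)
    for word in sorted(counts, key=str.lower):
        print(f"{word},{counts[word]}")
```

In a real cluster the map tasks and reduce tasks run on different nodes and the shuffle moves data over the network; here everything happens in one process.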

Building Hadoop
Namenode

JobTracker

Data Task Data Task Data Task

Java Java Java

Linux Linux Linux

Node1 Node2 Node3

Let's fly up to the cloud together

- Demo Time

Supported Platforms
 GNU/Linux is supported as a
development and production platform.
Hadoop has been demonstrated on
GNU/Linux clusters with 2000 nodes.
 Win32 is supported as a development
platform. Distributed operation has not
been well tested on Win32, so it is not
supported as a production platform.

Environment
 Ubuntu Linux 10.04 LTS
 Hadoop 0.20.2
- released in February 2010

Required Software
 Java™ 1.6.x, preferably
from Sun, must be installed.

 ssh must be installed and
sshd must be running to
use the Hadoop scripts
that manage remote
Hadoop daemons.

Sun Java 6
1. Add repository to your apt repositories:
2. Update the source list

 $ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
 $ sudo apt-get update

Sun Java 6
3. Install sun-java6-jdk
4. Select Sun’s Java as the default on your
machine.

 $ sudo apt-get install sun-java6-jdk
 $ sudo update-java-alternatives -s java-6-sun

Sun Java 6
5. Check that the installation succeeded:

 $ java -version

Configuring SSH
( You can find the ssh software in Software Center by searching for "ssh" )

Configuring SSH
1. Generate an SSH key for the current user.
2. Enable SSH access to your local machine
with this newly created key.

 $ ssh-keygen -t rsa -P ""
 $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
( cat test1.txt >> test2.txt appends the contents of test1.txt to test2.txt )

Configuring SSH
3. Test by connecting to your local machine
( You should install ssh first )

 $ ssh localhost

Disabling IPv6
 $ sudo joe /etc/sysctl.conf
 #disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

 $ reboot

Disabling IPv6
Check whether IPv6 is enabled on your machine
( 0 means enabled, 1 means disabled )

 $ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

Hadoop Installation
Download Hadoop from the Apache Mirrors
http://www.apache.org/dyn/closer.cgi/hadoop/core

 $ cd /home/csa
 $ wget http://apache.ntu.edu.tw/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz

Hadoop Installation
 $ sudo tar xzf hadoop-0.20.2.tar.gz
 $ sudo mv hadoop-0.20.2 hadoop

Hadoop Package Topology
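 bin / Executables, e.g. start-all.sh, stop-all.sh, hadoop
 conf / Default configuration directory: environment variables and
the worker node list (slaves).
 docs / Hadoop API and documentation.
 contrib / Extra useful packages, e.g. the Eclipse plugin.
 lib / All libraries needed to develop Hadoop projects or compile
Hadoop programs, e.g. jetty, kfs.
 src / Hadoop source code.
 build / Output directory for Hadoop builds.
 logs / Default log directory. (The path can be changed.)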

Update .bashrc for each user who wants to use Hadoop
 $ sudo joe /home/csa/.bashrc

 # Set Hadoop-related environment variables
export HADOOP_HOME=/home/csa/hadoop
 # Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin

Configuration
Set JAVA_HOME to the Sun JDK/JRE 6 directory

 $ joe /home/csa/hadoop/conf/hadoop-env.sh

 # The java implementation to use. Required.
 export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.24

Configuration
 In file conf/core-site.xml

<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system. </description>
</property>

 In file conf/mapred-site.xml

<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description> For MapReduce job tracker </description>
</property>

 In file conf/hdfs-site.xml

<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of
replications can be specified when the file is created. The default is used
if replication is not specified in create time. </description>
</property>
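
Each `<property>` entry sits inside the `<configuration>` root element of its site file. As a quick sanity check before starting the cluster, a short Python sketch like the one below can parse a site file and list its settings. The helper name `read_properties` and the inline sample text are assumptions for this example; in practice you would read the real file from your conf directory.

```python
import xml.etree.ElementTree as ET

# Sample contents of a site file such as conf/core-site.xml.
# The <property> values are the ones used in these slides.
CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return {name: value} for every <property> under <configuration>."""
    root = ET.fromstring(xml_text)
    if root.tag != "configuration":
        raise ValueError("root element must be <configuration>")
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is None or value is None:
            raise ValueError("each <property> needs a <name> and a <value>")
        props[name.strip()] = value.strip()
    return props


if __name__ == "__main__":
    for name, value in read_properties(CORE_SITE).items():
        print(f"{name} = {value}")
```

A malformed file (for example an unclosed tag) makes `ET.fromstring` raise a parse error, which is easier to spot here than in the daemon logs.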

Formatting the name node
 $ /home/csa/hadoop/bin/hadoop namenode -format

Starting your single-node cluster

 $ /home/csa/hadoop/bin/start-all.sh
 $ jps

Jps
JobTracker
TaskTracker
NameNode
DataNode

Congratulations!
 You just set up a single-node cluster

Hadoop Web Interfaces
 http://localhost:50030/
– web UI for MapReduce job tracker(s)
 http://localhost:50060/
– web UI for task tracker(s)
 http://localhost:50070/
– web UI for HDFS name node(s)

Common commands
 Commands for operating on the Hadoop file system
 $ bin/hadoop fs -<command> …

MapReduce Demo
 WordCount

Why WordCount?
 Google
 Facebook

References
Thanks for …
 NCHC Cloud Computing Research
Group ( Link here ! )

Thanks for listening
