2. → What's the “Need”? ←
❏ Big data Ocean
❏ Expensive hardware
❏ Frequent Failures and Difficult recovery
❏ Scaling up with more machines
3. → Hadoop ←
Open source software
- a Java framework
- initial release: December 10, 2011
It provides both:
Storage → [HDFS]
Processing → [MapReduce]
HDFS: Hadoop Distributed File System
4. → How Hadoop addresses the need? ←
Big data Ocean
Have multiple machines. Each will store some portion of data, not the entire data.
Expensive hardware
Use commodity hardware. Simple and cheap.
Frequent Failures and Difficult recovery
Have multiple copies of data. Have the copies in different machines.
Scaling up with more machines
If more processing is needed, add new machines on the fly
5. → HDFS ←
Runs on Commodity hardware: Doesn't require expensive machines
Large Files; Write-once, Read-many (WORM)
Files are split into blocks
Actual blocks go to DataNodes
The metadata is stored at NameNode
Replicate blocks to different nodes
Default configuration:
Block size = 128MB
Replication Factor = 3
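The arithmetic behind those defaults can be sketched in a few lines. This is an illustrative model, not HDFS code; it assumes the stated defaults of a 128 MB block size and a replication factor of 3.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # dfs.blocksize default (128 MB)
REPLICATION = 3                  # dfs.replication default

def hdfs_blocks(file_size_bytes):
    """Number of block-sized chunks a file occupies (last block may be partial)."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

def raw_storage(file_size_bytes):
    """Total raw bytes consumed across DataNodes, counting every replica."""
    return file_size_bytes * REPLICATION

one_gb = 1024 ** 3
print(hdfs_blocks(one_gb))   # a 1 GB file splits into 8 blocks
print(raw_storage(one_gb))   # and consumes 3 GB of raw disk across the cluster
```

Note that a file smaller than a block does not occupy a full block on disk; only the block *slot* in metadata is whole.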
9. → Where NOT TO use Hadoop/HDFS ←
Low latency data access
HDFS is optimized for high throughput of data at the expense of latency.
Large number of small files
Namenode has the entire file-system metadata in memory.
Too much metadata as compared to actual data.
Multiple writers / Arbitrary file modifications
A file has a single writer at a time
Writes can only append to the end of a file
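The small-files problem above can be made concrete with a back-of-the-envelope model. The figure of roughly 150 bytes of NameNode heap per namespace object (file or block) is a commonly cited rule of thumb, not an exact number.

```python
# Assumption: ~150 bytes of NameNode heap per namespace object (rule of thumb).
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file=1):
    # each file contributes one file object plus one object per block
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# ~10 TB stored as 10 million small files (1 block each)...
small_files = namenode_heap_bytes(10_000_000)          # 3,000,000,000 bytes (~3 GB of heap)
# ...vs. roughly the same data in 100 large files of ~782 blocks each
large_files = namenode_heap_bytes(100, blocks_per_file=782)  # 11,745,000 bytes (~12 MB)
print(small_files, large_files)
```

Same data volume, hundreds of times more metadata: this is why HDFS prefers a small number of large files.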
11. → NameNode & DataNodes ←
❏NameNode:
Centerpiece of HDFS: The Master
Only stores the block metadata: block name, block location, etc.
Critical component; When down, whole cluster is considered down; Single point of failure
Should be configured with higher RAM
❏DataNode:
Stores the actual data: The Slave
In constant communication with NameNode
When down, it does not affect the availability of data/cluster
13. → JobTracker & TaskTrackers ←
❏JobTracker:
Talks to the NameNode to determine location of the data
Monitors all TaskTrackers and submits status of the job back to the client
When down, HDFS is still functional; no new MR job; existing jobs halted
Replaced by ResourceManager/ApplicationMaster in MRv2
❏TaskTracker:
Runs on all DataNodes
TaskTracker communicates with JobTracker signaling the task progress
TaskTracker failure is not considered fatal
14. → ResourceManager & NodeManager ←
❏Present in Hadoop v2.0
❏Equivalent of JobTracker & TaskTracker in v1.0
❏ResourceManager (RM):
Usually runs on the NameNode machine; distributes cluster resources among applications.
Two main components: Scheduler and ApplicationsManager
❏NodeManager (NM):
Per-node framework agent
Responsible for containers
Monitors their resource usage
18. → Interacting with HDFS ←
Command prompt:
Similar to Linux terminal commands
Modeled on Unix shell commands; HDFS relaxes some POSIX requirements
Web Interface:
Similar to browsing an FTP site on the web
20. → Notes ←
File Paths on HDFS:
hdfs://<namenode>:<port>/path/to/file.txt
hdfs://127.0.0.1:8020/user/USERNAME/demo/data/file.txt
hdfs://localhost:8020/user/USERNAME/demo/data/file.txt
/user/USERNAME/demo/file.txt
demo/file.txt
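The last two path forms above are relative: HDFS resolves them against the user's home directory, `/user/<username>`. A minimal sketch of that resolution rule (USERNAME is a placeholder, and this is illustrative, not the HDFS client implementation):

```python
import posixpath

def resolve_hdfs_path(path, user="USERNAME"):
    """Absolute paths pass through; relative paths resolve against /user/<user>."""
    home = posixpath.join("/user", user)
    return path if path.startswith("/") else posixpath.join(home, path)

print(resolve_hdfs_path("demo/file.txt"))  # /user/USERNAME/demo/file.txt
print(resolve_hdfs_path("/tmp/a.txt"))     # /tmp/a.txt
```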
File System:
Local: local file system (linux)
HDFS: hadoop file system
27. 3. Create a file on local & put it on HDFS
Syntax:
vi filename.txt
hdfs dfs -put [options] <local-file-path> <hdfs-dir-path>
Example:
vi file-copy-to-hdfs.txt
hdfs dfs -put file-copy-to-hdfs.txt /user/USERNAME/demo/put-example/
28. 4. Get a file from HDFS to local
Syntax:
hdfs dfs -get <hdfs-file-path> [local-dir-path]
Example:
hdfs dfs -get /user/USERNAME/demo/get-example/file-copy-from-hdfs.txt ~/demo/
29. 5. Copy From LOCAL To HDFS
Syntax:
hdfs dfs -copyFromLocal <local-file-path> <hdfs-file-path>
Example:
hdfs dfs -copyFromLocal file-copy-to-hdfs.txt /user/USERNAME/demo/copyFromLocal-example/
30. 6. Copy To LOCAL From HDFS
Syntax:
hdfs dfs -copyToLocal <hdfs-file-path> <local-file-path>
Example:
hdfs dfs -copyToLocal /user/USERNAME/demo/copyToLocal-example/file-copy-from-hdfs.txt ~/demo/
31. 7. Move a file from local to HDFS
Syntax:
hdfs dfs -moveFromLocal <local-file-path> <hdfs-dir-path>
Example:
hdfs dfs -moveFromLocal /path/to/file.txt /user/USERNAME/demo/moveFromLocal-example/
32. 8. Copy a file within HDFS
Syntax:
hdfs dfs -cp <hdfs-source-file-path> <hdfs-dest-file-path>
Example:
hdfs dfs -cp /user/USERNAME/demo/copy-within-hdfs/file-copy.txt /user/USERNAME/demo/data/
33. 9. Move a file within HDFS
Syntax:
hdfs dfs -mv <hdfs-source-file-path> <hdfs-dest-file-path>
Example:
hdfs dfs -mv /user/USERNAME/demo/move-within-hdfs/file-move.txt /user/USERNAME/demo/data/
41. 17. Get FileSystem Statistics
Syntax:
hdfs dfs -stat [format] <hdfs-file-path>
Format Options:
%b - file size in blocks, %g - group name of owner
%n - filename, %o - block size
%r - replication, %u - user name of owner
%y - modification date
59. Copy data from one cluster to another
Description:
Copy data between hadoop clusters
Syntax:
hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
hadoop distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs://nn2:8020/bar/foo
hadoop distcp -f hdfs://nn1:8020/srclist.file hdfs://nn2:8020/bar/foo
Where srclist.file contains
hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b
Editor's Notes
Commodity Hardware:
-affordable and easy to obtain
-capable of running Windows, Linux, or MS-DOS without requiring any special devices or equipment
-broadly compatible and can function on a plug and play basis
-low-end but functional product without distinctive features
BLOCK:
A physical storage disk has a block size: the minimum amount of data it can read or write, normally 512 bytes.
File systems for a single disk also deal with data in blocks, normally a few kilobytes (e.g. 4 KB).
Hadoop uses a much larger block size: 64 MB by default in Hadoop 1.x, 128 MB in 2.x.
Files in HDFS are broken down into block sized chunks and are stored as independent units.
However, files smaller than a block size do not occupy the entire block.
Why such large blocks? To keep seek time small relative to transfer time, and to limit the amount of block metadata the NameNode must hold.
> NameNode does not store the actual data or the dataset. The data itself is actually stored in the DataNodes.
> NameNode knows the list of the blocks and its location for any given file in HDFS.
> With this information NameNode knows how to construct the file from blocks.
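The bookkeeping described above can be pictured as two lookup tables: file name to ordered block list, and block to replica locations. A toy model (all names here are illustrative, not real HDFS APIs):

```python
# NameNode-style metadata: which blocks make up a file, in order...
block_map = {"/data/file.txt": ["blk_1", "blk_2"]}
# ...and which DataNodes hold a replica of each block.
block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}

def locate(path):
    """Per block (in file order), the DataNodes a client could read it from."""
    return [(blk, block_locations[blk]) for blk in block_map[path]]

print(locate("/data/file.txt"))
```

A client reconstructs the file by fetching each block in order from any one of its replica holders.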
JobTracker finds the best TaskTracker nodes to execute tasks based on:
-data locality
-available slots to execute a task on a given node
HDFS High Availability
Namenode metadata is written to a shared storage (Journal Manager)
Only one active NN can write to shared storage
Passive NNs read & replay metadata from shared storage
When active NN fails, one of the passive NNs is promoted to active
Snapshot:
Able to store a checkpointed state of HDFS
Improved performance:
Multithreaded random read
HDFS v1: 264MB/sec
HDFS v2: 1395MB/sec (about 5X !!)
Federation
Namenode stores metadata in memory
With a very large number of files, the NameNode could exhaust its memory
Spread metadata over multiple namenodes
Details about HDFS Federation.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|resourcemanager:port> specify a ResourceManager
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
Everything you need to know about hdfs commands:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
Description:
List the contents that match the specified pattern.
If path is not specified, the contents of /user/<current_user> are listed
Options:
-d Directories are listed as plain files.
-h Formats the sizes of files in a human-readable fashion, rather than a number of bytes.
-R Recursively list the contents of directories.
Output:
(<permissions> <-/#replicas> <userid> <groupid> <size(in bytes)> <modification_date> <directoryName/fileName>)
Description:
Create a directory in specified location.
Options:
-p : Create directories in the specified path, if they do not exist
Description:
Copy files into fs.
Options:
-f : If the file already exists, copying does not fail & the destination is overwritten.
-p : Preserves access time, modification time, ownership and the mode.
Description:
Copy files from hdfs. When copying multiple files, the destination must be a directory.
Options:
-p : Preserves access time, modification time, ownership and the mode.
-ignorecrc : Files that fail the CRC check may be copied with this option.
-crc : Files and CRCs may be copied using this option.
Description:
Copy files from local filesystem into hdfs.
It is the same as the “put” command but specific to the local filesystem
Options:
-f : If the file already exists, copying does not fail & the destination is overwritten.
-p : Preserves access time, modification time, ownership and the mode.
Description:
Copy files from hdfs to local filesystem. When copying multiple files, the destination must be a directory.
It is the same as the “get” command but specific to the local filesystem
Options:
-p : Preserves access time, modification time, ownership and the mode.
-ignorecrc : Files that fail the CRC check may be copied with this option.
-crc : Files and CRCs may be copied using this option.
Description:
Same as -copyFromLocal, except that the source is deleted after it's copied.
Description:
Copy files from hdfs to the same hdfs. File pattern can be specified. When copying multiple files, the destination must be a directory.
Options:
-f : If the file already exists, copying does not fail & the destination is overwritten.
-p : Preserves access time, modification time, ownership and the mode.
Description:
Move files from hdfs to the same hdfs. File pattern can be specified. When moving multiple files, the destination must be a directory.
Description:
Get all the files in the directories that match the source file pattern and
merge and sort them to only one file on local fs. <src> is kept.
Options:
-nl Add a newline character at the end of each file.
-cat [-ignoreCrc] <src> ... :
Fetch all files that match the file pattern <src> and display their content on
stdout.
-tail [-f] <file> :
Show the last 1KB of the file.
-f Shows appended data as the file grows.
-text [-ignoreCrc] <src> ... :
Takes a source file and outputs the file in text format.
The allowed formats are zip, TextRecordInputStream, and Avro.
Description:
Delete all files that match the specified file pattern. Equivalent to the Unix command "rm <src>"
Options:
-skipTrash option bypasses trash, if enabled, and immediately deletes <src>
-f If the file does not exist, do not display a diagnostic message or
modify the exit status to reflect an error.
-[rR] Recursively deletes directories
-chgrp [-R] GROUP PATH... :
This is equivalent to -chown ... :GROUP ...
-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH... :
Changes permissions of a file. This works similar to the shell's chmod command with a few exceptions.
-R modifies the files recursively. This is the only option currently supported.
<MODE> Mode is the same as mode used for the shell's command.
-chown [-R] [OWNER][:[GROUP]] PATH... :
Changes owner and group of a file. This is similar to the shell's chown command with a few exceptions.
-R modifies the files recursively. This is the only option currently supported.
-du [-s] [-h] <path> ... :
Show the amount of space, in bytes, used by the files that match the specified file pattern. The following flags are optional:
-s Rather than showing the size of each individual file that matches the pattern, shows the total (summary) size.
-h Formats the sizes of files in a human-readable fashion rather than a number of bytes.
Note that, even without the -s option, this only shows size summaries one level deep into a directory.
The output is in the form
size disk space consumed name(full path)
-touchz <path> ... :
Creates a file of zero length at <path> with current time as the timestamp of
that <path>. An error is returned if the file exists with non-zero length
-test -[defsz] <path> :
Options:
-d : if the path is a directory, return 0.
-e : if the path exists, return 0.
-f : if the path is a file, return 0.
-s : if the path is not empty, return 0.
-z : if the file is zero length, return 0.
-stat [format] <path> ... :
Print statistics about the file/directory at <path> in the specified format.
Format accepts filesize in blocks (%b), group name of owner(%g), filename (%n),
block size (%o), replication (%r), user name of owner(%u), modification date
(%y, %Y)
-count [-q] [-h] [-v] <path> ... :
Count the number of directories, files and bytes under the paths that match the specified file pattern.
The -h option shows file sizes in human readable format.
The -v option displays a header line.
The output columns are:
DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
or, with the -q option:
QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
-setrep [-R] [-w] <rep> <path> ... :
Set the replication level of a file. If <path> is a directory then the command
recursively changes the replication factor of all files under the directory tree
rooted at <path>.
-w It requests that the command waits for the replication to complete. This
can potentially take a very long time.
-R It is accepted for backwards compatibility. It has no effect.
The block size specified by dfs.blocksize should be a multiple of 512
-copyFromLocal error if block size is not valid:
Invalid values: dfs.bytes-per-checksum (=512) must divide block size (=104857601).
-expunge :
Empty the Trash
To enable HDFS trash, set fs.trash.interval > 0 in core-site.xml
Deleted data goes in hdfs folder at : /user/<username>/.Trash/
Options (hdfs fsck <path>):
-move move corrupted files to /lost+found
-delete delete corrupted files
-files print out files being checked
-openforwrite print out files opened for write
-includeSnapshots include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it
-list-corruptfileblocks print out list of missing blocks and files they belong to
-blocks print out block report
-locations print out locations for every block
-racks print out network topology for data-node locations
Command:
hdfs dfsadmin -help report
Description:
Reports basic filesystem information and statistics.
The dfs usage can differ from "du" usage because it
measures raw space used by replication, checksums,
snapshots, etc. on all the DataNodes.
Optional flags may be used to filter the list of displayed DNs.
Options:
-report [-live] [-dead] [-decommissioning]
hdfs getconf
[-namenodes] gets list of namenodes in the cluster.
[-secondaryNameNodes] gets list of secondary namenodes in the cluster.
[-backupNodes] gets list of backup nodes in the cluster.
[-includeFile] gets the include file path that defines the datanodes that can join the cluster.
[-excludeFile] gets the exclude file path that defines the datanodes that need to be decommissioned.
[-nnRpcAddresses] gets the namenode rpc addresses
[-confKey [key]] gets a specific key from the configuration