2. → What's the “Need”? ←
❏ Big data Ocean
❏ Expensive hardware
❏ Frequent Failures and Difficult recovery
❏ Scaling up with more machines
3. → Hadoop ←
Open source software
- a Java framework
- initial release: December 10, 2011
It provides both:
Storage → [HDFS]
Processing → [MapReduce]
HDFS: Hadoop Distributed File System
4. → How Hadoop addresses the need? ←
Big data Ocean
Have multiple machines. Each will store some portion of data, not the entire data.
Expensive hardware
Use commodity hardware. Simple and cheap.
Frequent Failures and Difficult recovery
Have multiple copies of data. Have the copies in different machines.
Scaling up with more machines
If more processing is needed, add new machines on the fly
5. → HDFS ←
Runs on Commodity hardware: Doesn't require expensive machines
Large Files; Write-once, Read-many (WORM)
Files are split into blocks
Actual blocks go to DataNodes
The metadata is stored at NameNode
Replicate blocks to different nodes
Default configuration:
Block size = 128MB
Replication Factor = 3
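The arithmetic behind those defaults can be sketched in a few lines. This is an illustrative model, not HDFS code; it assumes the stated defaults of a 128 MB block size and a replication factor of 3.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # dfs.blocksize default (128 MB)
REPLICATION = 3                  # dfs.replication default

def hdfs_blocks(file_size_bytes):
    """Number of block-sized chunks a file occupies (last block may be partial)."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

def raw_storage(file_size_bytes):
    """Total raw bytes consumed across DataNodes, counting every replica."""
    return file_size_bytes * REPLICATION

one_gb = 1024 ** 3
print(hdfs_blocks(one_gb))   # a 1 GB file splits into 8 blocks
print(raw_storage(one_gb))   # and consumes 3 GB of raw disk across the cluster
```

Note that a file smaller than a block does not occupy a full block on disk; only the block *slot* in metadata is whole.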
9. → Where NOT TO use Hadoop/HDFS ←
Low latency data access
HDFS is optimized for high throughput of data at the expense of latency.
Large number of small files
Namenode has the entire file-system metadata in memory.
Too much metadata as compared to actual data.
Multiple writers / Arbitrary file modifications
A file has a single writer at a time
Writes can only append to the end of a file
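The small-files problem above can be made concrete with a back-of-the-envelope model. The figure of roughly 150 bytes of NameNode heap per namespace object (file or block) is a commonly cited rule of thumb, not an exact number.

```python
# Assumption: ~150 bytes of NameNode heap per namespace object (rule of thumb).
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file=1):
    # each file contributes one file object plus one object per block
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# ~10 TB stored as 10 million small files (1 block each)...
small_files = namenode_heap_bytes(10_000_000)          # 3,000,000,000 bytes (~3 GB of heap)
# ...vs. roughly the same data in 100 large files of ~782 blocks each
large_files = namenode_heap_bytes(100, blocks_per_file=782)  # 11,745,000 bytes (~12 MB)
print(small_files, large_files)
```

Same data volume, hundreds of times more metadata: this is why HDFS prefers a small number of large files.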
11. → NameNode & DataNodes ←
❏NameNode:
Centerpiece of HDFS: The Master
Only stores the block metadata: block name, block location, etc.
Critical component; When down, whole cluster is considered down; Single point of failure
Should be configured with higher RAM
❏DataNode:
Stores the actual data: The Slave
In constant communication with NameNode
When down, it does not affect the availability of data/cluster
13. → JobTracker & TaskTrackers ←
❏JobTracker:
Talks to the NameNode to determine location of the data
Monitors all TaskTrackers and submits status of the job back to the client
When down, HDFS is still functional; no new MR job; existing jobs halted
Replaced by ResourceManager/ApplicationMaster in MRv2
❏TaskTracker:
Runs on all DataNodes
TaskTracker communicates with JobTracker signaling the task progress
TaskTracker failure is not considered fatal
14. → ResourceManager & NodeManager ←
❏Present in Hadoop v2.0
❏Equivalent of JobTracker & TaskTracker in v1.0
❏ResourceManager (RM):
Usually runs on the NameNode machine; distributes cluster resources among applications.
Two main components: Scheduler and ApplicationsManager
❏NodeManager (NM):
Per-node framework agent
Responsible for containers
Monitors their resource usage
18. → Interacting with HDFS ←
Command prompt:
Similar to Linux terminal commands
Modeled on Unix shell commands; HDFS relaxes some POSIX requirements
Web Interface:
Similar to browsing an FTP site on the web
20. → Notes ←
File Paths on HDFS:
hdfs://<namenode>:<port>/path/to/file.txt
hdfs://127.0.0.1:8020/user/USERNAME/demo/data/file.txt
hdfs://localhost:8020/user/USERNAME/demo/data/file.txt
/user/USERNAME/demo/file.txt
demo/file.txt
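The last two path forms above are relative: HDFS resolves them against the user's home directory, `/user/<username>`. A minimal sketch of that resolution rule (USERNAME is a placeholder, and this is illustrative, not the HDFS client implementation):

```python
import posixpath

def resolve_hdfs_path(path, user="USERNAME"):
    """Absolute paths pass through; relative paths resolve against /user/<user>."""
    home = posixpath.join("/user", user)
    return path if path.startswith("/") else posixpath.join(home, path)

print(resolve_hdfs_path("demo/file.txt"))  # /user/USERNAME/demo/file.txt
print(resolve_hdfs_path("/tmp/a.txt"))     # /tmp/a.txt
```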
File System:
Local: local file system (linux)
HDFS: hadoop file system
27. 3. Create a file on local & put it on HDFS
Syntax:
vi filename.txt
hdfs dfs -put [options] <local-file-path> <hdfs-dir-path>
Example:
vi file-copy-to-hdfs.txt
hdfs dfs -put file-copy-to-hdfs.txt /user/USERNAME/demo/put-example/
28. 4. Get a file from HDFS to local
Syntax:
hdfs dfs -get <hdfs-file-path> [local-dir-path]
Example:
hdfs dfs -get /user/USERNAME/demo/get-example/file-copy-from-hdfs.txt ~/demo/
29. 5. Copy From LOCAL To HDFS
Syntax:
hdfs dfs -copyFromLocal <local-file-path> <hdfs-file-path>
Example:
hdfs dfs -copyFromLocal file-copy-to-hdfs.txt /user/USERNAME/demo/copyFromLocal-example/
30. 6. Copy To LOCAL From HDFS
Syntax:
hdfs dfs -copyToLocal <hdfs-file-path> <local-file-path>
Example:
hdfs dfs -copyToLocal /user/USERNAME/demo/copyToLocal-example/file-copy-from-hdfs.txt ~/demo/
31. 7. Move a file from local to HDFS
Syntax:
hdfs dfs -moveFromLocal <local-file-path> <hdfs-dir-path>
Example:
hdfs dfs -moveFromLocal /path/to/file.txt /user/USERNAME/demo/moveFromLocal-example/
32. 8. Copy a file within HDFS
Syntax:
hdfs dfs -cp <hdfs-source-file-path> <hdfs-dest-file-path>
Example:
hdfs dfs -cp /user/USERNAME/demo/copy-within-hdfs/file-copy.txt /user/USERNAME/demo/data/
33. 9. Move a file within HDFS
Syntax:
hdfs dfs -mv <hdfs-source-file-path> <hdfs-dest-file-path>
Example:
hdfs dfs -mv /user/USERNAME/demo/move-within-hdfs/file-move.txt /user/USERNAME/demo/data/
41. 17. Get FileSystem Statistics
Syntax:
hdfs dfs -stat [format] <hdfs-file-path>
Format Options:
%b - file size in blocks, %g - group name of owner
%n - filename, %o - block size
%r - replication, %u - user name of owner
%y - modification date
59. Copy data from one cluster to another
Description:
Copy data between hadoop clusters
Syntax:
hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
hadoop distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs://nn2:8020/bar/foo
hadoop distcp -f hdfs://nn1:8020/srclist.file hdfs://nn2:8020/bar/foo
Where srclist.file contains
hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b
Editor's Notes
Commodity Hardware:
-affordable and easy to obtain
-capable of running Windows, Linux, or MS-DOS without requiring any special devices or equipment
-broadly compatible and can function on a plug and play basis
-low-end but functional product without distinctive features
BLOCK:
A physical storage disk has a block size: the minimum amount of data it can read or write, normally 512 bytes.
File systems for a single disk also deal with data in blocks, normally a few kilobytes (e.g. 4 KB).
Hadoop uses a much larger block size: 64 MB by default in Hadoop 1.x, 128 MB in 2.x.
Files in HDFS are broken down into block sized chunks and are stored as independent units.
However, files smaller than a block size do not occupy the entire block.
Why such large blocks? To keep seek time small relative to transfer time, and to limit the amount of block metadata the NameNode must hold.
> NameNode does not store the actual data or the dataset. The data itself is actually stored in the DataNodes.
> NameNode knows the list of the blocks and its location for any given file in HDFS.
> With this information NameNode knows how to construct the file from blocks.
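The bookkeeping described above can be pictured as two lookup tables: file name to ordered block list, and block to replica locations. A toy model (all names here are illustrative, not real HDFS APIs):

```python
# NameNode-style metadata: which blocks make up a file, in order...
block_map = {"/data/file.txt": ["blk_1", "blk_2"]}
# ...and which DataNodes hold a replica of each block.
block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}

def locate(path):
    """Per block (in file order), the DataNodes a client could read it from."""
    return [(blk, block_locations[blk]) for blk in block_map[path]]

print(locate("/data/file.txt"))
```

A client reconstructs the file by fetching each block in order from any one of its replica holders.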
JobTracker finds the best TaskTracker nodes to execute tasks based on:
-data locality
-available slots to execute a task on a given node
HDFS High Availability
Namenode metadata is written to a shared storage (Journal Manager)
Only one active NN can write to shared storage
Passive NNs read & replay metadata from shared storage
When active NN fails, one of the passive NNs is promoted to active
Snapshot:
Able to store a checkpointed state of HDFS
Improved performance:
Multithreaded random read
HDFS v1: 264MB/sec
HDFS v2: 1395MB/sec (about 5X !!)
Federation
Namenode stores metadata in memory
With a very large number of files, the NameNode could exhaust its memory
Spread metadata over multiple namenodes
Details about HDFS Federation.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|resourcemanager:port> specify a ResourceManager
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
Everything you need to know about hdfs commands:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
Description:
List the contents that match the specified pattern.
If path is not specified, the contents of /user/<current_user> are listed
Options:
-d Directories are listed as plain files.
-h Formats the sizes of files in a human-readable fashion, rather than a number of bytes.
-R Recursively list the contents of directories.
Output:
(<permissions> <-/#replicas> <userid> <groupid> <size(in bytes)> <modification_date> <directoryName/fileName>)
Description:
Create a directory in specified location.
Options:
-p : Create directories in the specified path, if they do not exist
Description:
Copy files into fs.
Options:
-f : If the file already exists, copying does not fail & the destination is overwritten.
-p : Preserves access time, modification time, ownership and the mode.
Description:
Copy files from hdfs. When copying multiple files, the destination must be a directory.
Options:
-p : Preserves access time, modification time, ownership and the mode.
-ignorecrc : Files that fail the CRC check may be copied with this option.
-crc : Files and CRCs may be copied using this option.
Description:
Copy files from local filesystem into hdfs.
It is the same as the “put” command but specific to the local filesystem
Options:
-f : If the file already exists, copying does not fail & the destination is overwritten.
-p : Preserves access time, modification time, ownership and the mode.
Description:
Copy files from hdfs to local filesystem. When copying multiple files, the destination must be a directory.
It is the same as the “get” command but specific to the local filesystem
Options:
-p : Preserves access time, modification time, ownership and the mode.
-ignorecrc : Files that fail the CRC check may be copied with this option.
-crc : Files and CRCs may be copied using this option.
Description:
Same as -copyFromLocal, except that the source is deleted after it's copied.
Description:
Copy files from hdfs to the same hdfs. File pattern can be specified. When copying multiple files, the destination must be a directory.
Options:
-f : If the file already exists, copying does not fail & the destination is overwritten.
-p : Preserves access time, modification time, ownership and the mode.
Description:
Move files from hdfs to the same hdfs. File pattern can be specified. When moving multiple files, the destination must be a directory.
Description:
Get all the files in the directories that match the source file pattern and
merge and sort them to only one file on local fs. <src> is kept.
Options:
-nl Add a newline character at the end of each file.
-cat [-ignoreCrc] <src> ... :
Fetch all files that match the file pattern <src> and display their content on
stdout.
-tail [-f] <file> :
Show the last 1KB of the file.
-f Shows appended data as the file grows.
-text [-ignoreCrc] <src> ... :
Takes a source file and outputs the file in text format.
The allowed formats are zip, TextRecordInputStream, and Avro.
Description:
Delete all files that match the specified file pattern. Equivalent to the Unix command "rm <src>"
Options:
-skipTrash option bypasses trash, if enabled, and immediately deletes <src>
-f If the file does not exist, do not display a diagnostic message or
modify the exit status to reflect an error.
-[rR] Recursively deletes directories
-chgrp [-R] GROUP PATH... :
This is equivalent to -chown ... :GROUP ...
-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH... :
Changes permissions of a file. This works similar to the shell's chmod command with a few exceptions.
-R modifies the files recursively. This is the only option currently supported.
<MODE> Mode is the same as mode used for the shell's command.
-chown [-R] [OWNER][:[GROUP]] PATH... :
Changes owner and group of a file. This is similar to the shell's chown command with a few exceptions.
-R modifies the files recursively. This is the only option currently supported.
-du [-s] [-h] <path> ... :
Show the amount of space, in bytes, used by the files that match the specified file pattern. The following flags are optional:
-s Rather than showing the size of each individual file that matches the pattern, shows the total (summary) size.
-h Formats the sizes of files in a human-readable fashion rather than a number of bytes.
Note that, even without the -s option, this only shows size summaries one level deep into a directory.
The output is in the form
size disk space consumed name(full path)
-touchz <path> ... :
Creates a file of zero length at <path> with current time as the timestamp of
that <path>. An error is returned if the file exists with non-zero length
-test -[defsz] <path> :
Options:
-d : if the path is a directory, return 0.
-e : if the path exists, return 0.
-f : if the path is a file, return 0.
-s : if the path is not empty, return 0.
-z : if the file is zero length, return 0.
-stat [format] <path> ... :
Print statistics about the file/directory at <path> in the specified format.
Format accepts filesize in blocks (%b), group name of owner(%g), filename (%n),
block size (%o), replication (%r), user name of owner(%u), modification date
(%y, %Y)
-count [-q] [-h] [-v] <path> ... :
Count the number of directories, files and bytes under the paths that match the specified file pattern.
The -h option shows file sizes in human readable format.
The -v option displays a header line.
The output columns are:
DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
or, with the -q option:
QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
-setrep [-R] [-w] <rep> <path> ... :
Set the replication level of a file. If <path> is a directory then the command
recursively changes the replication factor of all files under the directory tree
rooted at <path>.
-w It requests that the command waits for the replication to complete. This
can potentially take a very long time.
-R It is accepted for backwards compatibility. It has no effect.
The block size specified by dfs.blocksize should be a multiple of 512
-copyFromLocal error if block size is not valid:
Invalid values: dfs.bytes-per-checksum (=512) must divide block size (=104857601).
-expunge :
Empty the Trash
To enable HDFS trash, set fs.trash.interval > 0 in core-site.xml
Deleted data goes in hdfs folder at : /user/<username>/.Trash/
Options (hdfs fsck <path>):
-move move corrupted files to /lost+found
-delete delete corrupted files
-files print out files being checked
-openforwrite print out files opened for write
-includeSnapshots include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it
-list-corruptfileblocks print out list of missing blocks and files they belong to
-blocks print out block report
-locations print out locations for every block
-racks print out network topology for data-node locations
Command:
hdfs dfsadmin -help report
Description:
Reports basic filesystem information and statistics.
The dfs usage can differ from "du" usage because it
measures raw space used by replication, checksums,
snapshots, etc. on all the DataNodes.
Optional flags may be used to filter the list of displayed DNs.
Options:
-report [-live] [-dead] [-decommissioning]
hdfs getconf
[-namenodes] gets list of namenodes in the cluster.
[-secondaryNameNodes] gets list of secondary namenodes in the cluster.
[-backupNodes] gets list of backup nodes in the cluster.
[-includeFile] gets the include file path that defines the datanodes that can join the cluster.
[-excludeFile] gets the exclude file path that defines the datanodes that need to be decommissioned.
[-nnRpcAddresses] gets the namenode rpc addresses
[-confKey [key]] gets a specific key from the configuration