8. Blocks & Block Caching
A block is the minimum amount of data that a disk can read or write.
Filesystem blocks are typically a few kilobytes in size, whereas disk blocks are normally 512 bytes.
HDFS, too, has the concept of a block, but it is a much larger unit—128 MB by default.
Like in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are
stored as independent units.
hdfs fsck /user/file.txt -files -blocks
For frequently accessed files, the blocks may be explicitly cached in the datanode’s memory, in an
off-heap block cache.
By default a block is cached in only one datanode’s memory, although the number is configurable on a per-file basis.
The dfs.datanode.max.locked.memory property sets the maximum amount of memory a datanode will use for caching blocks, as sketched below.
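A minimal hdfs-site.xml sketch, assuming a 128 MB cache budget per datanode; the value is in bytes and must not exceed the memlock ulimit of the datanode user:
<property>
  <name>dfs.datanode.max.locked.memory</name>
  <value>134217728</value> <!-- 128 MB; example value, tune per cluster -->
</property>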
Using the hdfs cacheadmin command we can add cache pools, add cache directives for paths, and set a TTL (time-to-live):
hdfs cacheadmin -addDirective -path <path> -pool <pool-name> [-force] [-replication <replication>] [-ttl <time-to-live>]
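For example, the following commands create a cache pool, cache a file in it for one day, and list the resulting directives; the pool name and path are illustrative:
hdfs cacheadmin -addPool books
hdfs cacheadmin -addDirective -path /user/file.txt -pool books -ttl 1d
hdfs cacheadmin -listDirectives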
10. Copy with distcp for data backups
hadoop distcp file1 file2
hadoop distcp dir1 dir2
hadoop distcp -update dir1 dir2 (with -update, only the files that have changed are copied; if you are unsure of the effect of a distcp operation, try it on a small test tree first)
hadoop distcp -update -delete -p hdfs://namenode1/foo hdfs://namenode2/foo
The -delete flag causes distcp to delete any files or directories from the destination that
are not present in the source, and -p means that file status attributes like permissions,
block size and replication are preserved.
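If the two clusters run incompatible versions of HDFS, the copy can be made over the version-independent webhdfs protocol instead; the hosts below are illustrative, and 50070 is the default namenode HTTP port in Hadoop 2 (9870 in Hadoop 3):
hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/foo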
11. Commissioning and Decommissioning Nodes
Commissioning new nodes
Add the network addresses of the new nodes to the include file.
Update the namenode with the new set of permitted datanodes using this command:
% hdfs dfsadmin -refreshNodes
Update the resource manager with the new set of permitted node managers using:
% yarn rmadmin -refreshNodes
Update the slaves file with the new nodes, so that they are included in future operations performed by the Hadoop control
scripts.
Start the new datanodes and node managers.
Check that the new datanodes and node managers appear in the web UI.
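They can also be checked from the command line; these standard commands list the registered datanodes and node managers respectively:
% hdfs dfsadmin -report
% yarn node -list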
Decommissioning old nodes
Decommissioning is controlled by an exclude file, which for HDFS is set by the dfs.hosts.exclude property and for YARN by the yarn.resourcemanager.nodes.exclude-path property (see the configuration sketch after these steps). Add the network addresses of the nodes to be decommissioned to the exclude file, then:
Update the namenode with the new set of permitted datanodes, using this command:
% hdfs dfsadmin -refreshNodes
Update the resource manager with the new set of permitted node managers using:
% yarn rmadmin -refreshNodes
Wait until the web UI reports the datanodes as Decommissioned before shutting them down.
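A minimal sketch of the corresponding hdfs-site.xml and yarn-site.xml entries; the file paths shown are illustrative:
<!-- hdfs-site.xml -->
<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/include</value>
</property>
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/exclude</value>
</property>
<!-- yarn-site.xml -->
<property>
  <name>yarn.resourcemanager.nodes.exclude-path</name>
  <value>/etc/hadoop/conf/exclude</value>
</property>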
12. MapReduce inspiration
The name MapReduce comes from functional programming
- Map is the name of a higher-order function that applies a given function
to each element of a list. Sample in Scala:
val numbers = List(1,2,3,4,5)
numbers.map(x => x * x) == List(1,4,9,16,25)
- Reduce is the name of a higher-order function that analyzes a recursive
data structure and, through use of a given combining operation, recombines
the results of recursively processing its constituent parts,
building up a return value. Sample in Scala:
val numbers = List(1,2,3,4,5)
numbers.reduce(_ + _) == 15
Note: MapReduce takes an input, splits it into smaller parts, executes the code of
the mapper on every part, then gives all the results to one or more reducers
that merge them into one final output.
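A plain-Scala sketch of this flow, counting words over a made-up input (no Hadoop involved):
val lines = List("hello world", "hello hadoop")
// map phase: emit a (word, 1) pair for every word in every part
val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))
// shuffle the pairs by key, then reduce: sum the counts per word
val counts = mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }
// counts == Map("hello" -> 2, "world" -> 1, "hadoop" -> 1)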