Hadoop Distributed File System

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware.
It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs. The core of Apache Hadoop consists of a storage part (HDFS) and a processing part (MapReduce).


  1. Hadoop DFS – Rutvik Bapat (12070121667)
  2. About Hadoop
     • Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware.
     • It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.
     • The core of Apache Hadoop consists of a storage part (HDFS) and a processing part (MapReduce).
     • Developer – Apache Software Foundation
     • Written in Java
  3. Benefits
     • Computing power – a distributed computing model ideal for big data.
     • Flexibility – store any amount of any kind of data.
     • Fault tolerance – if a node goes down, jobs are automatically redirected to other nodes, and multiple copies/replicas of all data are stored automatically.
     • Low cost – the open-source framework is free and uses commodity hardware to store large quantities of data.
     • Scalability – the system can be grown easily by adding more nodes.
  4. HDFS Goals
     • Detection of faults and automatic recovery.
     • High throughput of data access rather than low latency.
     • High aggregate data bandwidth, scaling to hundreds of nodes in a single cluster.
     • A write-once-read-many access model for files.
     • Applications move themselves closer to where the data is located.
     • Easy portability.
  5. Some Nomenclature
     • A Rack is a collection of nodes that are physically stored close together and are all on the same network.
     • A Cluster is a collection of racks.
     • NameNode – manages the file system namespace and regulates access by clients. There is a single NameNode per cluster.
     • DataNode – serves read and write requests, and performs block creation, deletion, and replication upon instruction from the NameNode.
     • A file is split into one or more blocks, and these blocks are stored across DataNodes.
     • A Hadoop block is a file on the underlying file system. The default size is 64 MB. All blocks in a file except the last are the same size.
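The fixed-size-blocks rule above (every block full-size except possibly the last) can be sketched in a few lines. This is an illustrative helper, not a Hadoop API; the function name and constant are invented:

```python
# Sketch: how a file's size maps onto fixed-size HDFS-style blocks.
# BLOCK_SIZE and split_into_blocks are illustrative, not Hadoop APIs.

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the sizes of the blocks a file of file_size bytes occupies.
    All blocks are full-size except possibly the last one."""
    full, remainder = divmod(file_size, block_size)
    blocks = [block_size] * full
    if remainder:
        blocks.append(remainder)  # only the last block may be short
    return blocks

# A 150 MB file occupies two full 64 MB blocks plus a 22 MB tail block.
print(split_into_blocks(150 * 1024 * 1024))
```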
  6. MapReduce – Heart of Hadoop
  7. A Master-Slave Architecture
  8. Replica Management
     • The NameNode keeps track of the rack id each DataNode belongs to.
     • The default replica placement policy is as follows:
       • One third of replicas are on one node.
       • Two thirds of replicas (including the above) are on one rack.
       • The other third are evenly distributed across the remaining racks.
     • This policy improves write performance without compromising data reliability or read performance.
     • HDFS tries to satisfy a read request from the replica that is closest to the reader.
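For the common replication factor of 3, the policy above works out to: one replica on the writer's node, and the other two on distinct nodes of a single remote rack. A minimal sketch of that choice, with an invented topology structure (not Hadoop's rack-awareness API):

```python
import random

# Sketch of default HDFS placement for replication factor 3: first
# replica on the writer's node, the other two on distinct nodes of one
# other rack. The topology dict (rack -> nodes) is illustrative.

def place_replicas(topology: dict[str, list[str]], writer_rack: str,
                   writer_node: str, rng: random.Random) -> list[tuple[str, str]]:
    """Return (rack, node) pairs for the three replicas of one block."""
    replicas = [(writer_rack, writer_node)]      # replica 1: writer's node
    remote_racks = [r for r in topology if r != writer_rack]
    rack = rng.choice(remote_racks)              # pick one remote rack
    nodes = rng.sample(topology[rack], 2)        # two distinct nodes on it
    replicas += [(rack, n) for n in nodes]       # replicas 2 and 3
    return replicas

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
print(place_replicas(topology, "rack1", "n1", random.Random(0)))
```

This gives two thirds of the replicas on one rack (the remote one) while the write pipeline only has to cross racks once.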
  9. NameNode
     • Stores the HDFS namespace.
     • Records every change to file system metadata in a transaction log called the EditLog.
     • The namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage.
     • Both the EditLog and the FsImage are stored on the NameNode's local file system.
     • Keeps an image of the namespace and file blockmap in memory.
  10. NameNode
     • On startup:
       • Reads the FsImage and EditLog from disk.
       • Applies all transactions from the EditLog to the in-memory copy of the FsImage.
       • Flushes the modified FsImage back to disk. This is called checkpointing.
     • Checkpointing occurs only when the NameNode starts up; there is currently no checkpointing after startup.
     • After checkpointing, the NameNode enters safemode.
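The startup sequence above can be sketched as replaying logged transactions onto the namespace image. The record shapes here are invented for illustration; real HDFS uses binary on-disk formats:

```python
# Sketch of startup checkpointing: replay EditLog entries onto the
# in-memory copy of the FsImage, producing the image that would then be
# flushed back to disk. (op, path, size) tuples are illustrative.

def checkpoint(fsimage: dict[str, int], editlog: list[tuple[str, str, int]]) -> dict[str, int]:
    """Apply logged transactions to the namespace image (path -> size)."""
    image = dict(fsimage)                  # in-memory copy of the FsImage
    for op, path, size in editlog:
        if op == "create":
            image[path] = size
        elif op == "delete":
            image.pop(path, None)
    return image                           # flushed to disk; EditLog truncated

fsimage = {"/a.txt": 100}
editlog = [("create", "/b.txt", 200), ("delete", "/a.txt", 0)]
print(checkpoint(fsimage, editlog))   # {'/b.txt': 200}
```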
  11. Safemode
     • Replication of data blocks does not occur in safemode.
     • The NameNode receives Heartbeat and Blockreport messages from DataNodes.
     • A Blockreport contains the list of data blocks at a DataNode.
     • Each block has a specified minimum number of replicas.
     • A block is considered safely replicated when that minimum number of replicas has checked in with the NameNode.
     • After a configurable percentage of safely replicated data blocks has checked in, the NameNode exits safemode.
     • It then replicates any blocks that were not safely replicated.
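The exit condition above is a simple threshold check. A minimal sketch, assuming a map of block id to the number of replicas that have checked in (names and the 0.999 default threshold are illustrative):

```python
# Sketch of the safemode exit check: count blocks whose reported replica
# count meets the minimum, and allow exit once a configurable fraction
# of all blocks is "safe". Function name and defaults are illustrative.

def can_exit_safemode(replica_counts: dict[str, int], min_replicas: int = 1,
                      threshold: float = 0.999) -> bool:
    """replica_counts maps block id -> replicas that have checked in."""
    if not replica_counts:
        return True
    safe = sum(1 for n in replica_counts.values() if n >= min_replicas)
    return safe / len(replica_counts) >= threshold

print(can_exit_safemode({"blk1": 3, "blk2": 2}))   # True: all blocks safe
print(can_exit_safemode({"blk1": 3, "blk2": 0}))   # False: blk2 unreported
```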
  12. DataNode
     • Stores HDFS data in files in its local file system.
     • Has no knowledge about HDFS files.
     • Stores each HDFS block in a separate file.
     • Stores these files in subdirectories instead of one single directory.
     • On startup:
       • Scans through its local file system.
       • Generates a list of all HDFS data blocks.
       • Sends the report to the NameNode. This is called the Blockreport.
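The startup scan above can be sketched as a directory walk. The `blk_` prefix mirrors the naming real DataNodes use for block files, but the directory layout here is invented:

```python
import os
import tempfile

# Sketch of building a Blockreport: walk the DataNode's local storage
# directory tree and list every block file found, regardless of which
# subdirectory it sits in.

def block_report(storage_dir: str) -> list[str]:
    """Return the names of all block files under storage_dir."""
    blocks = []
    for root, _dirs, files in os.walk(storage_dir):
        blocks.extend(f for f in files if f.startswith("blk_"))
    return sorted(blocks)

# Fake a DataNode directory with two blocks in a subdirectory.
with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, "subdir0"))
    for name in ("blk_1001", "blk_1002"):
        open(os.path.join(d, "subdir0", name), "w").close()
    print(block_report(d))   # ['blk_1001', 'blk_1002']
```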
  13. Staging
     • A client request to create a file does not reach the NameNode immediately.
     • Initially, the client caches the file data in a temporary local file.
     • Once the local file accumulates data worth at least one HDFS block, the NameNode is contacted.
     • The NameNode inserts the file name into the file system hierarchy and allocates a data block for it.
     • It replies with the identity of the DataNode and the destination data block.
     • It also sends the list of DataNodes that will replicate the block.
  14. Staging
     • The client then flushes the block of data to the DataNode.
     • When the file is closed, the remaining unflushed data is also flushed to the DataNode.
     • The client then tells the NameNode that the file is closed.
     • The NameNode commits the file creation operation into its persistent store.
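The client-side half of the staging flow in the two slides above can be sketched with a buffer that flushes a block at a time, plus a final flush on close. The class and the list standing in for a DataNode are invented for illustration:

```python
# Sketch of client-side staging: writes accumulate in a local buffer and
# are flushed (here, to a list standing in for a DataNode) only once a
# full block is available, or when the file is closed.

class StagingClient:
    def __init__(self, block_size: int):
        self.block_size = block_size
        self.buffer = b""                        # temporary local file
        self.flushed_blocks: list[bytes] = []    # stand-in for DataNode writes

    def write(self, data: bytes) -> None:
        self.buffer += data
        while len(self.buffer) >= self.block_size:   # full block: flush it
            self.flushed_blocks.append(self.buffer[:self.block_size])
            self.buffer = self.buffer[self.block_size:]

    def close(self) -> None:
        if self.buffer:                              # flush the partial tail
            self.flushed_blocks.append(self.buffer)
            self.buffer = b""
        # ...then tell the NameNode the file is closed.

c = StagingClient(block_size=4)
c.write(b"abcdef")        # one full block flushed, 2 bytes remain staged
c.close()                 # the tail is flushed on close
print(c.flushed_blocks)   # [b'abcd', b'ef']
```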
  15. Replication Pipelining
     • The client sends the data block to the first DataNode in small portions.
     • That DataNode writes each portion to its local file system.
     • It then passes the portion on to the next DataNode for replication, as determined by the NameNode.
     • Each DataNode, on receiving a portion, writes it to its file system and passes it on to the next DataNode.
     • This continues until the portion reaches the last DataNode holding a replica of the block.
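The chain above can be sketched as each portion being written locally and then forwarded down the DataNode list chosen by the NameNode. The lists here stand in for each node's local file system:

```python
# Sketch of replication pipelining: each portion travels down the chain
# of DataNodes, being written locally at each hop before forwarding.
# Per-node storage lists stand in for each node's local file system.

def pipeline_write(portions: list[bytes], datanodes: list[list[bytes]]) -> None:
    """Push each portion through the DataNode chain in order."""
    for portion in portions:
        for node_storage in datanodes:   # write locally, then forward
            node_storage.append(portion)

dn1, dn2, dn3 = [], [], []
pipeline_write([b"p1", b"p2"], [dn1, dn2, dn3])
print(dn1 == dn2 == dn3, dn1)   # True [b'p1', b'p2']
```

Because forwarding is per-portion rather than per-block, later DataNodes start writing long before the first one has the whole block.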
  16. Thank You!
