Hadoop Distributed File System (HDFS)

  1. Hadoop HDFS
  2.  What is Big Data?  Limitations of the existing solutions  Solving the problem with Hadoop  Introduction to Hadoop  Hadoop Eco-System  Hadoop Core Components  HDFS Architecture  Anatomy of a File Write and Read Topics of the Day Slide 2
  3. What is Big Data? Slide 3  Lots of data (terabytes or petabytes)  Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.  The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. [Word cloud: cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, mobile, processing, Big Data]
  4. What is Big Data? Slide 4  Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.  NYSE generates about one terabyte of new trade data per day to perform stock trading analytics to determine trends for optimal trades.
  5. Unstructured Data is Exploding Slide 5
  6. IBM’s Definition – Big Data Characteristics Slide 6  Volume  Velocity  Variety  Examples of variety: web logs, videos, images, audio, sensor data http://www-01.ibm.com/software/data/bigdata/
  7.  Web and e-tailing  Recommendation Engines  Ad Targeting  Search Quality  Abuse and Click Fraud Detection  Telecommunications  Customer Churn Prevention  Network Performance Optimization  Calling Data Record (CDR) Analysis  Analyzing Network to Predict Failure Common Big Data Customer Scenarios Slide 7
  8.  Government  Fraud Detection and Cyber Security  Welfare Schemes  Justice  Healthcare & Life Sciences  Health Information Exchange  Gene Sequencing  Serialization  Healthcare Service Quality Improvements  Drug Safety Common Big Data Customer Scenarios (Contd.) Slide 8
  9. Common Big Data Customer Scenarios (Contd.) Slide 9  Banks and Financial Services  Modeling True Risk  Threat Analysis  Fraud Detection  Trade Surveillance  Credit Scoring and Analysis  Retail  Point of Sales Transaction Analysis  Customer Churn Analysis  Sentiment Analysis
  10. Hidden Treasure Case Study: Sears Holding Corporation Slide 10 *Sears was using traditional systems such as Oracle Exadata, Teradata and SAS to store and process customer activity and sales data.  Insight into data can provide business advantage.  Some key early indicators can mean fortunes to business.  More precise analysis with more data.
  11. Limitations of Existing Data Analytics Architecture Slide 11 [Diagram: instrumentation and (mostly append) collection feed an ETL compute grid into an RDBMS that holds only aggregated data for BI reports and interactive apps; roughly 90% of the ~2 PB sits archived in a storage-only grid holding the original raw data, so a meagre 10% of the ~2 PB is available for BI.] Limitations: 1. Can’t explore the original high-fidelity raw data 2. Moving data to compute doesn’t scale 3. Premature data death http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
  12. Solution: A Combined Storage and Compute Layer Slide 12 [Diagram: instrumentation and (mostly append) collection feed a Hadoop storage + compute grid where the entire ~2 PB of data is available for processing, with no data archiving; aggregated data still flows to an RDBMS for BI reports and interactive apps.] 1. Data exploration & advanced analytics 2. Scalable throughput for ETL & aggregation 3. Keep data alive forever *Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with the existing non-Hadoop solutions.
  13. Hadoop Differentiating Factors Slide 13  Accessible  Robust  Simple  Scalable
  14. Hadoop – It’s about Scale and Structure Slide 14  RDBMS / EDW / MPP / NoSQL vs. HADOOP  Data Types: Structured vs. Multi and Unstructured  Processing: Limited, No Data Processing vs. Processing coupled with Data  Governance: Standards & Structured vs. Loosely Structured  Schema: Required on Write vs. Required on Read  Cost: Software License vs. Support Only  Resources: Known Entity vs. Growing, Complexities, Wide  Best Fit Use: Interactive OLAP Analytics, Operational Data Store vs. Massive Storage/Processing
  15. Why DFS? Slide 15 Read 1 TB of data:  1 machine – 4 I/O channels, each channel 100 MB/s  10 machines – 4 I/O channels each, each channel 100 MB/s
  16. Why DFS? Slide 16 Read 1 TB of data:  1 machine – 4 I/O channels, each channel 100 MB/s → 45 minutes  10 machines – 4 I/O channels each, each channel 100 MB/s
  17. Why DFS? Slide 17 Read 1 TB of data:  1 machine – 4 I/O channels, each channel 100 MB/s → 45 minutes  10 machines – 4 I/O channels each, each channel 100 MB/s → 4.5 minutes
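  The slide’s figures follow from simple throughput arithmetic: one machine with 4 I/O channels at 100 MB/s each reads at 400 MB/s, so 1 TB (about 1,000,000 MB) takes roughly 1,000,000 / 400 ≈ 2,500 seconds, i.e. about 42 minutes, which the slide rounds to 45 minutes. Spreading the same read across 10 such machines gives an aggregate 4,000 MB/s, cutting the time to about 250 seconds, i.e. roughly 4.5 minutes.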
  18. What is Hadoop? Slide 18  Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.  It is an open-source data management framework with scale-out storage and distributed processing.
  19. Hadoop Eco-System Slide 19 [Diagram: HDFS (Hadoop Distributed File System) and the MapReduce framework form the core; HBase runs on HDFS; Pig Latin (data analysis), Hive (DW system), Mahout (machine learning) and Apache Oozie (workflow) sit on top of MapReduce; Flume ingests unstructured or semi-structured data, while Sqoop imports or exports structured data.]
  20. Hadoop Core Components Slide 20 [Diagram: the admin node runs the Name Node (HDFS cluster) and the Job Tracker (MapReduce engine); each slave node runs a Data Node and a Task Tracker.]
  21. Hadoop Core Components Slide 21 Hadoop is a system for large scale data processing. It has two main components:  HDFS – Hadoop Distributed File System (Storage)  Distributed across “nodes”  Natively redundant  NameNode tracks locations.  MapReduce (Processing)  Splits a task across processors  “near” the data & assembles results  Self-Healing, High Bandwidth  Clustered storage  JobTracker manages the TaskTrackers
  22. HDFS Definition Slide 22 The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. It has many similarities with existing distributed file systems. Hadoop Distributed File System (HDFS™) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. • HDFS consists of the following components (daemons): – Name Node – Data Node – Secondary Name Node
  23. HDFS Components Slide 23 NameNode: The NameNode, a master server, manages the file system namespace and regulates access to files by clients. It maintains and manages the blocks that are present on the DataNodes.  Meta-data in Memory – The entire metadata is in main memory  Types of Metadata – List of files – List of blocks for each file – List of DataNodes for each block – File attributes, e.g. creation time, replication factor  A Transaction Log – Records file creations, file deletions, etc.
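  This metadata is what a client sees when it asks the NameNode about a file. As a minimal sketch (not from the slides) using the Hadoop Java client API, the class name and file path below are purely illustrative:
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();                  // reads core-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);                      // connects to the NameNode named by fs.default.name
    Path file = new Path("/user/training/hadoop/sample.txt");  // hypothetical path, for illustration only
    FileStatus status = fs.getFileStatus(file);                // file attributes held by the NameNode
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      // each entry names the DataNodes that hold one block of the file
      System.out.println(block.getOffset() + " " + block.getLength() + " " + Arrays.toString(block.getHosts()));
    }
    fs.close();
  }
}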
  24. HDFS Components Slide 24 Data Node: Slaves which are deployed on each machine and provide the actual storage. DataNodes, one per node in the cluster, manage the storage attached to the nodes that they run on.  A Block Server – Stores data in the local file system (e.g. ext3) – Stores meta-data of a block (e.g. CRC) – Serves data and meta-data to clients  Block Report – Periodically sends a report of all existing blocks to the NameNode  Facilitates Pipelining of Data – Forwards data to other specified DataNodes
  25. Understanding the File System Block Placement • Current Strategy  One replica on the local node  Second replica on a remote rack  Third replica on the same remote rack  Additional replicas are randomly placed • Clients read from the nearest replica Data Correctness • Use checksums to validate data  Use CRC32 • File Creation  Client computes a checksum per 512 bytes  DataNode stores the checksum • File Access  Client retrieves the data and checksum from the DataNode  If validation fails, the client reads from another replica
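  As a hedged illustration of this end-to-end checksumming, a client can ask HDFS for a file’s checksum through the Java API; the class name and path below are hypothetical:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowChecksum {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // The returned checksum is derived from the per-chunk CRCs the DataNodes store alongside each block.
    FileChecksum checksum = fs.getFileChecksum(new Path("/user/training/hadoop/sample.txt")); // hypothetical path
    System.out.println(checksum);   // prints the checksum algorithm name and value
    fs.close();
  }
}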
  26. Understanding the File System Data Pipelining  Client retrieves a list of DataNodes on which to place replicas of a block  Client writes the block to the first DataNode  The first DataNode forwards the data to the next DataNode in the pipeline  When all replicas are written, the client moves on to write the next block in the file Rebalancer • Goal: the percentage of disk used should be similar across DataNodes  Usually run when new DataNodes are added  The cluster stays online while the Rebalancer is active  The Rebalancer is throttled to avoid network congestion  Command-line tool
  27. Secondary NameNode Slide 27  Secondary NameNode: not a hot standby for the NameNode  Connects to the NameNode every hour*  Housekeeping and backup of NameNode metadata  Saved metadata can be used to rebuild a failed NameNode
  28. HDFS Architecture Slide 28 [Diagram: clients send metadata ops to the NameNode, which holds the metadata (name, replicas, …) and issues block ops; clients read from and write to DataNodes directly, and DataNodes replicate blocks across Rack 1 and Rack 2.]
  29. Anatomy of a File Write Slide 29 [Diagram: 1. the HDFS client calls create on the Distributed File System 2. which creates the file on the NameNode 3. the client writes data 4. write packets flow down a pipeline of DataNodes 5. which send ack packets back 6. the client closes the stream 7. and the Distributed File System signals complete to the NameNode.]
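  A minimal Java sketch of the client side of this write path (the class name, path, and contents are illustrative; the numbered steps above happen inside the FileSystem and FSDataOutputStream calls):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteFile {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Steps 1-2: create() asks the NameNode to create the file entry
    FSDataOutputStream out = fs.create(new Path("/user/training/hadoop/hello.txt")); // hypothetical path
    // Steps 3-5: write() streams packets down the DataNode pipeline, which acknowledges each packet
    out.writeUTF("Hello HDFS");
    // Steps 6-7: close() flushes the remaining packets and completes the file on the NameNode
    out.close();
    fs.close();
  }
}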
  30. Anatomy of a File Read Slide 30 [Diagram: inside the client JVM on the client node, 1. the HDFS client calls open on the Distributed File System 2. which gets the block locations from the NameNode 3.–5. the client reads through the FS Data Input Stream, which streams each block from the nearest DataNode 6. the client closes the stream.]
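  And a matching read-side sketch, assuming the hypothetical file written in the previous example:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFile {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Steps 1-2: open() fetches the block locations of the first blocks from the NameNode
    FSDataInputStream in = fs.open(new Path("/user/training/hadoop/hello.txt")); // hypothetical path
    // Steps 3-5: reads are served directly from the nearest DataNode holding each block
    System.out.println(in.readUTF());
    // Step 6: close the stream
    in.close();
    fs.close();
  }
}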
  31. Replication and Rack Awareness Slide 31
  32. Hadoop Cluster Architecture (Contd.) [Diagram: the client talks to the Name Node for HDFS and the Job Tracker for MapReduce; every slave node in the cluster runs both a Data Node and a Task Tracker.]
  33. Hadoop Cluster: A Typical Use Case Name Node – RAM: 64 GB, Hard disk: 1 TB, Processor: Xeon with 8 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS, Power: redundant power supply Secondary Name Node – RAM: 32 GB, Hard disk: 1 TB, Processor: Xeon with 4 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS, Power: redundant power supply Data Node – RAM: 16 GB, Hard disk: 6 x 2 TB, Processor: Xeon with 2 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS
  34. Hadoop Cluster: Facebook http://wiki.apache.org/hadoop/PoweredBy
  35. Hadoop can run in any of the following three modes: Standalone (or Local) Mode  No daemons, everything runs in a single JVM.  Suitable for running MapReduce programs during development.  Has no DFS. Pseudo-Distributed Mode  Hadoop daemons run on the local machine. Fully-Distributed Mode  Hadoop daemons run on a cluster of machines. Hadoop Cluster Modes
  36. Hadoop 1.x: Core Configuration Files  core-site.xml – Core  hdfs-site.xml – HDFS  mapred-site.xml – MapReduce
  37. Hadoop Configuration Files  hadoop-env.sh – Environment variables that are used in the scripts to run Hadoop.  core-site.xml – Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.  hdfs-site.xml – Configuration settings for the HDFS daemons: the namenode, the secondary namenode and the datanodes.  mapred-site.xml – Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers.  masters – A list of machines (one per line) that each run a secondary namenode.  slaves – A list of machines (one per line) that each run a datanode and a tasktracker.
  38. core-site.xml and hdfs-site.xml core-site.xml: <?xml version="1.0"?> <!--core-site.xml--> <configuration> <property> <name>fs.default.name</name> <value>hdfs://localhost:8020/</value> </property> </configuration> hdfs-site.xml: <?xml version="1.0"?> <!--hdfs-site.xml--> <configuration> <property> <name>dfs.replication</name> <value>1</value> </property> </configuration>
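  A small sketch, assuming these two files are on the client’s classpath, of how client code sees the values defined above (the class name is illustrative):
import org.apache.hadoop.conf.Configuration;

public class ShowConf {
  public static void main(String[] args) {
    Configuration conf = new Configuration();  // core-site.xml is picked up from the classpath
    conf.addResource("hdfs-site.xml");         // make sure the HDFS settings are loaded as well
    System.out.println(conf.get("fs.default.name"));  // hdfs://localhost:8020/ with the files above
    System.out.println(conf.get("dfs.replication"));  // 1 with the files above
  }
}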
  39. Defining HDFS Details in hdfs-site.xml  dfs.data.dir (default: ${hadoop.tmp.dir}/dfs/data) – A list of directories where the datanode stores blocks; each block is stored in only one of these directories. Example value: /disk1/hdfs/data,/disk2/hdfs/data  fs.checkpoint.dir (default: ${hadoop.tmp.dir}/dfs/namesecondary) – A list of directories where the secondary namenode stores checkpoints; it stores a copy of the checkpoint in each directory in the list. Example value: /disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary
  40. mapred-site.xml <?xml version="1.0"?> <configuration> <property> <name>mapred.job.tracker</name> <value>localhost:8021</value> </property> </configuration>
  41. Defining mapred-site.xml  mapred.job.tracker (e.g. localhost:8021) – The hostname and port that the jobtracker RPC server runs on. If set to the default value of local, the jobtracker runs in-process on demand when you run a MapReduce job.  mapred.local.dir (default: ${hadoop.tmp.dir}/mapred/local) – A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.  mapred.system.dir (default: ${hadoop.tmp.dir}/mapred/system) – The directory, relative to fs.default.name, where shared files are stored during a job run.  mapred.tasktracker.map.tasks.maximum (default: 2) – The number of map tasks that may be run on a tasktracker at any one time.  mapred.tasktracker.reduce.tasks.maximum (default: 2) – The number of reduce tasks that may be run on a tasktracker at any one time.
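  For completeness, a hedged sketch of reading these values from a Hadoop 1.x job configuration; JobConf loads mapred-site.xml among its default resources, and the class name here is illustrative:
import org.apache.hadoop.mapred.JobConf;

public class ShowJobConf {
  public static void main(String[] args) {
    JobConf conf = new JobConf();                         // loads mapred-site.xml from the classpath in Hadoop 1.x
    System.out.println(conf.get("mapred.job.tracker"));   // localhost:8021 with the file above
    System.out.println(conf.get("mapred.tasktracker.map.tasks.maximum"));  // 2 unless overridden
  }
}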
  42. 1. http://hadoop.apache.org/docs/r1.1.2/core-default.html 2. http://hadoop.apache.org/docs/r1.1.2/mapred-default.html 3. http://hadoop.apache.org/docs/r1.1.2/hdfs-default.html All Properties
  43. Slaves and Masters Two files are used by the startup and shutdown commands: Masters  Contains a list of hosts, one per line, that are to host Secondary NameNode servers. Slaves  Contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers.
  44. hadoop-env.sh: Per-Process Runtime Environment  hadoop-env.sh is sourced by all of the Hadoop Core scripts provided in the conf/ directory of the installation.  This file also offers a way to provide custom parameters for each of the servers.  Set the JAVA_HOME parameter so Hadoop can locate the JVM.  Examples of environment variables that you can specify: export HADOOP_DATANODE_HEAPSIZE="128" export HADOOP_TASKTRACKER_HEAPSIZE="512"
  45.  NameNode status: http://localhost:50070/dfshealth.jsp  JobTracker status: http://localhost:50030/jobtracker.jsp  TaskTracker status: http://localhost:50060/tasktracker.jsp  DataBlock Scanner Report: http://localhost:50075/blockScannerReport Web UI URLs
  46. Practice HDFS Commands
  47. Open a terminal window to the current working directory. # /home/notroot ----------------------------------------------------------------- # 1. Print the Hadoop version hadoop version ----------------------------------------------------------------- # 2. List the contents of the root directory in HDFS hadoop fs -ls / ----------------------------------------------------------------- # 3. Report the amount of space used and available on the currently mounted filesystem hadoop fs -df hdfs:/ ----------------------------------------------------------------- # 4. Count the number of directories, files and bytes under the paths that match the specified file pattern hadoop fs -count hdfs:/
  48. # 5. Run a DFS filesystem checking utility hadoop fsck / ----------------------------------------------------------------- # 6. Run a cluster balancing utility hadoop balancer ----------------------------------------------------------------- # 7. Create a new directory named “hadoop” below the /user/training directory in HDFS. hadoop fs -mkdir /user/training/hadoop ----------------------------------------------------------------- # 8. Add a sample text file from the local directory named “data” to the new directory you created in HDFS hadoop fs -put data/sample.txt /user/training/hadoop
  49. # 9. List the contents of this new directory in HDFS. hadoop fs -ls /user/training/hadoop ----------------------------------------------------------------- # 10. Add the entire local directory called “retail” to the hadoop directory you created under /user/training in HDFS. hadoop fs -put data/retail /user/training/hadoop ----------------------------------------------------------------- # 11. Since /user/training is your home directory in HDFS, any command that does not have an absolute path is interpreted as relative to that directory. # The next command will therefore list your home directory, and should show the items you’ve just added there. hadoop fs -ls ----------------------------------------------------------------- # 12. See how much space this directory occupies in HDFS. hadoop fs -du -s -h hadoop/retail
  50. # 13. Delete a file ‘customers’ from the “retail” directory. hadoop fs -rm hadoop/retail/customers ----------------------------------------------------------------- # 14. Ensure this file is no longer in HDFS. hadoop fs -ls hadoop/retail/customers ----------------------------------------------------------------- # 15. Delete all files from the “retail” directory using a wildcard. hadoop fs -rm hadoop/retail/* ----------------------------------------------------------------- # 16. To empty the trash hadoop fs -expunge ----------------------------------------------------------------- # 17. Finally, remove the entire retail directory and all of its contents in HDFS. hadoop fs -rm -r hadoop/retail
  51. # 18. Add the purchases.txt file from the local directory named “/home/training/” to the hadoop directory you created in HDFS hadoop fs -copyFromLocal /home/training/purchases.txt hadoop/ ----------------------------------------------------------------- # 19. To view the contents of your text file purchases.txt, which is present in your hadoop directory. hadoop fs -cat hadoop/purchases.txt ----------------------------------------------------------------- # 20. Move a directory from one location to another hadoop fs -mv hadoop apache_hadoop ----------------------------------------------------------------- # 21. Add the purchases.txt file from the “hadoop” directory in HDFS to the directory “data” which is present in your local directory hadoop fs -copyToLocal hadoop/purchases.txt /home/training/data
  52. # 26. Default names of owner and group are training, training # Use ‘-chown’ to change owner name and group name simultaneously hadoop fs -ls hadoop/purchases.txt sudo -u hdfs hadoop fs -chown root:root hadoop/purchases.txt ----------------------------------------------------------------- # 27. Default name of group is training # Use ‘-chgrp’ command to change group name hadoop fs -ls hadoop/purchases.txt sudo -u hdfs hadoop fs -chgrp training hadoop/purchases.txt ----------------------------------------------------------------- # 28. Default replication factor of a file is 3 # Use ‘-setrep’ command to change the replication factor of a file hadoop fs -setrep -w 2 apache_hadoop/sample.txt
  53. # 29. Copy a directory from one node in the cluster to another # Use the ‘distcp’ command to copy # Use the '-overwrite' option to overwrite existing files # Use the '-update' option to synchronize both directories hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop ----------------------------------------------------------------- # 30. Command to make the name node leave safe mode sudo -u hdfs hdfs dfsadmin -safemode leave ----------------------------------------------------------------- # 31. List all the hadoop file system shell commands hadoop fs ----------------------------------------------------------------- # 33. Last but not least, always ask for help! hadoop fs -help
  54. Thank You