Hadoop bigdata overview

Hadoop and Bigdata basics

Published in: Technology

  1. Hadoop (Haritha K)
  2. What is Big Data?
  3. What are the attributes of Big Data?
  4. What are the attributes of Big Data? (continued)
      • Velocity
      • Variety
      • Volume
  5. Solution? Hadoop
      Hadoop is an open-source framework for writing and running distributed applications that process large amounts of data on clusters of commodity hardware using a simple programming model.
      History:
      • Google: 2004
      • Apache and Yahoo: 2009
      • Project creator: Doug Cutting, who named "Hadoop" after his son's yellow toy elephant.
  6. Who is using Hadoop?
  7. Why distributed computing?
  8. Why distributed computing? (continued)
  9. Hadoop Assumptions
      Hadoop is written with large clusters of computers in mind and is built around the following assumptions:
      • Hardware will fail.
      • Processing will be run in batches.
      • Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size.
      • HDFS should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
      • Applications need a write-once-read-many access model.
      • Moving computation is cheaper than moving data.
  10. Hadoop Core Components
      • HDFS (Hadoop Distributed File System): storage
      • MapReduce: execution engine, computation
  11. Hadoop Architecture
  12. Hadoop: Master/Slave
      Hadoop is designed as a master-slave, shared-nothing architecture:
      • Master node (single node)
      • Many slave nodes
  13. HDFS Components
      • NameNode
          • Master of the system
          • Maintains and manages the blocks that are present on the DataNodes
      • DataNodes
          • Slaves that are deployed on each machine
          • Provide the actual storage
          • Responsible for serving read and write requests from clients
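The split between NameNode metadata and DataNode storage can be sketched roughly as follows. This is a toy in-memory model, not the HDFS API: the class `ToyNameNode`, its methods, and the block and node names are all hypothetical and exist only for illustration.

```python
class ToyNameNode:
    """Hypothetical stand-in for NameNode metadata (not the real HDFS API)."""

    def __init__(self):
        # file path -> ordered list of block ids
        self.file_blocks = {}
        # block id -> set of DataNode names holding a replica
        self.block_locations = {}

    def add_block(self, path, block_id, datanodes):
        """Record that a new block of `path` is stored on `datanodes`."""
        self.file_blocks.setdefault(path, []).append(block_id)
        self.block_locations[block_id] = set(datanodes)

    def locate(self, path):
        """Return, for each block of the file, the DataNodes a client could read from."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]


nn = ToyNameNode()
nn.add_block("/logs/day1", "blk_1", {"dn1", "dn2", "dn3"})
nn.add_block("/logs/day1", "blk_2", {"dn2", "dn3", "dn4"})
print(nn.locate("/logs/day1"))
```

The point of the sketch is that the master only answers "where are the blocks?"; the bytes themselves would be read from and written to the DataNodes.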
  14. Rack Awareness
  15. Main Properties of HDFS
      • Large: An HDFS instance may consist of thousands of server machines, each storing part of the file system's data.
      • Replication: Each data block is replicated many times (default is 3).
      • Failure: Failure is the norm rather than the exception.
      • Fault tolerance: Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
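As a rough illustration of what a replication factor of 3 costs, the arithmetic can be sketched like this. The 128 MB block size is an assumption (the slides do not state a block size), and `hdfs_footprint` is a hypothetical helper, not part of Hadoop.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Estimate block and replica counts for one file.

    block_size_mb=128 is an assumed value; replication=3 is the default
    named on the slide.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,
        "replicas": blocks * replication,
        # The last block only occupies its actual size, so raw storage is
        # simply the file size times the replication factor.
        "raw_storage_mb": file_size_mb * replication,
    }


print(hdfs_footprint(1024))
# {'blocks': 8, 'replicas': 24, 'raw_storage_mb': 3072}
```

Under these assumptions a 1 GB file is split into 8 blocks, stored as 24 block replicas, and takes about 3 GB of raw disk across the cluster.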
  16. MapReduce
      • Programming model developed at Google
      • Sort/merge-based distributed computing
      • The underlying system takes care of partitioning the input data, scheduling the program's execution across several machines, handling machine failures, and managing the required inter-machine communication. (This is the key to Hadoop's success.)
  17. MapReduce Components
      • JobTracker is the master node (runs with the NameNode)
          • Receives the user's job
          • Decides how many tasks will run (number of mappers)
          • Decides where to run each mapper (concept of locality)
      • TaskTracker is the slave node (runs on each DataNode)
          • Receives the task from the JobTracker
          • Runs the task until completion (either a map or a reduce task)
          • Always in communication with the JobTracker, reporting progress
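The "concept of locality" can be sketched as a simple placement rule: prefer a node that already holds a replica of the block, and fall back to any node with a free slot. This is a simplified illustration under assumed inputs, not how the real JobTracker is implemented; `assign_mappers` and the node and block names are hypothetical.

```python
def assign_mappers(block_replicas, free_slots):
    """Pick a node for each map task, preferring data-local nodes.

    block_replicas: block id -> list of nodes holding a replica (assumed input)
    free_slots:     node -> number of free map slots (assumed input)
    Returns block id -> (chosen node, True if the task is data-local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block, replicas in block_replicas.items():
        local = [n for n in replicas if slots.get(n, 0) > 0]
        if local:
            node = local[0]
        else:
            candidates = [n for n, s in slots.items() if s > 0]
            if not candidates:
                raise RuntimeError("no free map slots left")
            node = candidates[0]
        slots[node] -= 1
        assignment[block] = (node, node in replicas)
    return assignment


placement = assign_mappers(
    {"blk_1": ["dn1", "dn2"], "blk_2": ["dn1"], "blk_3": ["dn1"]},
    {"dn1": 2, "dn2": 1, "dn3": 1},
)
print(placement)
# {'blk_1': ('dn1', True), 'blk_2': ('dn1', True), 'blk_3': ('dn2', False)}
```

In this run the third block cannot be processed locally because `dn1` has run out of slots, so its data would have to travel over the network to `dn2`.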
  18. How does MapReduce work?
      • The runtime partitions the input and provides it to different Map instances.
      • Map(key, value) → (key', value')
      • The runtime collects the (key', value') pairs and distributes them to several Reduce functions so that each Reduce function gets all the pairs with the same key'.
      • Each Reduce produces a single output file (or none).
      • Map and Reduce are user-written functions.
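The flow described above can be sketched on a single machine as follows. This is a minimal illustration of the programming model, not Hadoop's implementation: `run_mapreduce` is a hypothetical helper, and the example job (grouping words by length) is made up for the demonstration.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run user-written map and reduce functions over (key, value) records."""
    groups = defaultdict(list)
    # Map phase: each input pair may emit any number of (key', value') pairs.
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            # Shuffle: collect all values that share the same key'.
            groups[k2].append(v2)
    # Reduce phase: one call per distinct key', in sorted key order.
    return {k2: reduce_fn(k2, values) for k2, values in sorted(groups.items())}


def word_lengths(_line_no, line):
    """User-written map: emit (word length, word) for every word."""
    for word in line.split():
        yield len(word), word


def count(_length, words):
    """User-written reduce: how many words have this length."""
    return len(words)


result = run_mapreduce(
    [(0, "hadoop stores data"), (1, "map then reduce")],
    word_lengths,
    count,
)
print(result)
# {3: 1, 4: 2, 6: 3}
```

Only `word_lengths` and `count` are job-specific; everything in `run_mapreduce` stands in for what the runtime would do across many machines.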
  19. MapReduce Phases
      Deciding what will be the key and what will be the value is the developer's responsibility.
  20. Example: Color Count
      Job: count the number of occurrences of each color in a data set.
      • Input blocks on HDFS are fed to four Map tasks (parse-hash), each of which produces (k, v) pairs such as (color, 1).
      • Shuffle and sorting are based on k.
      • Three Reduce tasks each consume (k, [v]), such as (color, [1,1,1,1,1,1..]), and produce (k', v'), such as (color, 100).
      • The output file has 3 parts (Part0001, Part0002, Part0003), probably on 3 different machines.
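The color count job, including the split of the output into three parts, might look like the following single-process sketch. The sample input, the function names, and the CRC32-based partitioner are assumptions made for illustration; the slide does not say how keys are assigned to reducers.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # matches the three Reduce tasks on the slide


def map_colors(block):
    """Map: parse one input block and emit (color, 1) for every color seen."""
    for color in block.split():
        yield color, 1


def partition(key, num_reducers=NUM_REDUCERS):
    """Assumed partitioner: a stable hash so that equal keys reach the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def color_count(blocks):
    """Return one {color: count} dict per reducer, i.e. one per output part."""
    shuffled = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for block in blocks:
        for color, one in map_colors(block):
            shuffled[partition(color)][color].append(one)
    # Reduce: each reducer sums the list of 1s it received for each color.
    return [
        {color: sum(ones) for color, ones in sorted(part.items())}
        for part in shuffled
    ]


blocks = ["red blue red", "green blue", "red green green", "blue"]
parts = color_count(blocks)
for i, part in enumerate(parts, start=1):
    print(f"Part{i:04d}", part)
```

Each color ends up in exactly one part, so the complete answer is the union of the three part files rather than any single one of them.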
  21. Hadoop vs. Other Systems (Distributed Databases vs. Hadoop)
      • Computing model
          • Distributed databases: notion of transactions; the transaction is the unit of work; ACID properties, concurrency control
          • Hadoop: notion of jobs; the job is the unit of work; no concurrency control
      • Data model
          • Distributed databases: structured data with a known schema; read/write mode
          • Hadoop: any data will fit, in any format; (un)(semi)structured; read-only mode
      • Cost model
          • Distributed databases: expensive servers
          • Hadoop: cheap commodity machines
      • Fault tolerance
          • Distributed databases: failures are rare; recovery mechanisms
          • Hadoop: failures are common over thousands of machines; simple yet efficient fault tolerance
      • Key characteristics
          • Distributed databases: efficiency, optimizations, fine-tuning
          • Hadoop: scalability, flexibility, fault tolerance
  22. Advantages
      • Reliable shared storage.
      • Simple analysis system.
      • Distributed file system.
      • Tasks are independent.
      • Partial failures are easy to handle: entire nodes can fail and restart.
  23. Disadvantages
      • Lack of central data.
      • Single master node.
      • Managing job flow isn't trivial when intermediate data should be kept.
  24. Thank You