• 1. Presented by: Swati Chaturvedi, B.Tech 3rd Year, Department of Information Technology
• 2. Introduction Big Data: • Big data is a term used to describe the voluminous amount of unstructured and semi-structured data a company creates. • Data that would take too much time and cost too much money to load into a relational database for analysis. • Big data does not refer to any specific quantity; the term is often used when speaking about petabytes and exabytes of data.
• 3. • The New York Stock Exchange generates about one terabyte of new trade data per day. • Facebook hosts approximately 10 billion photos, taking up one petabyte of storage. • Ancestry.com, the genealogy site, stores around 2.5 petabytes of data. • The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
• 4. So What Is The Problem? • The transfer speed of a typical disk is around 100 MB/s. • A standard disk holds 1 terabyte. • Time to read the entire disk = 1,000,000 MB / 100 MB/s = 10,000 seconds, or roughly 3 hours! • Faster processors alone may not help much, because: • network bandwidth is now more of a limiting factor • the physical limits of processor chips have been reached
• 5. So What Do We Do? • The obvious solution is to use multiple processors to solve the same problem by fragmenting it into pieces. • Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read all the data in under two minutes.
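The figures above can be checked with a small back-of-the-envelope calculation; the sketch below simply restates the arithmetic from the previous two slides (1 TB read at 100 MB/s, then split across 100 drives in parallel):

```java
// Back-of-the-envelope estimate: serial vs. parallel read time for 1 TB at 100 MB/s.
public class ReadTimeEstimate {
    public static void main(String[] args) {
        double diskSizeMb = 1_000_000.0;   // 1 TB expressed in MB
        double transferMbPerSec = 100.0;   // sustained transfer speed of one disk
        int drives = 100;                  // drives reading in parallel

        double serialSeconds = diskSizeMb / transferMbPerSec;   // 10,000 s, roughly 2.8 hours
        double parallelSeconds = serialSeconds / drives;        // 100 s, under two minutes

        System.out.printf("Serial read: %.0f s (~%.1f h)%n", serialSeconds, serialSeconds / 3600);
        System.out.printf("Parallel read with %d drives: %.0f s%n", drives, parallelSeconds);
    }
}
```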
• 6. Distributed Computing The key issues involved in this solution: • hardware failure • combining the data after analysis • network-associated problems
• 8. Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that, in the event of failure, another copy is available. The Hadoop Distributed Filesystem (HDFS) takes care of this problem. The second problem is solved by a simple programming model: MapReduce. Hadoop is the popular open-source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets.
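As a rough illustration of the replication idea, here is a minimal sketch using the Hadoop FileSystem API; the file path is hypothetical, and in practice the default replication factor (the dfs.replication property, 3 by default) is normally set cluster-wide in the HDFS configuration rather than in application code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.replication controls how many redundant copies of each block HDFS keeps
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/trades.csv");   // hypothetical file on the cluster
        // The replication factor of an existing file can also be changed after the fact
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}
```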
• 9. What Else is Hadoop? A reliable shared storage and analysis system. There are other subprojects of Hadoop that provide complementary services, or build on the core to add higher-level abstractions. The project includes these modules: • Hadoop Common: the common utilities that support the other Hadoop modules. • Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data. • Hadoop YARN: a framework for job scheduling and cluster resource management. • Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
• 10. Other Hadoop-related projects at Apache include: Ambari, Avro, Cassandra, Chukwa, HBase, ZooKeeper, Hive, Mahout, Spark, and Tez.
• 11. Hadoop Approach to Distributed Computing • A theoretical 1,000-CPU machine would cost a very large amount of money, far more than 1,000 single-CPU machines. • Hadoop ties these smaller, more reasonably priced machines together into a single cost-effective compute cluster. • Hadoop provides a simplified programming model that lets the user quickly write and test distributed systems, and it efficiently and automatically distributes data and work across the machines, exploiting the underlying parallelism of the CPU cores.
• 12. • MapReduce is a programming model. • Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. • MapReduce is an associated implementation for processing and generating large data sets.
• 13. The Programming Model Of MapReduce • Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. • The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.
• 14. • The Reduce function, also written by the user, accepts an intermediate key I and the set of values for that key. It merges these values together to form a possibly smaller set of values, as sketched in the word-count example below.
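To make the Map and Reduce signatures concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce API (the class names TokenizerMapper and IntSumReducer are illustrative): the mapper emits an intermediate (word, 1) pair for each word, and the reducer sums the values that the framework has grouped under the same word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: for every word in the input line, emit the intermediate pair (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the framework groups all values for the same key; sum them into one count.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```

Run over a text file, this produces one (word, total count) pair per distinct word.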
• 15. How MapReduce Works • A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. • The framework sorts the outputs of the maps, which are then input to the reduce tasks. • Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. • A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.
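A job is typically submitted through a small driver program; the sketch below uses the standard Job API to wire the illustrative word-count mapper and reducer from the previous example into a job and point it at input and output paths taken from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // the "unit of work" described above
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // The input is split into chunks for the map tasks; reduce output lands in the output directory
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```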
• 17. Fault Tolerance • There are two types of nodes that control the job execution process: tasktrackers and jobtrackers. • The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. • Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. • If a task fails, the jobtracker can reschedule it on a different tasktracker.