CHAPTER-1
HADOOP ECOSYSTEM 2.X.X
With the rapid evolution of Big Data, its processing frameworks have been evolving just as quickly. Hadoop (Hadoop 1.0) has progressed from a restricted processing model of batch-oriented MapReduce jobs to specialized and interactive processing models (Hadoop 2.0). With the advent of Hadoop 2.0, organizations can build data-crunching approaches within Hadoop that were not possible under Hadoop 1.0's architectural limitations. This chapter gives an insight into Hadoop 2.0 (YARN) and explains the need to switch from Hadoop 1.0 to Hadoop 2.0.
Evolution of Hadoop 2.0 (YARN) - Swiss Army Knife of Big Data
Since its introduction in 2005 to support distributed processing of large-scale data workloads across clusters through the MapReduce processing engine, Hadoop has undergone a great refurbishment over time. The result is a better, more advanced Hadoop framework that does not merely support MapReduce but also supports various other distributed processing models.
The huge data giants on the web such as Yahoo and Facebook, which adopted Apache Hadoop early, had to depend on the pairing of Hadoop HDFS with its resource management environment and MapReduce programming. Together, these technologies enabled users to manage processes and store huge amounts of semi-structured, structured or unstructured data within Hadoop clusters. Nevertheless, there were certain intrinsic drawbacks to the Hadoop-MapReduce pairing. For instance, these and other users of Apache Hadoop struggled to keep pace with the flood of information they were collecting online because of MapReduce's batch-processing design.
Figure: - 1
What Is Hadoop?
To get started, let’s look at a simple definition of the tool that the utilities we’ll discuss support.
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is an open-source data management system with scale-out storage and distributed processing. It's designed with big data in mind and is ideal for large amounts of information.
The Hadoop Ecosystem
The Hadoop Ecosystem consists of tools for data analysis, moving large amounts of unstructured and structured data, data processing, querying data, storing data, and other similar data-oriented processes. These utilities each serve a unique purpose and are geared toward the different tasks, and the different user roles, that interact with Hadoop.
Data Storage
HDFS (Hadoop Distributed File System) is the key component that makes up Hadoop. HDFS is used to store and access huge files based on a client/server architecture. This system also enables the distribution and storage of data across Hadoop clusters.
HBase (Hadoop Database) is a columnar database built on top of HDFS. Being a file system, HDFS lacks random read and write capability. This is where HBase steps in, providing fast record lookups (and updates) in large tables.
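As a small illustration of that random-access capability (the table name "weather", column family "d" and row key below are assumptions for illustration, not part of this report), the HBase Java client can write and read back a single row without scanning the whole table:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("weather"))) {
            // Random write: one cell in the row keyed by a date
            Put put = new Put(Bytes.toBytes("2015-01-01"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("maxTemp"), Bytes.toBytes("74.5"));
            table.put(put);
            // Random read: fetch the same row back directly by key
            Result r = table.get(new Get(Bytes.toBytes("2015-01-01")));
            System.out.println(Bytes.toString(
                    r.getValue(Bytes.toBytes("d"), Bytes.toBytes("maxTemp"))));
        }
    }
}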
Data Processing
MapReduce is a parallel data processing framework that runs over clusters. Using MapReduce can save data analysts a lot of time; for example, if it takes a normal relational database around 20 hours to process a large data set, MapReduce might get everything done in only around three minutes.
YARN (Yet Another Resource Negotiator) is a resource manager. It is often described as the second generation of MapReduce and a critical advancement over Hadoop 1. YARN plays the role of an operating system for the cluster: its job is to manage and monitor workloads, serve multiple clients, and enforce security controls. In addition, YARN supports new processing models that MapReduce does not.
Data Access
Hive provides a new kind of structured query language (HiveQL). It was created to help people who are familiar with traditional databases and SQL to leverage Hadoop and MapReduce.
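As an illustration of how a SQL-familiar user works with Hive, the sketch below runs a HiveQL query from Java through the HiveServer2 JDBC driver; the connection URL, table and column names are assumptions for illustration only:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 is assumed to be listening on its default port 10000
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {
            // An ordinary SQL-style aggregation; Hive compiles it into distributed jobs
            ResultSet rs = stmt.executeQuery(
                    "SELECT year, MAX(max_temp) FROM weather GROUP BY year");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}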
Pig serves the purpose of analyzing large data sets. Pig is made up of two components: firstly, the platform that executes Pig programs; secondly, a powerful and simple scripting language called Pig Latin, which is used to write those programs.
Mahout provides a library of the most popular machine learning algorithms written in Java that
supports collaborative filtering, clustering, and classification.
Avro is a data serialization system. It uses JSON for defining data types and protocols to support data-driven applications. Avro provides simple integration with many different languages, with the aim of allowing Hadoop applications to be written in languages other than Java (e.g. Python, C++).
Sqoop (SQL + Hadoop = Sqoop) is a command line interface application, which helps transfer
data between Hadoop and relational databases (e.g. MySQL or Oracle) or mainframes.
Data Management
Oozie is a workflow scheduler for Hadoop. Oozie streamlines the process of creating workflows and managing coordination jobs among Hadoop applications such as MapReduce, Pig, Sqoop, Hive etc. The main responsibilities of Oozie are: firstly, to define a sequence of actions to be executed; secondly, to place triggers for those actions.
Chukwa is another framework built on top of HDFS and MapReduce. Its purpose is to provide a dynamic and powerful data collection system. Chukwa is capable of monitoring, analyzing and presenting results to get the most out of the collected data.
Flume is also a scalable and reliable system for collecting and moving cluster logs from various sources to a centralized store, much like Chukwa. However, there are some differences: in Flume, chunks of data are transferred from node to node in a store-and-forward manner, while in Chukwa the agent on each machine determines what data to send.
ZooKeeper is a coordination service for distributed systems. It provides a very simple programming interface and helps reduce management complexity by providing services such as configuration, distributed synchronization, naming, group services etc.
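A brief sketch of that programming interface is shown below (the znode path and data are hypothetical, and a ZooKeeper server is assumed to be running on localhost:2181):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigExample {
    public static void main(String[] args) throws Exception {
        // Connect with a 3-second session timeout; the watcher simply ignores events here
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});
        // Publish a piece of shared configuration as a persistent znode
        zk.create("/app-config", "batch.size=500".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // Any node in the cluster can now read the same value back
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}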
CHAPTER-2
HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
The Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike some other distributed systems, HDFS is highly fault-tolerant and designed for low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to protect the system from possible data loss in case of failure. HDFS also makes data available to applications for parallel processing.
Features of HDFS
 It is suitable for distributed storage and processing.
 Hadoop provides a command interface to interact with HDFS.
 The built-in servers of the namenode and datanodes help users easily check the status of the cluster.
 Streaming access to file system data.
 HDFS provides file permissions and authentication.
HDFS Architecture
Namenode
The namenode is commodity hardware that runs the GNU/Linux operating system and the namenode software. The namenode software can run on ordinary commodity hardware. The system hosting the namenode acts as the master server, and it performs the following tasks:
 Manages the file system namespace.
 Regulates clients' access to files.
 Executes file system operations such as renaming, closing, and opening files and directories.
Figure: - 2
Datanode
The datanode is commodity hardware running the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system.
 Datanodes perform read-write operations on the file system, as per client requests.
 They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
Block
Generally, user data is stored in the files of HDFS. A file in the file system is divided into one or more segments and stored on individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB (128 MB from Hadoop 2.x onwards), but it can be changed as needed in the HDFS configuration.
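As a hedged sketch of how a non-default block size can be requested for a single file through the HDFS Java API (the path and the 128 MB figure are illustrative; a cluster-wide default would instead be set via dfs.blocksize in hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/large-log.txt");
        long blockSize = 128L * 1024 * 1024;                // 128 MB blocks for this file only
        try (FSDataOutputStream out = fs.create(
                file,
                true,                                       // overwrite if the file exists
                conf.getInt("io.file.buffer.size", 4096),   // write buffer size
                fs.getDefaultReplication(file),             // keep the cluster's replication factor
                blockSize)) {
            out.writeBytes("one example record\n");
        }
    }
}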
Goals of HDFS
 Fault detection and recovery: Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
 Huge datasets: HDFS should scale to hundreds of nodes per cluster to manage applications having huge datasets.
 Hardware at data: A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.
CHAPTER-3
MAPREDUCE
MapReduce is a framework with which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers is
sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the
application to run over hundreds, thousands, or even tens of thousands of machines in a cluster
is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.
The Algorithm
 Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
 A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
o Map stage : The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
o Reduce stage : This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
 During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
 The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
 Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
 After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
Figure: - 3
Inputs and Outputs (Java Perspective)
The MapReduce framework operates on <key, value> pairs, that is, the framework views the
input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the
output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. The input and output types of a MapReduce job are: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).
         Input               Output
Map      <k1, v1>            list(<k2, v2>)
Reduce   <k2, list(v2)>      list(<k3, v3>)
Figure: - 4
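To make these types concrete, here is a minimal, illustrative word-count sketch (the class names are ours, not taken from this report): the mapper turns <byte offset, line> into <word, 1> pairs, and the reducer receives each word together with the list of its counts.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountTypes {

    // map: <k1 = byte offset, v1 = line of text>  ->  list(<k2 = word, v2 = 1>)
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce: <k2 = word, list(v2)>  ->  list(<k3 = word, v3 = total count>)
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

Note that Text and IntWritable implement WritableComparable and Writable respectively, which is why they can be used as key and value types here.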
Terminology
 PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.
 Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.
 NameNode - Node that manages the Hadoop Distributed File System (HDFS).
 DataNode - Node where the data is present in advance, before any processing takes place.
 MasterNode - Node where the JobTracker runs and which accepts job requests from clients.
 SlaveNode - Node where the Map and Reduce programs run.
 JobTracker - Schedules jobs and tracks the jobs assigned to the TaskTracker.
 TaskTracker - Tracks tasks and reports status to the JobTracker.
 Job - An execution of a Mapper and Reducer across a dataset.
 Task - An execution of a Mapper or a Reducer on a slice of data.
 Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.
CHAPTER-4
PROJECT
Aim:- Temperature Data Analysis.
The National Climatic Data Center (NCDC) is responsible for preserving, monitoring, assessing, and providing public access to weather data. A log file is created to store all this information. This file includes various types of climate-related data such as temperature, wind speed and direction, information related to cyclones, and weather changes; the temperature of each day is also noted.
Through this project we analyze the temperature variation over whole months and years. With the help of the MapReduce technique we can calculate the highest and lowest temperature, or the hottest and coolest day, of a month or year.
After going through the word-count MapReduce guide, you now have a basic idea of how a MapReduce program works. So, let us look at a more complex MapReduce program on a weather dataset. Here I am using a dataset for the year 2015 for Austin, Texas. We will do analytics on the dataset and classify each day as a hot day or a cold day depending on the temperature recorded by NCDC.
NCDC gives us all the weather data we need for this MapReduce project.
The dataset we will be using looks like the snapshot below.
Figure: - 5
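Since the dataset snapshot itself is not reproduced here, the mapper below is only a hedged sketch: it assumes each line begins with a date followed by a maximum temperature in °F, separated by whitespace, and uses illustrative thresholds of 85 °F for a hot day and 40 °F for a cold day. The real column positions and thresholds must be taken from the actual NCDC file.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical record layout: "<yyyy-MM-dd> <maxTempF> ..." separated by whitespace.
public class TemperatureMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final float HOT_DAY = 85.0f;   // assumed threshold (°F)
    private static final float COLD_DAY = 40.0f;  // assumed threshold (°F)

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().trim().split("\\s+");
        if (fields.length < 2) {
            return;                                // skip malformed lines
        }
        String date = fields[0];
        float maxTemp;
        try {
            maxTemp = Float.parseFloat(fields[1]);
        } catch (NumberFormatException e) {
            return;                                // skip header or bad records
        }
        if (maxTemp > HOT_DAY) {
            context.write(new Text("Hot Day " + date), new Text(String.valueOf(maxTemp)));
        } else if (maxTemp < COLD_DAY) {
            context.write(new Text("Cold Day " + date), new Text(String.valueOf(maxTemp)));
        }
    }
}

In this sketch the classification is done entirely in the map phase, so the job can be run map-only (zero reducers) and the mapper output is the final result.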
Step 1:- Import the project into the Eclipse IDE.
Step 2:- Once the project has no errors, we export it as a jar file, the same as we did in the word-count MapReduce guide. Right-click on the project, click Export, and select JAR file.
Figure: - 6
Give the path where we want to save the file.
Figure: - 7
Select the main class by clicking Browse.
Figure: - 8
Click Finish to complete the export.
Figure: - 9
Step 3:- Before running the MapReduce program to check what it does, make sure that your cluster is up and all the Hadoop daemons are running.
Figure: - 10
Step 4:- Put the input file on HDFS.
Command :- hdfs dfs -put download/inputfile.txt /
Figure: - 11
Step 5:- Run the jar file.
Command :- hadoop jar temp.jar /wathear-data.txt /output
Figure: - 12
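The command above assumes that the exported jar's manifest points at a driver class (that is what selecting the main class during the Eclipse export does). A hedged sketch of such a driver, reusing the mapper sketched earlier, could look like this; the class and job names are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TemperatureDriver {
    public static void main(String[] args) throws Exception {
        // args[0] = input file on HDFS, args[1] = output directory (must not exist yet)
        Job job = Job.getInstance(new Configuration(), "temperature analysis");
        job.setJarByClass(TemperatureDriver.class);
        job.setMapperClass(TemperatureMapper.class);   // mapper from the earlier sketch
        job.setNumReduceTasks(0);                      // map-only job: mapper output is the result
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}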
References:
https://www.ncdc.noaa.gov/
https://www.tutorialspoint.com/
http://hadoop.apache.org/