MANAGING BIG DATA WITH HADOOP
Presented by:
Nalini Mehta
Student (MLVTEC Bhilwara)
Email: nalinimehta52@gmail.com
Introduction 
Big Data:
• Big data is a term used to describe voluminous amounts of unstructured and semi-structured data.
• It is data that would take too much time and cost too much money to load into a relational database for analysis.
• Big data does not refer to any specific quantity; the term is often used when speaking about petabytes and exabytes of data.
General framework of Big Data Networking
• The driving force behind the implementation of Big Data is both infrastructure and analytics, which together constitute the software.
• Hadoop is the Big Data management software used to distribute, catalogue, manage, and query data across multiple horizontally scaled server nodes.
Managing Big Data
Overview of Hadoop 
• Hadoop is a platform for processing large amounts of data in a distributed fashion.
• It provides a scheduling and resource management framework to execute the map and reduce phases in a cluster environment.
• The Hadoop Distributed File System (HDFS) is Hadoop’s data storage layer, designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel.
Hadoop Cluster 
• DataNode: The DataNodes are the repositories for the data; they consist of multiple smaller database infrastructures.
• Client: The client represents the user interface to the big data implementation and query engine. The client could be a server or a PC with a traditional user interface.
• NameNode: The NameNode is equivalent to an address router that tracks the location of every DataNode.
• Job Tracker: The job tracker is the software tracking mechanism that distributes search queries across multiple nodes and aggregates the results for the client's analysis.
Apache Hadoop 
• Apache Hadoop is an open-source distributed software platform for storing and processing data.
• It is a framework for running applications on large clusters built of commodity hardware.
• A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that, in the event of failure, another copy is available. The Hadoop Distributed File System (HDFS) takes care of this problem.
• MapReduce is a simple programming model for processing and generating large data sets.
What is MapReduce?
• MapReduce is a programming model.
• Programs written in this style are automatically parallelized and executed on a large cluster of commodity machines.
• Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
MapReduce
MAP: a map function that processes a key/value pair to generate a set of intermediate key/value pairs.
REDUCE: a reduce function that merges all intermediate values associated with the same intermediate key.
The Programming Model of MapReduce
• Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function.
• The Reduce function, also written by the user, accepts an intermediate key and a set of values for that key. It merges these values together to form a possibly smaller set of values.
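The Map, group, and Reduce steps described above can be sketched as a word count that runs in a single process. This is only an illustration under stated assumptions: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical and are not part of Hadoop's API, and a real cluster would run many map and reduce tasks in parallel on different machines.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in the input value."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: merge all values for one intermediate key into a single total."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical single-process driver standing in for the MapReduce library."""
    # Group step: collect every intermediate value under its intermediate key.
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    # Call reduce once per intermediate key.
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))


documents = [("d1", "big data with hadoop"), ("d2", "hadoop stores big data")]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
```

Here `word_counts` maps each distinct word to the number of times it appears across both input documents.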
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
• Apache Hadoop comes with a distributed file system called HDFS, which stands for Hadoop Distributed File System.
• HDFS is designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to this information.
• HDFS is designed for scalability and fault tolerance, and provides APIs for MapReduce applications to read and write data in parallel.
• The capacity and performance of HDFS can be scaled by adding DataNodes, while a single NameNode manages data placement and monitors server availability.
Assumptions and Goals 
1. Hardware Failure 
• An HDFS instance may consist of hundreds or thousands of server machines, 
each storing part of the file system’s data. 
• There are a huge number of components, and each component has a non-trivial
probability of failure.
• Detection of faults and quick, automatic recovery from them is a core 
architectural goal of HDFS. 
2. Streaming Data Access 
• Applications that run on HDFS need streaming access to their data sets. 
• HDFS is designed more for batch processing than for interactive use by
users.
• The emphasis is on high throughput of data access rather than low latency of 
data access. 
3. Large Data Sets 
• A typical file in HDFS is gigabytes to terabytes in size. 
• Thus, HDFS is tuned to support large files. 
• It should provide high aggregate data bandwidth and scale to hundreds of 
nodes in a single cluster.
4. Simple Coherency Model
• HDFS applications need a write-once-read-many access model for files. 
• A file once created, written, and closed need not be changed. 
• This assumption simplifies data coherency issues and enables high 
throughput data access. 
5. “Moving Computation is Cheaper than Moving 
Data” 
• A computation requested by an application is much more efficient if it is 
executed near the data it operates on when the size of the data set is huge. 
• This minimizes network congestion and increases the overall throughput of 
the system. 
6. Portability across Heterogeneous Hardware and 
Software Platforms 
• HDFS has been designed to be easily portable from one platform to 
another. This facilitates widespread adoption of HDFS as a platform of 
choice for a large set of applications.
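The "moving computation is cheaper than moving data" goal can be illustrated with a small placement sketch: given the nodes that hold a block, prefer running the task on one of them. The function `pick_node` and the node and block names are hypothetical and serve only as an illustration; this is not how Hadoop's scheduler is implemented.

```python
def pick_node(block_id, block_locations, free_nodes):
    """Return (node, is_local), preferring a free node that already stores the block."""
    for node in block_locations.get(block_id, []):
        if node in free_nodes:
            return node, True  # computation moves to the data
    # No free holder: fall back to any free node; the block must cross the network.
    fallback = sorted(free_nodes)[0] if free_nodes else None
    return fallback, False


# Assumed example layout: which nodes hold which block, and which nodes are idle.
block_locations = {"blk_1": ["node-a", "node-b"], "blk_2": ["node-c"]}
free_nodes = {"node-b", "node-d"}

local_choice = pick_node("blk_1", block_locations, free_nodes)
remote_choice = pick_node("blk_2", block_locations, free_nodes)
```

In this sketch, `blk_1` is processed on a node that already holds it, while `blk_2` has no free holder, so its data would have to be transferred to another node first.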
Concepts of HDFS:
NameNode and DataNodes
• An HDFS cluster has two types of node operating in a master-slave pattern: a NameNode (the master) and a number of DataNodes (slaves).
• The NameNode manages the file system namespace. It maintains the file system tree and the metadata for all the files and directories in the tree.
• Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes.
• The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories.
• DataNodes store and retrieve blocks when they are told to (by clients or the NameNode), and they report back to the NameNode periodically with lists of the blocks they are storing.
• The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
• Without the NameNode, the file system cannot be used. In fact, if the machine running the NameNode were destroyed, all the files on the file system would be lost, since there would be no way of knowing how to reconstruct the files from the blocks on the DataNodes.
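The division of labour described above can be sketched with two plain dictionaries: one standing in for the DataNodes (block storage) and one for the NameNode (the ordered block list per file). The helper names `write_file` and `read_file`, the round-robin placement, and the tiny block size are assumptions made for illustration, not HDFS behaviour.

```python
def write_file(name, data, block_size, datanodes, namenode):
    """Split data into fixed-size blocks, store them on DataNodes, record the layout."""
    node_names = sorted(datanodes)
    layout = []
    for offset in range(0, len(data), block_size):
        index = offset // block_size
        block_id = f"{name}#{index}"
        node = node_names[index % len(node_names)]  # assumed round-robin placement
        datanodes[node][block_id] = data[offset:offset + block_size]
        layout.append((block_id, node))
    namenode[name] = layout  # metadata only: which block lives where, in order


def read_file(name, datanodes, namenode):
    """Reassemble a file by following the NameNode's ordered block list."""
    return b"".join(datanodes[node][block_id] for block_id, node in namenode[name])


datanodes = {"dn1": {}, "dn2": {}, "dn3": {}}
namenode = {}
write_file("log.txt", b"abcdefghij", 4, datanodes, namenode)
restored = read_file("log.txt", datanodes, namenode)
```

If the `namenode` dictionary were lost, the blocks would still sit in `datanodes`, but nothing would record which blocks belong to `log.txt` or in what order, which is the failure mode the last bullet describes.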
File System Namespace
• HDFS supports a traditional hierarchical file organization. A user or an application can create and remove files, move a file from one directory to another, rename a file, create directories, and store files inside these directories.
• The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode.
• An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
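A toy version of such a namespace might look like the following. The class `Namespace`, its method names, and the default replication factor of 3 are assumptions made for this sketch; it keeps only a path-to-metadata table and a log of changes, and it does not move a directory's children on rename.

```python
class Namespace:
    """Toy hierarchical namespace: path -> metadata, plus a log of every change."""

    def __init__(self):
        self.entries = {"/": {"type": "dir"}}
        self.log = []  # each namespace change is recorded here

    def _require_parent_dir(self, path):
        parent = path.rsplit("/", 1)[0] or "/"
        if self.entries.get(parent, {}).get("type") != "dir":
            raise FileNotFoundError(f"no such directory: {parent}")

    def mkdir(self, path):
        self._require_parent_dir(path)
        self.entries[path] = {"type": "dir"}
        self.log.append(("mkdir", path))

    def create(self, path, replication=3):
        # The per-file replication factor is stored alongside the file's entry.
        self._require_parent_dir(path)
        self.entries[path] = {"type": "file", "replication": replication}
        self.log.append(("create", path, replication))

    def rename(self, src, dst):
        # Simplification: intended for files only; children of a directory are not moved.
        self._require_parent_dir(dst)
        self.entries[dst] = self.entries.pop(src)
        self.log.append(("rename", src, dst))


ns = Namespace()
ns.mkdir("/logs")
ns.create("/logs/a.txt", replication=2)
ns.mkdir("/archive")
ns.rename("/logs/a.txt", "/archive/a.txt")
```

After these calls, the file's entry lives under `/archive` with its replication factor intact, and `ns.log` holds the four recorded changes in order.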
Data Replication
• The blocks of a file are replicated for fault tolerance.
• The block size and replication factor are configurable per file.
• The NameNode makes all decisions regarding replication of blocks.
• A block report contains a list of all blocks on a DataNode.
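One way to picture how block reports feed replication decisions is to count, per block, how many DataNodes report holding it and compare that count with the block's replication factor. The function `under_replicated` and the sample data are hypothetical; this is a sketch of the idea, not the NameNode's actual logic.

```python
def under_replicated(block_reports, replication_factor):
    """Return {block_id: missing_copies} computed from per-DataNode block reports."""
    counts = {}
    for blocks in block_reports.values():
        for block_id in blocks:
            counts[block_id] = counts.get(block_id, 0) + 1
    # Compare reported copies with the wanted replication factor of each block.
    return {
        block_id: wanted - counts.get(block_id, 0)
        for block_id, wanted in replication_factor.items()
        if counts.get(block_id, 0) < wanted
    }


# Assumed example: each DataNode reports the list of blocks it currently stores.
block_reports = {
    "dn1": ["blk_1", "blk_2"],
    "dn2": ["blk_1"],
    "dn3": ["blk_1", "blk_2"],
}
replication_factor = {"blk_1": 3, "blk_2": 3}
missing = under_replicated(block_reports, replication_factor)
```

In this example `blk_2` is reported by only two DataNodes, so one more copy would need to be created to restore its replication factor.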
Hadoop as a Service in the Cloud (HaaS):
• Hadoop is economical for large-scale, data-driven companies such as Yahoo or Facebook.
• The ecosystem around Hadoop now offers tools such as Hive and Pig that make Big Data processing accessible by focusing on what to do with the data and avoiding the complexity of programming.
• Consequently, a minimal Hadoop as a Service offering provides a managed, ready-to-use Hadoop cluster, with no need to install or configure Hadoop services such as the Job Tracker, Task Tracker, NameNode, or DataNode on any cluster node.
• Depending on the level of service, abstraction, and tools provided, Hadoop as a Service (HaaS) can be placed in the cloud stack as a Platform as a Service or Software as a Service solution, between infrastructure services and cloud clients.
Limitations:
Hadoop places several requirements on the network:
• Data locality
  • The distributed Hadoop nodes running jobs in parallel cause east-west network traffic that can be adversely affected by suboptimal network connectivity.
  • The network should provide high bandwidth, low latency, and any-to-any connectivity between the nodes for optimal Hadoop performance.
• Scale out
  • Deployments might start with a small cluster and then scale out over time as the customer realizes initial success and its needs grow.
  • The underlying network architecture should also scale seamlessly with Hadoop clusters and should provide predictable performance.
Conclusion
• The growth of communication and connectivity has led to the emergence of Big Data. Apache Hadoop is an open-source framework that has become a de facto standard for the big data platforms deployed today.
• To sum up, promising progress has been made in the area of Big Data, but much remains to be done. Almost all proposed approaches have been evaluated only at a limited scale, and further research is required for large-scale evaluations.
  • 1. MANAGING BIG DATA WITH HADOOP Presented by: Nalini Mehta Student(MLVTEC Bhilwara) Email: nalinimehta52@gmail.com
  • 2. Introduction Big Data: •Big data is a term used to describe the voluminous amount of unstructured and semi-structured data . •Data that would take too much time and cost too much money to load into a relational database for analysis. • Big data doesn't refer to any specific quantity, the term is often used when speaking about petabytes and exabytes of data.
  • 3.
  • 4. General framework of Big Data Networking  The driving force behind the implementation of Big data is both infrastructure and analytics which together constitutes the software.  Hadoop is the Big Data management software which is used to distribute, catalogue manage and query data across multiple, horizontally scaled server nodes.
  • 6. Overview of Hadoop • Hadoop is a platform for processing large amount of data in distributed fashion. • It provides scheduling and resource management framework to execute the map and to reduce phases in the cluster environment. • Hadoop Distributed File is Hadoop’s data storage layer which is designed to handle the petabytes and exabytes of data distributed over multiple nodes in parallel.
  • 7. Hadoop Cluster • DataNode- The DataNodes are the repositories for the data, and it consist of multiple smaller database infrastructures. • Client- The client represents the user interface to the big data implementation and query engine. The client could be a server or PC with a traditional user interface. • NameNode- the NameNode is equivalent to the address router and location of every data node. • Job Tracker- The job tracker represents the software tracking mechanism to distribute and aggregate search queries across multiple nodes for ultimate client analysis.
  • 8. Apache Hadoop • Apache Hadoop is an open source distributed software platform for storing and processing data. • It is a framework for running applications on large cluster built of commodity hardware. • A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. The Hadoop Distributed File system (HDFS), takes care of this problem. • MapReduce is a simple programming model for processing and generating large data sets.
  • 9. What is MapReduce?  MapReduce is a programming model .  Programs written automatically parallelized and executed on a large cluster of commodity machines.  Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pair, and a reduce function that merges all intermediate values associated with the same intermediate key. MapReduce MAP map function that processes a key/value pair to generate a set of intermediate key/value pairs REDUCE and a reduce function that merges all intermediate values associated with the same intermediate key.
  • 10. The Programming Model Of MapReduce  Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function.
  • 11.  The Reduce function, also written by the user, accepts an intermediate key and a set of values for that key. It merges together these values to form a possibly smaller set of values.
  • 12. HADOOP DISTRIBUTED FILE SYSTEM (HDFS)  Apache Hadoop comes with a distributed file system called HDFS, which stands for Hadoop Distributed File System.  HDFS is designed to hold very large amounts of data (terabytes or even petabytes), and provide high-throughput access to this information.  HDFS is designed for scalability and fault tolerance and provides APIs MapReduce applications to read and write data in parallel.  The capacity and performance of HDFS can be scaled by adding Data Nodes, and a single Name Node mechanisms that manages data placement and monitor server availability.
  • 13. Assumptions and Goals 1. Hardware Failure • An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. • There are a huge number of components and that each component has a non-trivial probability of failure. • Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. 2. Streaming Data Access • Applications that run on HDFS need streaming access to their data sets. • HDFS is designed more for batch processing rather than interactive use by users. • The emphasis is on high throughput of data access rather than low latency of data access. 3. Large Data Sets • A typical file in HDFS is gigabytes to terabytes in size. • Thus, HDFS is tuned to support large files. • It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster.
  • 14. 4. Simple coherency model • HDFS applications need a write-once-read-many access model for files. • A file once created, written, and closed need not be changed. • This assumption simplifies data coherency issues and enables high throughput data access. 5. “Moving Computation is Cheaper than Moving Data” • A computation requested by an application is much more efficient if it is executed near the data it operates on when the size of the data set is huge. • This minimizes network congestion and increases the overall throughput of the system. 6. Portability across Heterogeneous Hardware and Software Platforms • HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.
  • 16. NameNode and DataNodes
    • An HDFS cluster has two types of node operating in a master-slave pattern: a NameNode (the master) and a number of DataNodes (slaves).
    • The NameNode manages the file system namespace. It maintains the file system tree and the metadata for all the files and directories in the tree.
    • Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes.
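How a file is split into blocks can be sketched in a few lines. The 128 MB block size here is only an illustrative value (the slides do not state one), and `split_into_blocks` is a hypothetical helper, not an HDFS API. Every block is full-sized except possibly the last.

```python
MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int) -> list:
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    sizes = []
    remaining = file_size
    while remaining > 0:
        # The final block holds only whatever data is left.
        sizes.append(min(block_size, remaining))
        remaining -= block_size
    return sizes


# A 300 MB file with an assumed 128 MB block size.
blocks = split_into_blocks(300 * MB, 128 * MB)
print([size // MB for size in blocks])
```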
  • 17. NameNode and DataNodes (continued)
    • The NameNode executes file system namespace operations like opening, closing, and renaming files and directories.
    • DataNodes store and retrieve blocks when they are told to (by clients or the NameNode), and they report back to the NameNode periodically with lists of the blocks that they are storing.
    • The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
    • Without the NameNode, the file system cannot be used. In fact, if the machine running the NameNode were destroyed, all the files on the file system would be lost, since there would be no way of knowing how to reconstruct the files from the blocks on the DataNodes.
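The division of knowledge between the NameNode and the DataNodes can be sketched as two lookups. In this toy model (all names and block IDs are made up), the DataNodes' block reports say where each block lives, while only the NameNode's namespace says which blocks make up a file and in what order. Losing the namespace leaves the blocks intact but unusable.

```python
from collections import defaultdict


def build_block_map(block_reports):
    """Turn {datanode: [block ids]} reports into {block id: [datanodes]}."""
    block_map = defaultdict(list)
    for node, blocks in block_reports.items():
        for block_id in blocks:
            block_map[block_id].append(node)
    return {block_id: sorted(nodes) for block_id, nodes in block_map.items()}


def locate_file(namespace, block_map, path):
    """List each block of a file, in order, with the DataNodes that hold it.

    namespace maps a path to its ordered block ids (NameNode metadata).
    """
    return [(block_id, block_map.get(block_id, [])) for block_id in namespace[path]]


# Hypothetical block reports and namespace entry.
reports = {
    "dn1": ["blk_1", "blk_2"],
    "dn2": ["blk_2", "blk_3"],
    "dn3": ["blk_1", "blk_3"],
}
namespace = {"/data/log.txt": ["blk_1", "blk_2", "blk_3"]}

block_map = build_block_map(reports)
for block_id, nodes in locate_file(namespace, block_map, "/data/log.txt"):
    print(block_id, nodes)
```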
  • 18. File System Namespace
    • HDFS supports a traditional hierarchical file organization. A user or an application can create and remove files, move a file from one directory to another, rename a file, create directories, and store files inside these directories.
    • The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode.
    • An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
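The namespace behaviour described here can be modelled with a small in-memory class. This is a toy sketch under stated assumptions, not the NameNode's real data structures: the class `Namespace`, its methods, and the `edit_log` list are hypothetical, and the default replication factor of 3 is used only as an illustrative value. Each change to the namespace is appended to a log, mirroring the idea that the NameNode records every change.

```python
class Namespace:
    """Toy hierarchical namespace that records every change it makes."""

    def __init__(self):
        self.dirs = {"/"}
        self.files = {}      # path -> replication factor
        self.edit_log = []   # ordered record of namespace changes

    @staticmethod
    def _parent(path):
        return path.rsplit("/", 1)[0] or "/"

    def mkdir(self, path):
        if self._parent(path) not in self.dirs:
            raise FileNotFoundError(self._parent(path))
        self.dirs.add(path)
        self.edit_log.append(("mkdir", path))

    def create(self, path, replication=3):
        if self._parent(path) not in self.dirs:
            raise FileNotFoundError(self._parent(path))
        self.files[path] = replication
        self.edit_log.append(("create", path, replication))

    def rename(self, src, dst):
        if self._parent(dst) not in self.dirs:
            raise FileNotFoundError(self._parent(dst))
        self.files[dst] = self.files.pop(src)
        self.edit_log.append(("rename", src, dst))

    def remove(self, path):
        del self.files[path]
        self.edit_log.append(("remove", path))


ns = Namespace()
ns.mkdir("/logs")
ns.create("/logs/a.txt", replication=2)
ns.rename("/logs/a.txt", "/logs/b.txt")
print(ns.files)
print(ns.edit_log)
```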
  • 19. Data Replication
    • The blocks of a file are replicated for fault tolerance.
    • The block size and replication factor are configurable per file.
    • The NameNode makes all decisions regarding replication of blocks.
    • A block report contains a list of all blocks on a DataNode.
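One way to picture the NameNode's replication decisions is as a comparison between block reports and target replication factors. The sketch below is a simplification with hypothetical names (`count_replicas`, `replication_work`) and made-up block IDs; it only computes how many extra copies each block needs and does not model where HDFS would place them.

```python
from collections import Counter


def count_replicas(block_reports):
    """Count live replicas per block from {datanode: [block ids]} reports."""
    counts = Counter()
    for blocks in block_reports.values():
        counts.update(set(blocks))  # a node counts at most once per block
    return counts


def replication_work(block_reports, targets):
    """Return {block id: missing copies} for under-replicated blocks.

    targets maps each block id to its desired replication factor.
    """
    counts = count_replicas(block_reports)
    work = {}
    for block_id, target in targets.items():
        missing = target - counts.get(block_id, 0)
        if missing > 0:
            work[block_id] = missing
    return work


targets = {"b1": 3, "b2": 3}
reports = {"dn1": ["b1", "b2"], "dn2": ["b1"], "dn3": ["b1", "b2"]}
print(replication_work(reports, targets))

# If dn3 stops reporting, both blocks fall below their target.
del reports["dn3"]
print(replication_work(reports, targets))
```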
  • 20. Hadoop as a Service in the Cloud (HaaS)
    • Hadoop is economical for large-scale, data-driven companies like Yahoo or Facebook.
    • The ecosystem around Hadoop now offers tools such as Hive and Pig that make Big Data processing more accessible by letting users focus on what to do with the data rather than on the complexity of programming.
    • Consequently, a minimal Hadoop as a Service offering provides a managed, ready-to-use Hadoop cluster, with no need to install or configure Hadoop services such as the Job Tracker, Task Tracker, NameNode, or DataNode on any cluster node.
    • Depending on the level of service, abstraction, and tools provided, Hadoop as a Service (HaaS) can be placed in the cloud stack as a Platform as a Service or Software as a Service solution, between infrastructure services and cloud clients.
  • 21. Limitations
    Hadoop places several requirements on the network:
    • Data locality
      • Distributed Hadoop nodes running jobs in parallel cause east-west network traffic, which can be adversely affected by suboptimal network connectivity.
      • The network should provide high bandwidth, low latency, and any-to-any connectivity between the nodes for optimal Hadoop performance.
    • Scale out
      • Deployments might start with a small cluster and then scale out over time as the customer sees initial success and its needs grow.
      • The underlying network architecture should scale seamlessly with Hadoop clusters and provide predictable performance.
  • 22. Conclusion
    • The growth of communication and connectivity has led to the emergence of Big Data. Apache Hadoop is an open source framework that has become a de facto standard for the big data platforms deployed today.
    • To sum up, promising progress has been made in the area of Big Data, but much remains to be done. Almost all proposed approaches have been evaluated only at a limited scale, and further research is required for large-scale evaluations.
  • 23. References
    • White paper: Introduction to Big Data: Infrastructure and Network Consideration
    • MapReduce: Simplified Data Processing on Large Clusters, http://research.google.com/archive/mapreduce.html
    • White paper: Big Data Analytics [http:/Hadoop.intel.com]
    • The Hadoop Distributed File System: Architecture and Design, by Dhruba Borthakur
    • Big Data in the Enterprise, Cisco White Paper
    • Cloudera capacity planning recommendations: http://www.cloudera.com/blog/2010/08/HadoopHBase-capacity-planning/
    • Apache Hadoop Wiki Website: http://en.wikipedia.org/wiki/Apache-Hadoop
    • Towards a Big Data Reference Architecture [www.win.tue.nl/~gfletche/Maier_MSc_thesis.pdf]