Hadoop is a rapidly growing ecosystem of components, based on Google's MapReduce and file system work, for implementing MapReduce [3] algorithms in a scalable fashion, distributed on commodity hardware. Hadoop enables users to store and process large volumes of data and to analyze it in ways not previously possible with SQL-based approaches or less scalable solutions. Remarkable improvements in conventional compute and storage resources help make Hadoop clusters feasible for most organizations. This paper begins with a discussion of the evolution of Big Data [1][7][9] and its future based on Gartner's Hype Cycle. We explain how the Hadoop Distributed File System (HDFS) works and illustrate its architecture. Hadoop's MapReduce paradigm for distributing a task across multiple nodes is discussed with sample data sets, followed by how MapReduce and HDFS work when put together. Finally, the paper ends with a discussion of sample Big Data Hadoop use cases, which show how enterprises can gain a competitive advantage by being early adopters of big data analytics.
The Hadoop Distributed File System (HDFS) is the core component of the Apache Hadoop project. In HDFS, computation is carried out on the nodes where the relevant data is stored. Hadoop also implements a parallel computational paradigm named MapReduce. In this paper, we measure the performance of read and write operations in HDFS for both small and large files, using a Hadoop cluster with five nodes. The results indicate that HDFS performs well for files larger than the default block size and poorly for files smaller than the default block size.
Hadoop Distributed File System (HDFS) presentation 27-5-2015
1. In The Name of Allah The Most Merciful The
Most Gracious
• Name: Abdul Nasir Afridi
• Roll Number: 01
• Batch #10
• Subject: Advanced Database and Data Mining
2. Research Article
1. Performance Evaluation of Read and Write Operations in Hadoop Distributed File System.
Published: 2014 Sixth International Symposium on Parallel Architectures, Algorithms and Programming
Conference Paper: IEEE Computer Society
Authors: Dr. T. Ragunathan et al.
5. Research Article
H-Store: A High-Performance, Distributed Main Memory Transaction Processing System
Published: August 23-28, 2008, Auckland, New Zealand
Conference Paper: ACM 978-1-60558-306-8/08/08
Copyright 2008 VLDB Endowment
7. What is Apache Hadoop?
• Hadoop Distributed File System:
• HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.
• It is an open-source system developed by
Apache in Java.
• It is designed to handle very large data sets.
• It is designed to scale to very large clusters.
• It is designed to run on commodity hardware.
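Because the read/write study cited earlier found that performance hinges on file size relative to the block size, it helps to see how HDFS divides files into blocks. HDFS stores each file as a sequence of fixed-size blocks, 128 MB by default in Hadoop 2.x (64 MB in 1.x). The function below is an illustrative sketch of that calculation, not part of any Hadoop API:

```python
import math

# Default HDFS block size in Hadoop 2.x (configurable via dfs.blocksize).
BLOCK_SIZE = 128 * 1024 * 1024

def hdfs_block_count(file_size_bytes):
    """Number of HDFS blocks a file of the given size occupies."""
    if file_size_bytes == 0:
        return 0
    return math.ceil(file_size_bytes / BLOCK_SIZE)

# A 1 GB file spans 8 blocks; a 1 KB file still occupies a whole block
# entry, which is why many small files strain the NameNode's metadata
# and degrade HDFS performance.
print(hdfs_block_count(1024 * 1024 * 1024))  # 8
print(hdfs_block_count(1024))                # 1
```

Every block, however small its content, adds one entry to the NameNode's in-memory metadata, which is one reason HDFS favors files larger than the block size.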
12. Hadoop ecosystem
• Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system.
• It offers data replication.
• It offers automatic failover in the event of a crash.
• It automatically fragments storage over the cluster.
• It brings processing to the data.
• It supports large volumes of files, into the millions.
13. Hadoop ecosystem
• MapReduce:
• MapReduce is a software framework that serves as the compute layer of Hadoop.
• MapReduce jobs are divided into two parts. The map function divides a query into multiple parts and processes data at the node level.
• The reduce function aggregates the results of the map function to determine the answer to the query.
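The map/reduce division of labor can be sketched in plain Python with the classic word-count example. This simulates the map, shuffle/sort, and reduce phases on a single machine; it is an illustration of the paradigm, not Hadoop API code:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reduce: aggregate all intermediate counts for one key.
    return (word, sum(counts))

lines = ["big data big insight", "big cluster"]

# Map step: in Hadoop this runs in parallel, one mapper per input split.
mapped = [pair for line in lines for pair in map_phase(line)]

# Shuffle/sort step: group the intermediate pairs by key so that each
# reducer sees all values for one key together.
mapped.sort(key=itemgetter(0))

# Reduce step: one call per distinct key.
result = dict(reduce_phase(key, [count for _, count in group])
              for key, group in groupby(mapped, key=itemgetter(0)))
print(result)  # {'big': 3, 'cluster': 1, 'data': 1, 'insight': 1}
```

In a real cluster, the map calls run on the nodes holding each input split and only the grouped intermediate pairs travel over the network to the reducers.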
14. Hadoop ecosystem
• Hive:
Hive is a Hadoop-based data warehouse developed by Facebook. It allows users to write queries in SQL, which are then converted to map-reduce jobs. This allows SQL programmers with no map-reduce experience to use the warehouse, and it makes Hive easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.
15. Hadoop ecosystem
• Pig:
Pig Latin is a Hadoop-based language developed at Yahoo Research.
It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).
Pig is a high-level language for building map-reduce programs for Hadoop, thus simplifying the use of map-reduce. It is a data flow language that provides high-level commands.
17. Hadoop ecosystem
• HBase:
• HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop.
• It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts, and deletes.
• eBay and Facebook use HBase heavily.
18. Hadoop ecosystem
• Flume:
• Flume is a framework for populating Hadoop with data.
• Agents are placed throughout one's IT infrastructure (inside web servers, application servers, and mobile devices, for example) to collect data and integrate it into Hadoop.
19. Hadoop ecosystem
• Oozie:
• Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages (such as map-reduce, Pig, and Hive) and then intelligently links them to one another.
• Oozie allows users to specify, for example, that a particular query is only to be initiated after the specified previous jobs on which it relies for data are completed.
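Oozie workflows are defined in XML, with each action declaring where control flows on success and on failure. As a rough illustration, a minimal workflow running a single Pig job might look like the fragment below; the workflow name, action name, and script name are hypothetical, and the `${jobTracker}`/`${nameNode}` parameters would be supplied by the job's properties file:

```xml
<!-- Hypothetical minimal workflow: run one Pig job, then succeed or fail. -->
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="clean-data"/>
    <action name="clean-data">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>clean.pig</script>
        </pig>
        <!-- Control flow: continue only after this job completes. -->
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Pig job failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The `ok`/`error` transitions are how Oozie expresses the "only after its prerequisites complete" dependency described above.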
20. Hadoop ecosystem
• Whirr:
• Whirr is a set of libraries that allows users to easily spin up Hadoop clusters on top of Amazon EC2, Rackspace, or any virtual infrastructure.
• It supports all major virtualized infrastructure vendors on the market.
21. Hadoop ecosystem
• Avro:
• Avro is a data serialization system that allows for encoding the schema of Hadoop files.
• It is adept at parsing data and performing remote procedure calls.
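Avro schemas are declared in JSON. As an illustration, a record schema for a hypothetical log event could look like the following (the record and field names are invented for the example):

```json
{
  "type": "record",
  "name": "LogEvent",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "host", "type": "string"},
    {"name": "message", "type": "string"}
  ]
}
```

Because the schema travels with the data file, any Avro reader can parse records without out-of-band schema coordination.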
22. Hadoop ecosystem
• Mahout:
• Mahout is a data-mining library.
• It takes the most popular data-mining algorithms for performing clustering, regression, and statistical modeling and implements them using the map-reduce model.
24. Hadoop ecosystem
• Sqoop:
• Sqoop is a connectivity tool for moving data from non-Hadoop data stores, such as relational databases and data warehouses, into Hadoop.
• It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata, or other relational databases to the target.
28. Big data
Big data is being generated by everything around us at all times.
Every digital process and social media exchange produces it.
Systems, sensors, and mobile devices transmit it.
Big data arrives from multiple sources at an alarming velocity, volume, and variety.
To extract meaningful value from big data, you need optimal processing power, analytics capabilities, and skills.
36. Scheduling
• By default:
▫ Hadoop uses FIFO to schedule jobs.
▫ There is no preemption once a job is running.
• Hadoop 2.x introduces fair scheduling: resources are assigned to applications such that all applications get, on average, an equal share of resources over time.
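The fair-scheduling idea, where each running application receives an equal share on average and any unused share flows to applications that still have demand, can be sketched as a simple allocation loop. This is a simplified single-round model for illustration, not the actual YARN Fair Scheduler implementation:

```python
def fair_share(total_slots, demands):
    """Split cluster slots evenly across applications; share an app does
    not need is redistributed to apps that still want more (a simplified
    max-min fair allocation)."""
    shares = {app: 0 for app in demands}
    remaining = total_slots
    active = {app for app, demand in demands.items() if demand > 0}
    while remaining > 0 and active:
        # Offer each still-hungry app an equal slice of what is left.
        per_app = max(1, remaining // len(active))
        for app in sorted(active):
            give = min(per_app, demands[app] - shares[app], remaining)
            shares[app] += give
            remaining -= give
            if shares[app] == demands[app]:
                active.discard(app)  # fully satisfied, stop offering slots
            if remaining == 0:
                break
    return shares

# Three jobs competing for 12 slots: the equal share is 4 each, but job C
# only needs 2, so its leftover capacity flows to A and B.
print(fair_share(12, {"A": 10, "B": 10, "C": 2}))
# {'A': 5, 'B': 5, 'C': 2}
```

Under FIFO, job A would have taken all 12 slots until it finished; under the fair model, all three jobs make progress concurrently.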