Overview of Hadoop and HDFS

www.oralytics.com
t : @brendantierney
e : brendan.tierney@oralytics.com

Introduction, Background to Hadoop and HDFS
Brendan Tierney

What is Big Data?
O’Reilly Radar definition:
•  Big data is when the size of the data itself becomes part of the problem
EMC/IDC definition:
•  Big Data technologies describe a new generation of technologies and
architectures, designed to economically extract value from very large
volumes of a wide variety of data, by enabling high velocity capture,
discovery and/or analysis
McKinsey definition:
•  Big Data refers to datasets whose size is beyond the ability of typical
database software tools to capture, store, manage and analyse
http://www.oreilly.com/data/free/big-data-now-2012.csp
http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
http://csrc.nist.gov/groups/SMA/forum/documents/june2012presentations/fcsm_june2012_cooper_mell.pdf

Big Data
Some companies continue to generate large amounts of data:
•  Facebook ~ 6 billion messages per day
•  eBay ~ 2 billion page views a day, ~ 9 Petabytes of storage
•  Satellite images by Skybox Imaging ~ 1 Terabyte per day
•  These numbers were probably out of date before I finished writing this slide
Important: this applies to some companies, not all companies
Big Data is part of their data management architecture. It will not replace existing DBs etc.

Basic idea
•  The basic idea behind the phrase Big Data is that everything we do is increasingly
leaving a digital trace (data) which we can use and analyse
•  Big Data therefore refers to our ability to make use of ever increasing volumes of
data
Traditional data storage methods can be a challenge. Why?

Big Data
[Figures for 2013, 2014 and 2015 not reproduced]
2014: Where is Predictive Analytics?

Hadoop
•  Existing tools were not designed to handle such large amounts of data
•  “The Apache™ Hadoop™ project develops open-source software for reliable,
scalable, distributed computing.”
•  http://hadoop.apache.org
   –  Process Big Data on clusters of commodity hardware
   –  Vibrant open-source community
   –  Many products and tools reside on top of Hadoop

Who is using Hadoop in Ireland?
•  Big websites
•  Big telcos
•  Big banks
•  Big financial
•  CERN
•  Big ...

Access Speeds?
1990:
•  Typical drive ~ 1370MB
•  Transfer speed ~ 4.4MB/s
•  Read the whole drive in ~ 5 mins
2010:
•  Typical drive ~ 1TB
•  Transfer speed ~ 100MB/s
•  Read the whole drive in ~ 2.5 hrs
Hadoop:
•  100 drives working at the same time can read 1TB of data in ~ 2 minutes
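
The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1TB = 1,000,000MB) and ideal, perfectly parallel reads with no seek or network overhead; the function name read_time_seconds is illustrative only.

```python
def read_time_seconds(size_mb: float, speed_mb_s: float, drives: int = 1) -> float:
    """Time to read size_mb when the data is spread evenly over `drives` disks."""
    return size_mb / (speed_mb_s * drives)


t_1990 = read_time_seconds(1370, 4.4)                        # one 1990 drive
t_2010 = read_time_seconds(1_000_000, 100)                   # one 2010 drive
t_parallel = read_time_seconds(1_000_000, 100, drives=100)   # 100 drives at once

print(f"1990: {t_1990 / 60:.1f} min")
print(f"2010: {t_2010 / 3600:.1f} h")
print(f"100 drives: {t_parallel / 60:.1f} min")
```

Under these assumptions the single 2010 drive needs 10,000 seconds, while 100 drives reading in parallel need 100 seconds, which is where the "about 2 minutes" figure comes from.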

Scaling issue
•  Scale-Up is harder and more expensive (“It Depends” needs to be applied)
   –  Add additional resources to an existing node (CPU, RAM)
   –  Moore’s Law can’t keep up with data growth
   –  New units must be purchased if the required resources cannot be added
   –  Also known as scaling vertically
•  Scale-Out
   –  Add more nodes/machines to an existing distributed application
   –  Software layer is designed for node addition or removal
   –  Hadoop takes this approach - a set of nodes are bonded together as a single
distributed system
   –  Very easy to scale down as well

Hadoop Principles
•  Scale-Out rather than Scale-Up
•  Bring code to data rather than data to code
•  Deal with failures – they are common
•  Abstract complexity of distributed and concurrent applications
•  Self managing
•  Auto parallel processing

Big Data – Example Applications
Not all of these are using Hadoop or require Hadoop!

Hadoop Cluster
•  A set of “cheap” commodity hardware
   –  Networked together
   –  Resides in the same location
   –  Set of servers in a set of racks in a data center
•  “Cheap” Commodity Server Hardware
   –  No need for super-computers, use unreliable commodity hardware
   –  Not desktops
Yes, you can build a Hadoop cluster using Raspberry Pis

Abstracting Complexity
•  Distributed Computing is HARD WORK
•  Hadoop abstracts many complexities in distributed and concurrent applications
   –  Defines a small number of components
   –  Provides simple and well-defined interfaces for interactions between these
components
•  Frees developers from worrying about system-level challenges
   –  race conditions, data starvation
   –  processing pipelines, data partitioning, code distribution, etc.
•  Allows developers to focus on application development and business logic

Hadoop vs RDBMS
•  Always keep the phrase “It Depends” in mind when discussing Big Data
•  Hadoop != RDBMS
•  Hadoop will not replace RDBMS
•  Hadoop is part of your data management architecture
•  and only if it is needed!

                      RDBMS                      Hadoop
Data size             Gigabytes                  Petabytes
Access                Interactive & Batch        Batch
Updates               Read & write many times    Write once, read many times
Integrity             High                       Low
Scaling               Non-linear                 Linear
Data representation   Structured                 Unstructured, semi-structured

Current trends for Hadoop

Working together
•  Hadoop and RDBMS frequently complement each other within an architecture
•  For example, a website that
•  has a small number of users
•  produces a large amount of audit logs

Hadoop Ecosystem

Hadoop Ecosystems

Hadoop Distributions
•  Large number of independent products (Apache projects)
•  Can be challenging to get all/some of these to work together
•  We will be working with Hadoop, installing and using some products
•  Hadoop Distributions aim to resolve version incompatibilities
•  Distribution Vendor will
   –  Integration test a set of Hadoop products
   –  Package Hadoop products in various installation formats (Linux packages, tarballs, etc.)
•  Distributions may provide additional scripts to execute Hadoop
•  Some vendors may choose to backport features and bug fixes made by Apache
•  Typically vendors will employ Hadoop committers, so fixes for the bugs they find
will make it into Apache’s repository

Hadoop Distributions
•  Cloudera Distribution for Hadoop (CDH)
•  Check out the pre-built VM with most of Cloudera products (Hadoop, etc)
•  http://www.cloudera.com/downloads/quickstart_vms/5-8.html
•  MapR Distribution
•  Check out the MapR Sandbox VM
•  https://www.mapr.com/products/mapr-sandbox-hadoop
•  Hortonworks Data Platform (HDP)
•  Check out the Hortonworks Sandbox VM
•  http://hortonworks.com/products/sandbox/
•  Oracle Big Data Appliance
•  Check out a pre-built VM with Hadoop, Oracle and lots of other tools all installed
and configured for you to use
•  http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html

Hadoop - “move-code-to-data” approach
•  Data is distributed among the nodes as it is initially stored in the system
•  Data is replicated multiple times on the system for increased reliability & availability
•  Master allocates work to nodes
•  Computation happens on the nodes where the data is stored - data locality
•  Nodes work in parallel each on their own part of the overall dataset
•  Nodes are independent and self-sufficient - shared-nothing architecture
•  If a node fails, master detects the failure and re-assigns work to other nodes
•  If a failed node restarts, it is automatically added back into the system and
assigned new tasks
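
The move-code-to-data idea can be sketched as a toy scheduler. This is only an illustration under simplifying assumptions: the function assign_tasks, the block ids and the node names are all hypothetical, and real Hadoop scheduling is considerably more involved.

```python
def assign_tasks(block_locations, live_nodes):
    """Assign each block's task to a live node that already stores the block.

    block_locations: dict mapping block id -> list of nodes holding a replica.
    live_nodes: set of nodes currently available.
    Returns a dict block id -> chosen node (the least-loaded local node).
    """
    load = {node: 0 for node in live_nodes}
    assignment = {}
    for block, replicas in sorted(block_locations.items()):
        local = [n for n in replicas if n in live_nodes]
        if not local:
            raise RuntimeError(f"no live replica for {block}")
        # Prefer the local node with the fewest tasks; ties broken by name.
        chosen = min(local, key=lambda n: (load[n], n))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


# Hypothetical cluster: three blocks, each replicated on three of four nodes.
blocks = {
    "blk_1": ["node1", "node2", "node3"],
    "blk_2": ["node2", "node3", "node4"],
    "blk_3": ["node1", "node3", "node4"],
}

plan = assign_tasks(blocks, {"node1", "node2", "node3", "node4"})
# If node1 fails, the master re-plans using the surviving replicas.
replan = assign_tasks(blocks, {"node2", "node3", "node4"})
print(plan)
print(replan)
```

Because every block has replicas on other nodes, losing node1 only changes where the work runs; no data has to be copied before the tasks can restart.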

HDFS
•  A distributed file system modelled on the Google File System (GFS) 
[http://research.google.com/archive/gfs.html]
•  Data is split into blocks, typically 64MB or 128MB in size, spread across many
nodes
•  Works better on large files >= 1 HDFS block in size
•  Each block is replicated to a number of nodes (typically 3)
•  ensures reliability and availability
•  Files in HDFS are write once - no random writes to files allowed
•  HDFS is optimised for large streaming reads of files rather than random access
•  see HIVE later on for more DBMS-type access to HDFS files....
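
A rough sketch of how block size and replication translate into numbers. The function name hdfs_footprint is made up for illustration, and the defaults (128MB blocks, 3 replicas) are simply the typical values quoted above, not values read from a real cluster.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, raw storage in MB) for one file.

    The last block only occupies as much space as the data it holds,
    so raw storage is the file size times the replication factor.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_mb = file_size_mb * replication
    return blocks, raw_mb


print(hdfs_footprint(1024))                     # 1GB file, 128MB blocks
print(hdfs_footprint(1024, block_size_mb=64))   # same file, 64MB blocks
```

Halving the block size doubles the number of blocks to track but does not change the raw storage used.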

HDFS is good for
•  Storing large files
•  Terabytes, Petabytes, etc...
•  Millions rather than billions of files
•  100MB or more per file
•  Streaming data
•  Unstructured data (in reality, data with mixed structure)
•  Write once and read-many times patterns
•  Schema on Read (RDBMS = schema on write)
•  Huge time saving at data write time
•  BUT !!!
•  Optimized for streaming reads rather than random reads
•  “Cheap” Commodity Hardware
•  No need for super-computers, use less reliable commodity hardware

HDFS is not so good at
•  Low-latency reads
   –  High-throughput rather than low latency for small chunks of data
   –  HBase and other DBs can address this issue (?)
•  Large numbers of small files
   –  Better for millions of large files instead of billions of small files
   –  Block size of 128MB or 256MB
   –  For example each file can be 100MB or more
•  Multiple writers
   –  Single writer per file
   –  Writes only at the end of file, no support for arbitrary offsets
•  Time needed for replication
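
One reason small files hurt is that the NameNode keeps an entry for every file and every block in memory. The sketch below counts those entries for the same 1TB of data stored as large or small files. The 150 bytes per entry is a commonly quoted rule of thumb and an assumption here, not a figure from these slides; the function name namenode_objects is illustrative.

```python
import math

BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a figure from the slides


def namenode_objects(total_mb, file_size_mb, block_size_mb=128):
    """Count file entries plus block entries needed to store total_mb of data."""
    files = math.ceil(total_mb / file_size_mb)
    blocks_per_file = math.ceil(file_size_mb / block_size_mb)
    return files * (1 + blocks_per_file)


large = namenode_objects(1_000_000, 128)   # 1TB stored as 128MB files
small = namenode_objects(1_000_000, 1)     # 1TB stored as 1MB files

print(large, "objects,", large * BYTES_PER_OBJECT / 1e6, "MB of metadata")
print(small, "objects,", small * BYTES_PER_OBJECT / 1e6, "MB of metadata")
```

The data volume is identical in both cases; only the number of entries the NameNode must track changes.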

HDFS
•  Two types of nodes in an HDFS cluster
   –  NameNode - the master node
   –  DataNodes - slave or worker nodes
•  NameNode manages the file system
   –  keeps track of the metadata - which blocks make up a file (using 2 files - the namespace
image and the edit log)
   –  knows on which DataNodes the blocks are stored
•  DataNodes do the work
   –  store the blocks
   –  retrieve blocks when requested to (by the client or the NameNode)
   –  report back to the NameNode periodically with the list of blocks that they are
storing
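
The division of labour above can be sketched as a toy model. ToyNameNode, its method names and the file path are all hypothetical; this is not the HDFS API, only an illustration of metadata being rebuilt from periodic block reports.

```python
from collections import defaultdict


class ToyNameNode:
    """Toy model: holds metadata only, never the file data itself."""

    def __init__(self):
        self.file_blocks = {}                  # file path -> ordered block ids
        self.block_nodes = defaultdict(set)    # block id -> DataNodes holding it

    def add_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """Replace what is known about `datanode` with its latest report."""
        for nodes in self.block_nodes.values():
            nodes.discard(datanode)
        for block_id in block_ids:
            self.block_nodes[block_id].add(datanode)

    def locate(self, path):
        """Return [(block id, sorted DataNodes)] for a file."""
        return [(b, sorted(self.block_nodes[b])) for b in self.file_blocks[path]]


nn = ToyNameNode()
nn.add_file("/logs/audit.log", ["blk_1", "blk_2"])
nn.block_report("dn1", ["blk_1", "blk_2"])
nn.block_report("dn2", ["blk_1"])
nn.block_report("dn3", ["blk_2"])
locations = nn.locate("/logs/audit.log")
print(locations)
```

A client would use the result of locate to contact the DataNodes directly; the NameNode itself never serves block contents.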

HDFS
•  When a client application wants to read a file...
   –  it communicates with the NameNode to determine which blocks make up the file,
and on which DataNodes the blocks reside
   –  it then communicates directly with the DataNodes
•  NameNode is the single point of failure of a Hadoop system
   –  back up periodically to remote NFS (set up as part of Hadoop configuration)
   –  use a Secondary NameNode
   –  not the same as the NameNode
   –  periodically merges the namespace image with the edit log and maintains a copy
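
The merge performed by the Secondary NameNode can be pictured as replaying a log onto a snapshot. In this sketch the operation names ("create", "delete"), the tuple layout and the file paths are invented for illustration; the real on-disk formats are different.

```python
def apply_edits(namespace_image, edit_log):
    """Replay edit-log entries onto a copy of the namespace image."""
    merged = dict(namespace_image)
    for op, path, blocks in edit_log:
        if op == "create":
            merged[path] = list(blocks)
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


# Hypothetical starting image and the changes logged since it was written.
image = {"/data/a.csv": ["blk_1"]}
edits = [
    ("create", "/data/b.csv", ["blk_2", "blk_3"]),
    ("delete", "/data/a.csv", None),
]

checkpoint = apply_edits(image, edits)  # new image; the edit log can start again empty
print(checkpoint)
```

Keeping the merged copy up to date means a restart has only a short edit log to replay, which is the practical value of the periodic merge.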

HDFS Architecture [figure from Hadoop in Practice, Alex Holmes]

Files and Blocks
•  Files are split into blocks (single unit of storage)
   –  Managed by the NameNode, stored by the DataNodes
   –  Transparent to user
•  Replicated across machines at load time
   –  Same block is stored on multiple machines
   –  Good for fault-tolerance and access
   –  Can lead to inconsistent reads
   –  Default replication is 3
Have you ever experienced inconsistent reads?
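
Why replication helps with fault tolerance can be shown with a deliberately simplified placement sketch. It uses plain round-robin placement over hypothetical node names; real HDFS placement also takes racks into account, which is ignored here.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Round-robin placement: each block goes to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def readable_blocks(placement, failed):
    """Blocks that still have at least one replica on a surviving node."""
    return [b for b, reps in placement.items() if any(n not in failed for n in reps)]


nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_replicas(["blk_1", "blk_2"], nodes)
survivors = readable_blocks(placement, failed={"dn1", "dn2"})
print(placement)
print(survivors)
```

With three replicas per block, both blocks remain readable after two of the four nodes fail; a block is lost only when every node holding one of its replicas is down.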

HDFS File Writes

HDFS File Reads

Who is using Hadoop in Ireland?
•  List of Cloudera customers in Ireland
•  Citi
•  Allianz
•  Deutsche Bank
•  Ulster Bank
•  Dun & Bradstreet
•  Ryanair
•  BT
•  Vodafone
•  Novartis
•  Airbnb
•  Dell
•  Intel
•  Rockwell Automation
•  Revenue
•  Adecco
•  Experian
•  M&S

Discuss

Hadoop is not FREE :-)
vs
Hadoop is not FREE :-(

Something to think about
