<Note to speakers: The EMC Isilon presenter will cover the first half of the presentation, through slide 24. The EMC Greenplum presenter will cover the second half of the presentation, slides 25–37. Both presenters will participate in the Q&A (with backup from other EMC team members attending the event).>
<To kick off the presentation:> Welcome the audience and thank them for joining us. Introduce yourself and the EMC Greenplum presenter.
Here’s what we’re going to cover in today’s session: Walk through the agenda.
Isilon has been a leading innovator in scale-out NAS for more than 10 years. Isilon scale-out storage is being used today across a wide range of organizations:
- Data-intensive, high-performance computing (HPC) environments such as Life Sciences, Electronic Design Automation, and Media & Entertainment, to name a few examples.
- Traditional enterprise IT environments: Isilon’s storage systems are used to support a variety of large-scale use cases including archiving, home directories and file shares, virtualization (Tier 3 and Tier 4), and business analytics (Hadoop).
In total, Isilon’s scale-out storage solutions are being used by over 3,000 organizations around the world today and, thanks to the success that customers have enjoyed, the business is growing rapidly: about 100 percent per year last year. The key engine of customers’ success is the Isilon OneFS operating system. It is instrumental in providing customers with an innovative, scale-out data environment.
Note to Presenter: Here are some additional facts that you may want to point out about Isilon: Isilon was founded more than 10 years ago (as Isilon Systems) and is now recognized as the industry leader in scale-out NAS storage solutions. Isilon joined the EMC team in December 2010 (when EMC acquired Isilon Systems). Since then, Isilon’s scale-out storage business has continued to grow rapidly, being adopted in large enterprises across a wide range of industries. The Gartner report can be found here: http://www.gartner.com/id=1960515 (abstract only)
This slide shows just a sampling of customers who are benefiting from Isilon scale-out storage.
One reason Hadoop has emerged as an important technology is that it is an innovative Big Data analytics engine designed specifically for massively large data volumes. With it, organizations can greatly reduce the time required to derive valuable insight from an enterprise’s dataset. By adopting Hadoop to store and analyze massive data volumes, enterprises are gaining an agile new platform to deliver new insights and identify new opportunities to accelerate their business. Hadoop has also been designed to tackle analytics for unstructured data. This is significant because unstructured data is the dominant area of data growth projected for the foreseeable future. Now let’s look at how the adoption of Hadoop is evolving.
The Isilon OneFS operating system provides the intelligence behind all Isilon scale-out storage systems. It combines the three layers of traditional storage architectures—file system, volume manager, and data protection—into one unified software layer, creating a single intelligent file system that spans all nodes within an Isilon cluster.
Note to Presenter: Click now in Slide Show mode for animation.
OneFS provides a number of important advantages:
- A single file system for great ease of management
- Unmatched efficiency, with over 80 percent storage utilization plus automated storage tiering to gain additional efficiencies
- High-performance NAS
- Easy, “grow as you go” flexibility
- Linear scalability that lets you scale performance and capacity to over 15 PB
Putting It All Together. The Isilon IQ X-Series, powered by the OneFS® operating system, uses Isilon's scale-out storage architecture to speed access to massive amounts of critical data while dramatically reducing cost and complexity. Isilon delivers a flexible solution to accelerate your high-concurrency and sequential-throughput applications. With SSD technology for file-system metadata, the Isilon X-Series significantly accelerates namespace-intensive operations. S-Series nodes provide balanced throughput and performance, and the NL nodes form the foundation for nearline storage and archive. Isilon’s modular architecture and intelligent software make deployment and management simple. You can have an Isilon cluster online in less than 10 minutes, without time-consuming, expensive integration services. Scale a cluster in performance and capacity in about one minute, all within a single pool of storage with a global namespace, eliminating the need to support multiple volumes and file systems. Isilon’s suite of applications then works together to provide the data management and protection capabilities required by corporate IT—from the front-end intelligence that eliminates client and data migration to quota management for file shares. SnapshotIQ and SyncIQ work in concert to protect and replicate important data for local and remote archive, while SnapLock provides for the immutability of data. And finally, the backup accelerator speeds file replication to tape with a scalable, parallel infrastructure that ensures backup windows and recovery time objectives are always met.
In this section, we’re going to identify and describe the key technology challenges of Hadoop, especially when deployed using direct-attached storage (DAS).
There are five basic roles in every Hadoop environment: HDFS is made up of the NameNode, Secondary NameNode, and DataNode roles. MapReduce comprises the JobTracker and TaskTracker.
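The five roles above can be summarized in a small lookup table, grouped by the layer each belongs to (a sketch for reference, not part of any Hadoop API):

```python
# The five daemon roles of a classic Hadoop cluster, grouped by layer.
HADOOP_ROLES = {
    "HDFS": ["NameNode", "Secondary NameNode", "DataNode"],
    "MapReduce": ["JobTracker", "TaskTracker"],
}

# Total role count across both layers.
total = sum(len(roles) for roles in HADOOP_ROLES.values())
print(total)  # -> 5
```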
The JobTracker is effectively the queue master of a Hadoop MapReduce environment. It schedules jobs, distributes tasks across available TaskTrackers, and allows administrators to get a glimpse into the overall activity of a Hadoop environment.
To go into more detail, the NameNode is effectively the metadata server for all HDFS data and data blocks. In large Hadoop clusters, this role is run on a dedicated host, typically with a large amount of DRAM, because all metadata for the entire HDFS namespace is stored in local DRAM on this host. As such, traditional Hadoop architectures have limits on the number of objects that can be stored within each HDFS namespace. The NameNode is contacted for every block request, both for reads and writes, and is responsible for making sure data blocks are mirrored to multiple DataNodes, spanning multiple racks.
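As an illustration of the rack-spanning replica placement the NameNode coordinates, here is a minimal Python sketch (not HDFS source code) of the commonly described default policy for a replication factor of 3: the first copy goes to the writer's node, and the second and third go to two nodes on a different rack.

```python
# Illustrative sketch of HDFS-style replica placement (replication = 3):
# replica 1 on the writer's node, replicas 2 and 3 on one remote rack.
import random

def place_replicas(writer_node, writer_rack, nodes_by_rack):
    """nodes_by_rack: dict mapping rack id -> list of datanode names."""
    replicas = [writer_node]                       # replica 1: local node
    remote_racks = [r for r in nodes_by_rack if r != writer_rack]
    rack = random.choice(remote_racks)             # pick one remote rack
    candidates = list(nodes_by_rack[rack])
    random.shuffle(candidates)
    replicas.extend(candidates[:2])                # replicas 2 and 3
    return replicas
```

Because the copies span two racks, the data survives the loss of either whole rack; losing the NameNode's metadata, however, is not covered by this scheme, which is the single-point-of-failure issue discussed later.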
One challenge associated with traditional deployments of Hadoop is that they have largely been done on dedicated infrastructure, not integrated with or connected to any other applications—in effect, a siloed environment, often outside the realm of the IT team. This poses a number of inefficiencies and risks.
<click>
A well-recognized issue with traditional Hadoop deployments is the single-point-of-failure problem with the Hadoop NameNode. In a Hadoop environment, a single NameNode manages the Hadoop file system. If it goes down, the Hadoop environment will immediately go offline. If the NameNode does not come back online, the data stored within all of HDFS is lost and cannot be reconstructed.
<Click to next build slide>
Another issue with traditional Hadoop environments is the lack of enterprise-level data protection. Typical Hadoop deployments do not have rigorous data protection, backup, and recovery capabilities such as snapshots or data replication for disaster recovery (DR) purposes.
<click>
Traditional Hadoop deployments on direct-attached storage (DAS) are also extremely inefficient. It’s not unusual for a DAS environment to operate with a 30–35% storage utilization rate (or less). Compounding this inefficiency is the fact that data is often mirrored (the default is 3 times). In addition to storage inefficiency, this type of infrastructure is very management-intensive.
<click>
Another issue with Hadoop running on direct-attached storage is that server and storage resources must be increased together in lock-step. For example, if more storage resources are required, a new server must be deployed (and vice versa). This rigidity adds further inefficiencies. Another issue is the manual import/export of data that is required in a traditional Hadoop environment. In addition to consuming time and resources (bandwidth), the Hadoop data in typical environments cannot be accessed or shared with other enterprise applications due to the lack of industry-standard protocol support. To address these challenges and to enable enterprises to begin realizing the benefits of Hadoop quickly and easily, EMC has recently introduced an exciting new Hadoop solution.
<click to advance to next slide>
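To make the mirroring point concrete, a few lines of Python show why 3x replication alone caps raw-capacity utilization near one third (the 100 TB raw figure is just an example, not from the slide):

```python
# Effect of Hadoop's default 3x replication on raw-capacity utilization.
raw_tb = 100        # example raw DAS capacity in TB
replication = 3     # Hadoop's default replication factor
usable_tb = raw_tb / replication
utilization_pct = round(usable_tb / raw_tb * 100)
print(utilization_pct)  # -> 33
```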
Isilon is able to “pretend” to be an HDFS cluster: it mimics the NameNode and DataNode protocols to host data. The underlying system is OneFS and does not follow the traditional HDFS scheme. Point HDFS clients (MapReduce, command line, etc.) to the DNS name of the Isilon cluster.
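As a sketch of that last step: HDFS clients find their file system through core-site.xml, so pointing them at the cluster amounts to setting the default file-system URI to the Isilon cluster's DNS (SmartConnect) name. The hostname and port below are illustrative assumptions, not values from a real deployment.

```xml
<!-- Hypothetical core-site.xml fragment; hostname and port are examples. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://isilon-cluster.example.com:8020</value>
</property>
```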
The new EMC solution also eliminates the single-point-of-failure issue. We do this by enabling all nodes in an EMC Isilon storage cluster to become, in effect, NameNodes. This greatly improves the resiliency of your Hadoop environment. The EMC solution for Hadoop also provides reliable, end-to-end data protection for Hadoop data, including snapshotting for backup and recovery and data replication (with SyncIQ) for disaster recovery. Our new Hadoop solution also takes advantage of the outstanding efficiency of EMC Isilon storage systems: with our solutions, customers can achieve 80% or more storage utilization. EMC Hadoop solutions can also scale easily and independently. This means that if you need to add more storage capacity, you don’t need to add another server (and vice versa). With EMC Isilon, you also get the added benefit of linear increases in performance as the scale increases. EMC also recently announced that we are the first vendor to integrate HDFS (the Hadoop Distributed File System) into our storage solutions. This means that with EMC Isilon storage, you can readily use your Hadoop data with other enterprise applications and workloads while eliminating the need to manually move data around as you would with direct-attached storage.
Math logic behind the 28-hour figure: 100 TB = 100,000,000 MB. A 10 GbE connection can transfer approximately 1 GB per second (not including spindle speeds in the calculation). So 100 TB / 1 GB per second gives the number of seconds to transfer; divide by 60 seconds and then by 60 minutes to get roughly 28 hours.
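The arithmetic above can be checked in a few lines of Python (the 1 GB/s rate is the approximation stated above):

```python
# Back-of-the-envelope check: moving 100 TB over 10 GbE at ~1 GB/s.
data_gb = 100 * 1000       # 100 TB expressed in GB
rate_gb_per_s = 1          # ~1 GB/s effective throughput
seconds = data_gb / rate_gb_per_s
hours = seconds / 60 / 60  # divide by 60 seconds, then 60 minutes
print(round(hours))        # -> 28
```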
Customer Profile: http://www.emc.com/collateral/customer-profiles/h11528-return-path-cp.pdf
Company background: www.returnpath.com
Return Path is the worldwide leader in email intelligence, serving Internet service providers (ISPs), businesses, and individuals. The company’s email intelligence solutions process and analyze massive volumes of data to maximize email performance, ensure email delivery, and protect users from spam and other abuse.
Previous Environment & Existing Applications: Previously a hodgepodge of more than 25 different storage systems, including server-attached storage, shared Oracle appliances, and NetApp and Hewlett-Packard systems.
Company Challenges:
- Data growing 25–50 terabytes per year
- Limited performance and capacity to support intensive Hadoop analytics
- Disparate systems lacked performance and capacity
EMC Solution & Important Benefits to Customer:
- EMC Isilon X-Series
- Hadoop, internally developed email intelligence solutions
- SmartPools, SmartConnect, SmartQuotas, InsightIQ
Results:
- Enables unconstrained access to email data for analysis
- Reduces shared storage data center footprint by 30 percent
- Improves availability and reliability for Hadoop analytics
- Achieves faster development and time to market for new products
- Estimates five-year cost savings of $350,000 from lower power, cooling, and maintenance
- Shortens weekly administration time by more than 35 percent
Quotes:
“Isilon serves NFS data across multiple product suites and makes it easily accessible to our Hadoop analytics team. That’s a significant business enabler, allowing Return Path to develop customer solutions much faster.” —Diz Carter, Vice President of Infrastructure Operations, Return Path
“Considering our projected growth, we were able to make a strong business case for Isilon,” says Carter.
“Looking out over five years, we estimate greater than $350,000 in savings from lower power, cooling, and maintenance requirements.”
“We went from having boxes on the dock to serving up 180 terabytes in just over three hours,” says Carter. “I’ve never come across another solution as easy to implement as Isilon.”
With Isilon, Return Path now has a single repository for all its Big Data, accessible to email analysts, product development teams, and external customers. Previously, performing analytics on email data residing in shared storage required making a separate copy of the data set and manually moving it to the Hadoop environment. Today, Isilon delivers real-time data to Return Path’s end-user applications while providing seamless integration with Hadoop for back-end data analytics, boosting customer satisfaction and business productivity.
“To have all this data being generated by our email intelligence products, but no way to access it directly by Hadoop, was a major hindrance,” Carter remarks. “Now, Isilon serves NFS data across multiple product suites and makes it easily accessible to our Hadoop analytics team. That’s a huge business enabler because we're able to develop products much faster.”