Sam fineberg big_data_hadoop_storage_options_3v9-1
Pramod Gosavi
1.
Big Data Storage Options for Hadoop
Sam Fineberg / HP Storage Division
2.
Big Data Storage Options for Hadoop
© 2013 Storage Networking Industry Association. All Rights Reserved.
SNIA Legal Notice
- The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted.
- Member companies and individual members may use this material in presentations and literature under the following conditions:
  - Any slide or slides used must be reproduced in their entirety without modification
  - The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations.
- This presentation is a project of the SNIA Education Committee.
- Neither the author nor the presenter is an attorney and nothing in this presentation is intended to be, or should be construed as legal advice or an opinion of counsel. If you need legal advice or a legal opinion please contact your attorney.
- The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information.
NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.
3.
Abstract
Big Data Storage Options for Hadoop
The Hadoop system was developed to enable the transformation and analysis of vast amounts of structured and unstructured information. It does this by implementing an algorithm called MapReduce across compute clusters that may consist of hundreds or even thousands of nodes. This presentation looks at Hadoop from a storage perspective. The tutorial describes the key aspects of Hadoop storage, the built-in Hadoop file system (HDFS), and some other options for Hadoop storage that exist in the commercial and open source communities.
4.
Overview
- Introduction
  - What is Hadoop
  - What is MapReduce
  - How does Hadoop use storage
  - Distributed filesystem concepts
- Storage options
  - Native Hadoop – HDFS
    - On direct attached storage
    - On networked (SAN) storage
  - Alternative distributed filesystems
  - Cloud object storage
  - Emerging options
6.
What is Hadoop?
- A scalable fault-tolerant distributed system for data storage and processing
- Core Hadoop has two main components
  - MapReduce: fault-tolerant distributed processing
    - Programming model for processing sets of data
    - Mapping inputs to outputs and reducing the output of multiple Mappers to one (or a few) answer(s)
  - Hadoop Distributed File System (HDFS): high-bandwidth clustered storage
    - Distributed file system optimized for large files
- Operates on unstructured and structured data
- A large and active ecosystem
- Written in Java
- Open source under the friendly Apache License
- http://hadoop.apache.org
7.
What is MapReduce?
- A method for distributing a task across multiple nodes
- Each node processes data stored on that node
- Consists of two developer-created phases
  1. Map
  2. Reduce
- In between Map and Reduce is the Shuffle and Sort
8.
MapReduce
[Figure: MapReduce diagram. © Google, from Google Code University, http://code.google.com/edu/parallel/mapreduce-tutorial.html]
9.
MapReduce Operation
[Figure: MapReduce operation answering the question "What was the max temperature for the last century?"]
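The maximum-temperature question can be walked through in a few lines of plain Python. This is a minimal in-memory sketch of the three phases, not Hadoop code: the record layout `(year, temperature)`, the sample values, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical input records: (year, temperature reading).
RECORDS = [(1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)]


def map_phase(record):
    """Map: emit one (key, value) pair per input record."""
    year, temperature = record
    yield year, temperature


def shuffle_and_sort(pairs):
    """Shuffle and sort: group all values by key, then order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single answer."""
    return key, max(values)


def run_job(records):
    """Run map, shuffle/sort, and reduce in sequence on one machine."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(pairs)]


result = run_job(RECORDS)
print(result)  # [(1949, 111), (1950, 22)]
```

In a real cluster the map calls run on many nodes in parallel and the grouped values travel over the network; only the data flow is the same.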
10.
Key MapReduce Terminology and Concepts
- A user runs a client program (typically a Java application) on a client computer
- The client program submits a job to Hadoop
- The job is sent to the JobTracker process on the Master Node
- Each Slave Node runs a process called the TaskTracker
- The JobTracker instructs TaskTrackers to run and monitor tasks
- A task attempt is an instance of a task running on a slave node
- There will be at least as many task attempts as there are tasks which need to be performed
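The last point, that attempts can outnumber tasks, is easiest to see with failures. The following is a toy sketch under stated assumptions: the task names, the `failing_first_attempts` table, and the `max_attempts` limit are all hypothetical, and the loop stands in for what the JobTracker and TaskTrackers do across machines.

```python
def run_job(tasks, failing_first_attempts, max_attempts=4):
    """Run every task until it succeeds, recording each attempt.

    `failing_first_attempts` maps a task name to how many of its first
    attempts fail (a stand-in for node or disk failures).
    """
    attempts = []  # one (task, attempt_number, succeeded) entry per attempt
    for task in tasks:
        failures_left = failing_first_attempts.get(task, 0)
        for attempt_number in range(1, max_attempts + 1):
            succeeded = failures_left == 0
            attempts.append((task, attempt_number, succeeded))
            if succeeded:
                break
            failures_left -= 1
        else:
            # The loop ran out of attempts without a success.
            raise RuntimeError(f"task {task!r} failed {max_attempts} times")
    return attempts


tasks = ["map-0", "map-1", "reduce-0"]
attempts = run_job(tasks, failing_first_attempts={"map-1": 2})
print(len(tasks), len(attempts))  # 3 5
```

Three tasks produce five attempts here because `map-1` needs three tries.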
11.
MapReduce in Hadoop
[Figure: the MapReduce diagram annotated with Hadoop terms: Task Tracker, Input (HDFS), Output (HDFS), Mapper, Reducer, Worker = Tasks. © Google, from Google Code University, http://code.google.com/edu/parallel/mapreduce-tutorial.html]
12.
MapReduce: Basic Concepts
- Each Mapper processes a single input split from HDFS
- Hadoop passes one record at a time to the developer’s Map code
- Each record has a key and a value
- Intermediate data is written by the Mapper to local disk (not HDFS) on each of the individual cluster nodes
  - Intermediate data is not reliable or globally accessible
- During the shuffle and sort phase, all values associated with the same intermediate key are transferred to the same Reducer
- The Reducer is passed each key and a list of all its values
- Output from Reducers is written to HDFS
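The guarantee that every value for a key reaches the same Reducer comes from partitioning by key. The sketch below illustrates the idea with a stable hash taken modulo the number of reducers. The CRC32 hash is a stand-in chosen for this example, and the function names and sample data are assumptions, not Hadoop's own partitioner.

```python
import zlib


def partition(key, num_reducers):
    """Pick the reducer for a key using a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    """Route every (key, value) pair from every mapper to one reducer."""
    reducer_inputs = [dict() for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in pairs:
            bucket = reducer_inputs[partition(key, num_reducers)]
            bucket.setdefault(key, []).append(value)
    return reducer_inputs


# Hypothetical intermediate output from two mappers.
mapper_outputs = [
    [("1949", 111), ("1950", 0)],
    [("1950", 22), ("1949", 78)],
]
reducer_inputs = shuffle(mapper_outputs, num_reducers=2)
for reducer_id, bucket in enumerate(reducer_inputs):
    print(reducer_id, bucket)
```

Because the reducer is a pure function of the key, two mappers on different nodes send the same key to the same place without coordinating.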
13.
What is a Distributed File System?
- A distributed file system is a file system that allows access to files from multiple hosts across a network
  - A network filesystem (NFS/CIFS) is a type of distributed file system, more tuned for file sharing than distributed computation
  - Distributed computing applications, like Hadoop, utilize a tightly coupled distributed file system
- Tightly coupled distributed filesystems
  - Provide a single global namespace across all nodes
  - Support multiple initiators, multiple disk nodes, multiple access to files (file parallelism)
  - Examples include HDFS, GlusterFS, pNFS, as well as many commercial and research systems
15.
Hadoop Distributed File System - HDFS
- Architecture
  - Java application, not deeply integrated with the server OS
  - Layered on top of a standard FS (e.g., ext, xfs, etc.)
  - Must use Hadoop or a special library to access HDFS files
  - Shared-nothing, all nodes have direct attached disks
  - Write once filesystem: must copy a file to modify it
- HDFS basics
  - Data is organized into files & directories
  - Files are divided into 64-128MB blocks, distributed across nodes
  - Block placement is handled by the “NameNode”
    - Placement coordinated with job tracker = writes always co-located, reads co-located with computation whenever possible
  - Blocks replicated to handle failure, replica blocks can be used by compute tasks
  - Checksums used to ensure data integrity
- Replication: one and only strategy for error handling, recovery and fault tolerance
  - Self Healing
  - Makes multiple copies (typically 3)
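The block size and replication figures translate directly into block counts and raw disk usage. The arithmetic below is a small sketch using the slide's numbers (64 MB and 128 MB blocks, 3 copies). The function names are made up for this example, and it assumes a short final block occupies only its actual bytes.

```python
MIB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes=64 * MIB):
    """Number of blocks a file is divided into (integer ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)


def raw_bytes_stored(file_size_bytes, replication=3):
    """Raw disk consumed across the cluster once every block is replicated."""
    return file_size_bytes * replication


one_gib = 1024 * MIB
print(block_count(one_gib))               # 16
print(block_count(one_gib, 128 * MIB))    # 8
print(block_count(one_gib + 1))           # 17
print(raw_bytes_stored(one_gib) // MIB)   # 3072
```

A 1 GiB file therefore becomes 16 blocks at 64 MB each, and those blocks consume 3 GiB of raw disk at the typical replication factor of 3.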
16.
HDFS on local DAS
- A Hadoop cluster consisting of many nodes, each of which has local direct attached storage (DAS)
  - Disks are running a standard file system (e.g., ext, xfs, etc.)
  - HDFS blocks are stored as files in a special directory
- Disks attached directly, for example, with SAS or SATA
  - No storage is shared, disks only attach to a single node
- The most common use case for Hadoop
  - Original design point for Hadoop/HDFS
  - Can work with cheap unreliable hardware
  - Some very large systems utilize this model
17.
HDFS on local DAS
[Figure: compute nodes are part of HDFS, with data spread across nodes; nodes communicate over the HDFS protocol]
18.
HDFS File Write Operation
[Figure: HDFS file write with 3-way replication. Image source: Hadoop: The Definitive Guide, Tom White, O’Reilly]
19.
HDFS File Read Operation
[Figure: HDFS file read. Image source: Hadoop: The Definitive Guide, Tom White, O’Reilly]
20.
HDFS on local DAS - Pros and Cons
- Pros
  - Writes are highly parallel
    - Large files are broken into many parts, distributed across the cluster
    - Three copies of any file block, one written local, two remote
    - Not a simple round-robin scheme, tuned for Hadoop jobs
  - Job tracker attempts to make reads local
    - If possible, tasks are scheduled on the same node as the needed file segment
    - Duplicate file segments are also readable, can be used for tasks too
- Cons
  - Not a replacement for general purpose storage
    - Not a kernel-based POSIX filesystem
    - Incompatible with standard applications and utilities (but future versions of Hadoop are adding support for other application models)
  - High replication cost compared with RAID/shared disk
  - The NameNode keeps track of data location
    - Single point of failure (SPOF): location data is critical and must be protected
    - Scalability bottleneck (everything has to be in memory)
    - Improvements to NameNode are in the works
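The "one local, two remote" placement can be sketched as a small function. The slide does not say how the two remote nodes are chosen. This sketch assumes the rack-aware layout commonly described for HDFS, with the second and third copies on two different nodes of one other rack. The topology, node names, and deterministic choices are hypothetical; a real cluster also weighs load and free space.

```python
def place_replicas(writer, topology):
    """Return three nodes for one block: one local, two remote.

    `topology` maps a rack name to its list of node names.
    Assumes at least one other rack with two or more nodes exists.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]
    # Replica 1: the node running the writer, so the first write is local.
    first = writer
    # Replicas 2 and 3: two different nodes on one other rack, so the
    # block survives the loss of the writer's whole rack.
    remote_rack = next(
        rack for rack in sorted(topology)
        if rack != local_rack and len(topology[rack]) >= 2
    )
    second, third = topology[remote_rack][:2]
    return [first, second, third]


topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", topology))  # ['n1', 'n3', 'n4']
```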
21.
Other HDFS storage options
- HDFS on Storage Area Network (SAN) attached storage
  - A lot like DAS
    - Disks are logical volumes in storage array(s), accessed across a SAN
    - HDFS doesn’t know the difference: still appears like a locally attached disk
  - SAN attached arrays aren’t the same as DAS
    - Array has its own cache, redundancy, replication, etc.
    - Any node on the SAN can access any array volume
    - So a new node can be assigned to a failed node’s data
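Because any node on the SAN can reach any array volume, recovering from a node failure is a matter of reassigning volumes, not re-copying data. The bookkeeping might look like the sketch below. The volume and node names and the least-loaded policy are assumptions for illustration, and the sketch ignores the array-side and HDFS-side steps a real failover would need.

```python
def reassign_volumes(volume_owner, failed_node, surviving_nodes):
    """Return a new volume -> node map with the failed node's volumes moved.

    Each orphaned volume goes to the surviving node that currently owns
    the fewest volumes (ties broken by node name).
    """
    new_owner = dict(volume_owner)
    load = {node: 0 for node in surviving_nodes}
    for node in volume_owner.values():
        if node in load:
            load[node] += 1
    for volume in sorted(volume_owner):
        if volume_owner[volume] == failed_node:
            target = min(surviving_nodes, key=lambda n: (load[n], n))
            new_owner[volume] = target
            load[target] += 1
    return new_owner


volume_owner = {"lun0": "node-a", "lun1": "node-a", "lun2": "node-b", "lun3": "node-c"}
after = reassign_volumes(volume_owner, "node-a", ["node-b", "node-c"])
print(after)
```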
22.
HDFS with SAN Storage
[Figure: Hadoop cluster compute nodes connected to storage arrays over an iSCSI or FC SAN]
23.
HDFS File Write Operation
[Figure: HDFS file write with array replication]
Array can provide redundancy, no need to replicate data across data nodes
24.
HDFS File Read Operation
Array redundancy means only a single source for data
25.
SAN for Hadoop Storage
- Instead of storing data on direct attached local disks, data is in one or more arrays attached to data nodes through a SAN
  - Looks like local storage to data nodes
  - Hadoop still utilizes HDFS
- Pros
  - All the normal advantages of arrays
    - RAID, centralized caching, thin provisioning, other advanced array features
    - Centralized management, easy redistribution of storage
  - Retains advantages of HDFS (as long as array is not over-utilized)
  - Easy failover when compute node dies, can eliminate or reduce 3-way replication
- Cons
  - Cost? It depends
  - Unless multiple arrays are used, scale is limited
    - And with multiple arrays, management and cost advantages are reduced
  - Still have HDFS complexity and manageability issues
27.
Other distributed filesystems
- Kernel-based tightly coupled distributed file system
  - Kernel-based, i.e., no special access libraries, looks like a normal local file system
  - These filesystems have existed for years in high performance computing, scale-out NAS servers, and other scale-out computing environments
  - Many commercial and research examples
- Unlike HDFS, not originally designed for Hadoop
  - Location awareness is part of the file system, no NameNode
  - Works better if functionality is “exposed” to Hadoop
- Compute nodes may or may not have local storage
  - Compute nodes are “part of the storage cluster,” but may be diskless, i.e., equal access to files and global namespace
  - Can tie the filesystem’s location awareness into task tracker to reduce remote storage access
- Remote storage is accessed using a filesystem specific inter-node protocol
  - Single network hop due to filesystem’s location awareness
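Tying location awareness into task scheduling comes down to one preference: run the task on a node that already holds the data, and fall back to a remote read only when no such node is free. A minimal sketch of that rule follows. The function name, node names, and first-free fallback are assumptions, not the behavior of any particular filesystem or of Hadoop's scheduler.

```python
def pick_node(block_locations, free_nodes):
    """Return (node, is_local) for a task that reads one block.

    `block_locations` is the set of nodes holding a copy of the block,
    as reported by the filesystem. `free_nodes` lists nodes with a free
    task slot, in scheduling order.
    """
    if not free_nodes:
        raise ValueError("no free nodes to schedule on")
    for node in free_nodes:
        if node in block_locations:
            return node, True   # data-local: no network read needed
    return free_nodes[0], False  # remote read over the inter-node protocol


print(pick_node({"n1", "n2", "n3"}, ["n4", "n2"]))  # ('n2', True)
print(pick_node({"n1", "n2", "n3"}, ["n4", "n5"]))  # ('n4', False)
```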
28.
Tightly coupled DFS for Hadoop
- General purpose shared file system
  - Implemented in the kernel, single namespace, compatible with most applications (no special library or language)
- Data is distributed across local storage node disks
- Architecturally like HDFS
  - Can utilize same disk options as HDFS
    - Including shared nothing DAS
    - SAN storage
  - Some can also support “shared SAN” storage where raw volumes can be accessed by multiple nodes
    - Failover model: only one node actively uses a volume, another can take over after failure
    - Multiple initiator model: multiple nodes actively use a volume
- Shared nothing option has similar cost/performance to HDFS on DAS
29.
Distributed FS – local disks
[Figure: compute nodes are part of the DFS, with data spread across nodes; nodes communicate over the distributed FS inter-node protocol]
30.
Distributed FS – remote disks
[Figure: compute nodes are distributed FS clients; scale-out nodes are distributed FS servers; they communicate over the distributed FS inter-node protocol]
31.
Remote DFS Write Operation
Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS
32.
Local DFS Write Operation
Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS
33.
Remote DFS Read Operation
Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS
34.
Local DFS Read Operation
Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS
35.
Tightly coupled DFS for Hadoop
- Pros
  - Shared data access, any node can access any data like it is local
  - POSIX compatible, works for non-Hadoop apps just like a local file system
  - Centralized management and administration
  - No NameNode, may have a better block mapping mechanism
  - Compute in-place, same copy can be served via NFS/CIFS
  - Many of the performance benefits of HDFS
- Cons
  - HDFS is highly optimized for Hadoop, unlikely to get same optimization for a general purpose DFS
    - Large file striping is not regular, based on compute distribution
    - Copies are simultaneously readable
  - Strict POSIX compliance leads to unnecessary serialization
    - Hadoop assumes multiple-access to files, however, accesses are on block boundaries and don’t overlap
    - Need to relax POSIX compliance for large files, or just stick with many smaller files
  - Some DFSs have scaling limitations that are worse than HDFS, not designed for “thousands” of nodes
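The point about block-boundary access can be checked mechanically: if each task reads exactly one block-aligned byte range, no two tasks touch the same bytes, so serializing them gains nothing. The sketch below builds those ranges and verifies that none overlap. The file and block sizes are arbitrary small numbers chosen for readability, and the function names are made up for this example.

```python
def block_ranges(file_size, block_size):
    """Return half-open (start, end) byte ranges, one per block."""
    return [
        (start, min(start + block_size, file_size))
        for start in range(0, file_size, block_size)
    ]


def ranges_overlap(a, b):
    """True if two half-open byte ranges share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]


ranges = block_ranges(file_size=300, block_size=128)
print(ranges)  # [(0, 128), (128, 256), (256, 300)]

any_overlap = any(
    ranges_overlap(ranges[i], ranges[j])
    for i in range(len(ranges))
    for j in range(i + 1, len(ranges))
)
print(any_overlap)  # False
```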
37.
Cloud Object Storage for Hadoop
- Uses a REST API like CDMI, S3, or Swift
  - HTTP based protocol, data is remote
  - Objects are write once, read many, streaming access
  - Objects have some stored metadata
- Data is stored in cloud object storage
  - Could be local or across internet
  - Cheap, high volume
  - Systems utilize triple redundancy or erasure coding, for reliability
- Often uses Hadoop “S3” connector
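The cost difference between triple redundancy and erasure coding is simple arithmetic: raw bytes stored per byte of user data. The sketch below compares the two. The 10 data + 4 parity layout is a hypothetical example, not a figure from the slides or from any specific product.

```python
from fractions import Fraction


def replication_overhead(copies):
    """Raw bytes stored per user byte when keeping `copies` full copies."""
    return Fraction(copies)


def erasure_coding_overhead(data_fragments, parity_fragments):
    """Raw bytes stored per user byte for a data + parity fragment layout."""
    return Fraction(data_fragments + parity_fragments, data_fragments)


print(float(replication_overhead(3)))         # 3.0
print(float(erasure_coding_overhead(10, 4)))  # 1.4
```

Under these assumptions, triple redundancy survives the loss of two copies at 3.0x raw capacity, while the 10 + 4 layout survives the loss of any four fragments at 1.4x.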
38.
Hadoop on Object Storage
[Figure: Hadoop compute nodes accessing cloud object storage over REST/HTTP]
39.
Hadoop Write on Object Storage
40.
Hadoop Read on Object Storage
41.
Object Storage for Hadoop
- Pros
  - Low cost, high volume, reliable storage
  - Good location for infrequently used “WORM” data
  - Public cloud options
  - Scalable storage
  - Data can easily be shared between Hadoop and other applications
- Cons
  - All data is remote, which costs performance
    - No data/compute colocation
  - Limited capabilities, though a good match for Hadoop
  - High disk cost if triple redundancy is used
- Good choice for large infrequently accessed WORM items that may need to be accessed by non-Hadoop jobs as well
42.
Emerging Options
- New options are emerging in the storage research community
  - Caching, from enterprise storage
  - Mirror to enterprise storage, NAS/NFS
  - SSD
- Improvements to HDFS
  - HA options
  - Access to non-Hadoop jobs
- Bottom line
  - The limitations of HDFS are known
  - Work is ongoing to improve Hadoop storage options
43.
Summary
- Hadoop provides a scalable fault-tolerant environment for analyzing unstructured and structured information
- The default way to store data for Hadoop is HDFS on local direct attached disks
- Alternatives to this architecture include SAN array storage, tightly-coupled general purpose DFS, and cloud object storage
  - They can provide some significant advantages
  - However, they aren’t without their downsides; it’s hard to beat a filesystem designed specifically for Hadoop
- Which one is best for you?
  - Depends on what is most important: cost, manageability, compatibility with existing infrastructure, performance, scale, …
44.
Attribution & Feedback
Please send any questions or comments regarding this SNIA Tutorial to tracktutorials@snia.org
The SNIA Education Committee thanks the following individuals for their contributions to this Tutorial.
Authorship History
- Sam Fineberg, August 2012
- Updates:
  - Sam Fineberg, February 2013
  - Sam Fineberg, March 2013
Additional Contributors
- Rob Peglar
- Joseph White
- Chris Santilli