SlideShare uma empresa Scribd logo
1 de 11
HDF for the Cloud 1
John Readey
The HDF5 data format 2
• Established 20 years ago the HDF5 file format is the most commonly used
format in Earth Science
• Note: NetCDF4 files are actually HDF5 “under the hood”
• HDF5 was designed with the (somewhat contradictory) goals of:
• Archival format – data that can stored for decades
• Analysis Ready -- data that can be directly utilized for analytics (no
conversion needed)
• There’s a rich set of tools and language SDKs:
• C/C++/Fortran
• Python
• Java, etc.
HDF5 File Format meets the Cloud 3
• Storing large HDF5 collection on AWS is almost always about utilizing S3:
• Cost effective
• Redundant
• Sharable
• It’s easy enough to store HDF5 files as S3 objects, but these files can’t be
read using the HDF5 library (which is expecting a POSIX filesystem)
• Experience using FUSE to read from S3 using HDF5Library has not tended
to work so well
• In practice users have been left with copying files to local disk first
• This has led to interest in alternative formats such as Zarr, TileDB, and
our own HSDS S3 Storage Schema (more on that later)
HDF5 meets S3 halfway… 4
• For many years the HDF5 library has supported VFDs “Virtual File Driver”
• VFDs are low-level plugins that can replace the standard POSIX IO
methods with anything the developer of the VFD would like
• The HDF Group has developed a VFD specifically for S3 that will be included
in the next library release (coming soon!)
• How it works: Each POSIX read call is replaced with a S3 Range GET
• Features:
• Can read any HDF5 file (write is not supported)
• No changes to the public API
• Compatible with higher-level libraries (h5py, netcdf, xarray, etc.)
• This is a first release and there are some ideas for improving performance in
subsequent releases
• It will be very helpful to come up with an objective set of benchmarks to
compare performance between S3VFD, HSDS, Zarr, etc.
Cloud Optimized HDF
• For anyone putting HDF5 files on S3 for in-place reading, there are a few things that
can be done to improve performance when accessed using the S3VFD (or FUSE)
• Most of these optimizations can be done using existing tools (e.g. h5repack)
• A Cloud Optimized HDF5 files is still an HDF5 file and can be downloaded and read
with native VFD if desired
• Initial Proposal (likely to be revised based on testing):
• Use chunking for datasets larger than 1MB
• Use “brick style” chunk layouts (enable slicing via any dimension)
• Use readily available compression filters
• Pack metadata in front of file
• Aggregate smaller files into larger ones
HDF Server 6
• HSDS (now HDF Kita Server) is a REST based service for HDF data developed by the HDF
Group
• Think of it as HDF gone cloud native. 
• HSDS Features:
• Runs as a set of containers on Kubernetes – so can scale beyond one machine
• Requests can be parallelized across multiple containers
• Feature compatible with the HDF library but is independent code base
• Supports multiple readers/writers
• Uses S3 as data store
• Available now as part of HDF Kita Lab (our hosted Jupyter environment):
https://hdflab.hdfgroup.org
• Will be available on AWS Marketplace soon
HDF Cloud Schema
Big Idea: Map individual
HDF5 objects (datasets,
groups, chunks) as Object
Storage Objects
• Limit maximum storage object size
• Support parallelism for read/write
• Only data that is modified needs to be
updated
• Multiple clients can be reading/updating
the same “file”
Legend:
• Dataset is partitioned into
chunks
• Each chunk stored as an S3
object
• Dataset meta data (type, shape,
attributes, etc.) stored in a
separate object (as JSON text)
How to store HDF5 content in S3?
Each chunk (heavy outlines) get
persisted as a separate object
8Dataset JSON Example
creationProperties contains HDF5 dataset
creation property list settings.
Id is the objects UUID.
Layout represents HDF5 dataspace.
Root points back to the root group
Created & lastModified are timestamps
type represents HDF5 datatype.
attributes holds a list of HDF5 attribute
JSON objects.
{
"creationProperties": {},
"id": "d-9a097486-58dd-11e8-a964-
0242ac110009",
"layout": {"dims": [10], "class":
"H5D_CHUNKED"},
"root": "g-952b0bfa-58dd-11e8-a964-
0242ac110009",
"created": 1526456944,
"lastModified": 1526456944,
"shape": {"dims": [10], "class":
"H5S_SIMPLE"},
"type": {"base": "H5T_STD_I32LE",
"class": "H5T_INTEGER"},
"attributes": {}
}
Schema Details 9
• Key dispersal
• Objects are stored “flat” – no hierarchy
• UUIDs have a 5 char hash added to the front
• Idea is the evenly distribute objects across S3 storage nodes to improve
performance
• S3 partitions objects by first few characters of the key name
• Each storage node is limited to about 300 req/s
• There’s no list of chunks
• Chunk key is determined based on chunk position in the data space
• E.g. c-<uuid>_0_0_0 Is the corner chunk of a 3-dimensional dataset
• Chunk objects get created as needed on first write
• Schema is currently used just by HDF Server, but could just as easily be
used directly by clients (assuming that writes don’t conflict)
Supporting traditional HDF5 files 1
0
• Downside of the HDF S3 Schema is that data needs be transmogrified
• Since the bulk of the data is usually the chunk data it makes sense to
combine the ideas of the S3 Schema and S3VFD:
• Convert just the metadata of the source HDF5 file to the S3 Schema
• Store the source file as a S3 object
• For data reads, metadata provides offset and length into the HDF5 file
• S3 Range GET returns needed data
• This approach can be used either directly or with HDF Server
• Compared with the pure S3VFD approach, you reduce the number of S3
requests needed
• Work on supporting this is planned for later this year
References 1
1
• HDF Schema:
https://s3.amazonaws.com/hdfgroup/docs/obj_store_schema.pdf
• SciPy2017 talk:
https://s3.amazonaws.com/hdfgroup/docs/hdf_data_services_scipy2017.pdf
• AWS Big Data Blog article: https://aws.amazon.com/blogs/big-data/power-
from-wind-open-data-on-aws/
• AWS S3 Performance guidelines:
https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-
considerations.html

Mais conteúdo relacionado

Mais procurados

Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolutionDataWorks Summit
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...Alluxio, Inc.
 
Hadoop 2 cluster architecture
Hadoop 2 cluster architectureHadoop 2 cluster architecture
Hadoop 2 cluster architectureSandeep Patil
 

Mais procurados (20)

H5Coro: The Cloud-Optimized Read-Only Library
H5Coro: The Cloud-Optimized Read-Only LibraryH5Coro: The Cloud-Optimized Read-Only Library
H5Coro: The Cloud-Optimized Read-Only Library
 
Integrating HDF5 with SRB
Integrating HDF5 with SRBIntegrating HDF5 with SRB
Integrating HDF5 with SRB
 
MATLAB and Scientific Data: New Features and Capabilities
MATLAB and Scientific Data: New Features and CapabilitiesMATLAB and Scientific Data: New Features and Capabilities
MATLAB and Scientific Data: New Features and Capabilities
 
Product Designer Hub - Taking HPD to the Web
Product Designer Hub - Taking HPD to the WebProduct Designer Hub - Taking HPD to the Web
Product Designer Hub - Taking HPD to the Web
 
HDF5 <-> Zarr
HDF5 <-> ZarrHDF5 <-> Zarr
HDF5 <-> Zarr
 
HDFEOS.org User Analsys, Updates, and Future
HDFEOS.org User Analsys, Updates, and FutureHDFEOS.org User Analsys, Updates, and Future
HDFEOS.org User Analsys, Updates, and Future
 
HDF-EOS 2/5 to netCDF Converter
HDF-EOS 2/5 to netCDF ConverterHDF-EOS 2/5 to netCDF Converter
HDF-EOS 2/5 to netCDF Converter
 
Data Analytics using MATLAB and HDF5
Data Analytics using MATLAB and HDF5Data Analytics using MATLAB and HDF5
Data Analytics using MATLAB and HDF5
 
MATLAB Modernization on HDF5 1.10
MATLAB Modernization on HDF5 1.10MATLAB Modernization on HDF5 1.10
MATLAB Modernization on HDF5 1.10
 
HDFS Analysis for Small Files
HDFS Analysis for Small FilesHDFS Analysis for Small Files
HDFS Analysis for Small Files
 
HDF Product Designer: Using Templates to Achieve Interoperability
HDF Product Designer: Using Templates to Achieve InteroperabilityHDF Product Designer: Using Templates to Achieve Interoperability
HDF Product Designer: Using Templates to Achieve Interoperability
 
Open-source Scientific Computing and Data Analytics using HDF
Open-source Scientific Computing and Data Analytics using HDFOpen-source Scientific Computing and Data Analytics using HDF
Open-source Scientific Computing and Data Analytics using HDF
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolution
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
Moving form HDF4 to HDF5/netCDF-4
Moving form HDF4 to HDF5/netCDF-4Moving form HDF4 to HDF5/netCDF-4
Moving form HDF4 to HDF5/netCDF-4
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
 
Hadoop 2 cluster architecture
Hadoop 2 cluster architectureHadoop 2 cluster architecture
Hadoop 2 cluster architecture
 

Semelhante a HDF for the Cloud

A quick start guide to using HDF5 files in GLOBE Claritas
A quick start guide to using HDF5 files in GLOBE ClaritasA quick start guide to using HDF5 files in GLOBE Claritas
A quick start guide to using HDF5 files in GLOBE ClaritasGuy Maslen
 
Hadoop Distributed file system.pdf
Hadoop Distributed file system.pdfHadoop Distributed file system.pdf
Hadoop Distributed file system.pdfvishal choudhary
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete informationbhargavi804095
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inRahulBhole12
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDYVenneladonthireddy1
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiridatastack
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolutionDataWorks Summit
 

Semelhante a HDF for the Cloud (20)

Accessing HDF5 data in the cloud with HSDS
Accessing HDF5 data in the cloud with HSDSAccessing HDF5 data in the cloud with HSDS
Accessing HDF5 data in the cloud with HSDS
 
HDF Cloud Services
HDF Cloud ServicesHDF Cloud Services
HDF Cloud Services
 
HDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the CloudHDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the Cloud
 
Highly Scalable Data Service (HSDS) Performance Features
Highly Scalable Data Service (HSDS) Performance FeaturesHighly Scalable Data Service (HSDS) Performance Features
Highly Scalable Data Service (HSDS) Performance Features
 
Performance Tuning in HDF5
Performance Tuning in HDF5 Performance Tuning in HDF5
Performance Tuning in HDF5
 
Cloud-Optimized HDF5 Files
Cloud-Optimized HDF5 FilesCloud-Optimized HDF5 Files
Cloud-Optimized HDF5 Files
 
HDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF ServiceHDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF Service
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
A quick start guide to using HDF5 files in GLOBE Claritas
A quick start guide to using HDF5 files in GLOBE ClaritasA quick start guide to using HDF5 files in GLOBE Claritas
A quick start guide to using HDF5 files in GLOBE Claritas
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Hadoop Distributed file system.pdf
Hadoop Distributed file system.pdfHadoop Distributed file system.pdf
Hadoop Distributed file system.pdf
 
HDF Update
HDF UpdateHDF Update
HDF Update
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation in
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiri
 
Introduction to HDF5 Data Model, Programming Model and Library APIs
Introduction to HDF5 Data Model, Programming Model and Library APIsIntroduction to HDF5 Data Model, Programming Model and Library APIs
Introduction to HDF5 Data Model, Programming Model and Library APIs
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolution
 

Mais de The HDF-EOS Tools and Information Center

STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...The HDF-EOS Tools and Information Center
 

Mais de The HDF-EOS Tools and Information Center (16)

The State of HDF
The State of HDFThe State of HDF
The State of HDF
 
Creating Cloud-Optimized HDF5 Files
Creating Cloud-Optimized HDF5 FilesCreating Cloud-Optimized HDF5 Files
Creating Cloud-Optimized HDF5 Files
 
HDF5 OPeNDAP Handler Updates, and Performance Discussion
HDF5 OPeNDAP Handler Updates, and Performance DiscussionHDF5 OPeNDAP Handler Updates, and Performance Discussion
HDF5 OPeNDAP Handler Updates, and Performance Discussion
 
Hyrax: Serving Data from S3
Hyrax: Serving Data from S3Hyrax: Serving Data from S3
Hyrax: Serving Data from S3
 
Accessing Cloud Data and Services Using EDL, Pydap, MATLAB
Accessing Cloud Data and Services Using EDL, Pydap, MATLABAccessing Cloud Data and Services Using EDL, Pydap, MATLAB
Accessing Cloud Data and Services Using EDL, Pydap, MATLAB
 
HDF - Current status and Future Directions
HDF - Current status and Future DirectionsHDF - Current status and Future Directions
HDF - Current status and Future Directions
 
Apache Drill and Unidata THREDDS Data Server for NASA HDF-EOS on S3
Apache Drill and Unidata THREDDS Data Server for NASA HDF-EOS on S3Apache Drill and Unidata THREDDS Data Server for NASA HDF-EOS on S3
Apache Drill and Unidata THREDDS Data Server for NASA HDF-EOS on S3
 
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
 
HDF5 Roadmap 2019-2020
HDF5 Roadmap 2019-2020HDF5 Roadmap 2019-2020
HDF5 Roadmap 2019-2020
 
Leveraging the Cloud for HDF Software Testing
Leveraging the Cloud for HDF Software TestingLeveraging the Cloud for HDF Software Testing
Leveraging the Cloud for HDF Software Testing
 
Google Colaboratory for HDF-EOS
Google Colaboratory for HDF-EOSGoogle Colaboratory for HDF-EOS
Google Colaboratory for HDF-EOS
 
HDF-EOS Data Product Developer's Guide
HDF-EOS Data Product Developer's GuideHDF-EOS Data Product Developer's Guide
HDF-EOS Data Product Developer's Guide
 
HDF Status Update
HDF Status UpdateHDF Status Update
HDF Status Update
 
NASA Terra Data Fusion
NASA Terra Data FusionNASA Terra Data Fusion
NASA Terra Data Fusion
 
S3 VFD
S3 VFDS3 VFD
S3 VFD
 
HDF Data in the Cloud
HDF Data in the CloudHDF Data in the Cloud
HDF Data in the Cloud
 

Último

Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...Akihiro Suda
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 

Último (20)

Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 

HDF for the Cloud

  • 1. HDF for the Cloud 1 John Readey
  • 2. The HDF5 data format 2 • Established 20 years ago the HDF5 file format is the most commonly used format in Earth Science • Note: NetCDF4 files are actually HDF5 “under the hood” • HDF5 was designed with the (somewhat contradictory) goals of: • Archival format – data that can stored for decades • Analysis Ready -- data that can be directly utilized for analytics (no conversion needed) • There’s a rich set of tools and language SDKs: • C/C++/Fortran • Python • Java, etc.
  • 3. HDF5 File Format meets the Cloud 3 • Storing large HDF5 collection on AWS is almost always about utilizing S3: • Cost effective • Redundant • Sharable • It’s easy enough to store HDF5 files as S3 objects, but these files can’t be read using the HDF5 library (which is expecting a POSIX filesystem) • Experience using FUSE to read from S3 using HDF5Library has not tended to work so well • In practice users have been left with copying files to local disk first • This has led to interest in alternative formats such as Zarr, TileDB, and our own HSDS S3 Storage Schema (more on that later)
  • 4. HDF5 meets S3 halfway… 4 • For many years the HDF5 library has supported VFDs “Virtual File Driver” • VFDs are low-level plugins that can replace the standard POSIX IO methods with anything the developer of the VFD would like • The HDF Group has developed a VFD specifically for S3 that will be included in the next library release (coming soon!) • How it works: Each POSIX read call is replaced with a S3 Range GET • Features: • Can read any HDF5 file (write is not supported) • No changes to the public API • Compatible with higher-level libraries (h5py, netcdf, xarray, etc.) • This is a first release and there are some ideas for improving performance in subsequent releases • It will be very helpful to come up with an objective set of benchmarks to compare performance between S3VFD, HSDS, Zarr, etc.
  • 5. Cloud Optimized HDF • For anyone putting HDF5 files on S3 for in-place reading, there are a few things that can be done to improve performance when accessed using the S3VFD (or FUSE) • Most of these optimizations can be done using existing tools (e.g. h5repack) • A Cloud Optimized HDF5 files is still an HDF5 file and can be downloaded and read with native VFD if desired • Initial Proposal (likely to be revised based on testing): • Use chunking for datasets larger than 1MB • Use “brick style” chunk layouts (enable slicing via any dimension) • Use readily available compression filters • Pack metadata in front of file • Aggregate smaller files into larger ones
  • 6. HDF Server 6 • HSDS (now HDF Kita Server) is a REST based service for HDF data developed by the HDF Group • Think of it as HDF gone cloud native.  • HSDS Features: • Runs as a set of containers on Kubernetes – so can scale beyond one machine • Requests can be parallelized across multiple containers • Feature compatible with the HDF library but is independent code base • Supports multiple readers/writers • Uses S3 as data store • Available now as part of HDF Kita Lab (our hosted Jupyter environment): https://hdflab.hdfgroup.org • Will be available on AWS Marketplace soon
  • 7. HDF Cloud Schema Big Idea: Map individual HDF5 objects (datasets, groups, chunks) as Object Storage Objects • Limit maximum storage object size • Support parallelism for read/write • Only data that is modified needs to be updated • Multiple clients can be reading/updating the same “file” Legend: • Dataset is partitioned into chunks • Each chunk stored as an S3 object • Dataset meta data (type, shape, attributes, etc.) stored in a separate object (as JSON text) How to store HDF5 content in S3? Each chunk (heavy outlines) get persisted as a separate object
  • 8. 8Dataset JSON Example creationProperties contains HDF5 dataset creation property list settings. Id is the objects UUID. Layout represents HDF5 dataspace. Root points back to the root group Created & lastModified are timestamps type represents HDF5 datatype. attributes holds a list of HDF5 attribute JSON objects. { "creationProperties": {}, "id": "d-9a097486-58dd-11e8-a964- 0242ac110009", "layout": {"dims": [10], "class": "H5D_CHUNKED"}, "root": "g-952b0bfa-58dd-11e8-a964- 0242ac110009", "created": 1526456944, "lastModified": 1526456944, "shape": {"dims": [10], "class": "H5S_SIMPLE"}, "type": {"base": "H5T_STD_I32LE", "class": "H5T_INTEGER"}, "attributes": {} }
  • 9. Schema Details 9 • Key dispersal • Objects are stored “flat” – no hierarchy • UUIDs have a 5 char hash added to the front • Idea is the evenly distribute objects across S3 storage nodes to improve performance • S3 partitions objects by first few characters of the key name • Each storage node is limited to about 300 req/s • There’s no list of chunks • Chunk key is determined based on chunk position in the data space • E.g. c-<uuid>_0_0_0 Is the corner chunk of a 3-dimensional dataset • Chunk objects get created as needed on first write • Schema is currently used just by HDF Server, but could just as easily be used directly by clients (assuming that writes don’t conflict)
  • 10. Supporting traditional HDF5 files 1 0 • Downside of the HDF S3 Schema is that data needs be transmogrified • Since the bulk of the data is usually the chunk data it makes sense to combine the ideas of the S3 Schema and S3VFD: • Convert just the metadata of the source HDF5 file to the S3 Schema • Store the source file as a S3 object • For data reads, metadata provides offset and length into the HDF5 file • S3 Range GET returns needed data • This approach can be used either directly or with HDF Server • Compared with the pure S3VFD approach, you reduce the number of S3 requests needed • Work on supporting this is planned for later this year
  • 11. References 1 1 • HDF Schema: https://s3.amazonaws.com/hdfgroup/docs/obj_store_schema.pdf • SciPy2017 talk: https://s3.amazonaws.com/hdfgroup/docs/hdf_data_services_scipy2017.pdf • AWS Big Data Blog article: https://aws.amazon.com/blogs/big-data/power- from-wind-open-data-on-aws/ • AWS S3 Performance guidelines: https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf- considerations.html

Notas do Editor

  1. Many users of HDF5 are now migrating data archives to public or private cloud systems. The access approaches and performance characteristics of cloud storage are fundamentally different than traditional data storage systems because 1) the data are accessed over http and 2) the data are stored in an object store and identified using unique keys. There are many different ways to organize and access data in the cloud. The HDF Group is currently exploring and developing approaches that will facilitate migration to the cloud and support many existing HDF5 data access use cases. Our goal is to protect data providers and users from disruption as their data and applications are migrated to the cloud.
  2. This idea has been kicking around for a while, but storing potentially millions of files on a Linux filesystem would be problematic. Using S3 as the storage vehicle is a natural fit since there’s no limit to the number of objects in a bucket. With NREL we’ve validated this approach to 50 TB’s of data over 27MM objects (see aws-big-data blog article: https://aws.amazon.com/blogs/big-data/power-from-wind-open-data-on-aws/ )