An update on HDF: a status report on The HDF Group, an overview of recent changes to the HDF4 and HDF5 libraries and tools, plans for future releases, and HDF Group projects and collaborations.
6. NASA Commits …
• The HDF Group has received a 3-year contract from
NASA to provide ongoing development and support for
the HDF technologies used by NASA’s Earth Observing
System.
• The project continues the relationship that was first
established in 1994, when HDF was selected as the
standard format for the EOS Data and Information
System (EOSDIS).
• Since that time, over 4 petabytes of mission data and
derived data products have been stored in HDF4 and
HDF5, with an estimated 1.6 million users.
Oct. 16, 2008
HDF and HDF-EOS Workshop XII
6
7. • Under the new contract, The HDF Group will
support NASA’s EOS program in five critical
areas:
− Provide user support to EOS data providers and
data consumers
− Perform software development and quality
assurance
− Assure long-term access to HDF data
− Integrate with complementary technologies and
applications
− Advise follow-on earth systems projects
8. What is
The HDF Group
And why does it exist?
9. History of The HDF Group
• 18 Years at University of Illinois National Center
for Supercomputing Applications
• Spun off from the University in July 2006
• Non-profit
• 20+ scientific, technology, professional staff
• Intellectual property:
− The HDF Group owns HDF4 and HDF5
− HDF formats and libraries to remain open
− BSD-type license
10. The HDF Group Mission
To ensure long-term
accessibility of HDF data
through sustainable
development and support of
HDF technologies.
11. Goals
• Maintain, evolve HDF for sponsors and
communities that depend on it
• Provide consulting, training, tuning,
development, research
• Sustain the group for long term to assure data
access over time
12. The HDF Group Services
• Helpdesk and Mailing Lists
− Available to all users as a first level of support
• Standard Support
− Rapid issue resolution support
• Consulting
− Needs assessment, troubleshooting, design reviews, etc.
• Enterprise Support
− Coordinating HDF activities across departments
• Special Projects
− Adapting customer applications to HDF
− New features and tools, with changes normally incorporated into the open source product
− Research and development
• Training
− Tutorials and hands-on practical experience
13. Members of the HDF support community
• NASA
• Sandia National Laboratories (2)
• University of Illinois/NCSA
• A leading U.S. aerospace company
• NOAA Science Data Stewardship
• New projects and partners
− A major product lifecycle management company
− A bioinformatics software company
− Engineer Research and Development Center – Topographic Engineering Center
− NPOESS
− ITT VIS
14. Initiatives and areas of increased interest
• Bioinformatics
• High performance computing (HPC)
• Microsoft products (HPC, .NET, others)
• Database integration
• Improving concurrency
• Performance and storage efficiency
• Improving high level language support
17. Overview of basic library releases
18. HDF5 1.8.0 (Feb 08)
• Major release with file format changes and
features.
• File format changes affect backward/forward
compatibility with previous releases.
• See “New Features in Release 1.8.0 and Format Compatibility Considerations”:
http://hdfgroup.org/HDF5/doc/ADGuide/CompatFormat180.html
19. HDF5 1.8 minor releases
• 1.8.1 (May 08)
− A minor release with bug fixes
− Provided full 1.8 support for Fortran applications
− Enhanced tools with 1.8.0 features
• HDF5 1.8.2 coming Nov 08
− Minor bug fixes
− Tool enhancements
20. HDF5 1.6 minor releases
• 1.6.7 (Feb 08)
− Modification to address Aura issue
• 1.6.8 coming Nov 08
− Minor bug fixes
21. Future HDF5 releases (highlights)
• Release HDF5 1.10.0
− Performance improvements
− Some new features
− Support for Fortran 2003 features
− Target date: November 2009
• When to drop support for 1.6.* ?
22. HDF 4 minor releases
• 4.2r3 (Feb 08)
− Improved support for apps using HDF4 and NetCDF3
− Improved support for data sets and coordinate variables with the same names
• Release HDF 4.2r4 coming Nov 08
− Minor bug fixes and tool enhancements
− Support for C shared libraries
− Support for a 32-bit version on Mac Intel
• http://hdfgroup.org/products/hdf4/
23. H4-H5 Conversion Software 2.0 (May)
• Rebuilt with HDF5 1.8.1 and HDF 4.2r3.
• Conversion tool h4toh5 enhanced
− Converts HDF-EOS2 files to HDF5 files
− Makes HDF5 files readable by NetCDF4
http://hdfgroup.org/h4toh5/
25. HDF-EOS2 and HDF-EOS5
• Auto configuration for HDF-EOS2 and HDF-EOS5
− Compile and test libraries with automatic
configuration tools
− Thank you, Abe!
• Testing of EOS2 and EOS5
− Test daily with HDF4 and HDF5 development code
− Periodically test on EOS-critical platforms
• EOS website support
27. h5check 1.0 (March 2008)
• A validation tool that verifies whether an HDF5 file is encoded according to the HDF5 File Format Specification.
• Helps ensure format integrity and long-term compatibility between versions of the HDF5 library.
• By default, the file is verified against the 1.8.x format. Can also verify against 1.6.x.
28. Major Improvements for Existing Tools
• Improved handling of large datasets by h5diff, h5repack, hdiff, and hrepack
• Other added capabilities:
− h5import: import strings
− h5diff: handle NaN values
− h5dump: dump objects in requested order
− h5repack:
• Apply multiple filters to all objects
• Add a userblock
• Align datasets in the file at byte offsets that support efficient access
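The alignment option above boils down to rounding each dataset's file offset up to a multiple of a chosen boundary. A minimal sketch of that arithmetic (the `align` helper and the sizes are illustrative, not h5repack's actual code):

```python
def align(offset: int, boundary: int) -> int:
    """Round offset up to the next multiple of boundary (boundary > 0)."""
    return ((offset + boundary - 1) // boundary) * boundary

# Placing three datasets back-to-back with 4096-byte alignment:
sizes = [1000, 5000, 300]
offset = 0
placed = []
for size in sizes:
    offset = align(offset, 4096)   # skip forward to the next boundary
    placed.append(offset)
    offset += size

print(placed)  # [0, 4096, 12288]
```

Aligned offsets let the file system serve each dataset with whole-block reads, which is what makes access more efficient.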
29. In the works: h52jpeg
• Converts datasets in an HDF5 file to a JPEG image.
• A prototype is available if you are interested.
30. Please send us your
comments and requests
regarding the HDF4 and
HDF5 library and tools
32. HDF Java
• HDF-Java 2.5 release
− Beta 1 Release Feb 08
− Full release planned for Dec. 2008
• HDF5 JNI updated for HDF5 1.8.x (with a 1.6 compatibility flag)
• Binaries for 32-bit Linux and 64-bit Solaris
• Daily testing added for HDF-Java products
33. Also in the pipeline
• Full Java Support for HDF5 1.8.x
− Add and test new functions in Java wrapper
− Implement and test new functions in C JNI
− Use new functions in HDF-Java objects
• Add many new features
• Improve performance
• Revise HDFView User’s Guide
35. Surviving a System Failure
36. Surviving a System Failure in HDF5
• Problem:
− In the event of an application or system crash, data
in HDF5 files are susceptible to corruption
− Corruption can occur if structural metadata is being
written when the crash occurs
• Initial Objective:
− Guarantee an HDF5 file with consistent metadata
can be reconstructed in the event of a crash
− No guarantee on state of raw data – contains
whatever data made it to disk prior to crash
38. Faster HDF5 Data Appends
39. Fast Data Appends
• Problem: Metadata operations limit the rate at
which HDF5 can append data to datasets.
• Solution: new data structure for indexing chunks:
− Allows constant time extend, shrink and lookup of
chunks in datasets with single unlimited dimension
− # of metadata I/O operations to append to dataset
is independent of # of chunks
− Also allows single-writer/multiple-reader access
• Details at:
http://hdfgroup.uiuc.edu/RFC/HDF5/ReviseChunks/
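The idea behind the new index can be sketched with a toy model: chunk addresses for a dataset with a single unlimited dimension live in a growable array, so appending a chunk and looking one up need no tree traversal. This is a conceptual illustration only; the class and names are invented, not HDF5's actual data structure:

```python
class ExtensibleArrayIndex:
    """Toy chunk index for a dataset with one unlimited dimension.

    Chunk addresses are kept in a growable array, so extending the
    dataset by a chunk and looking a chunk up are both O(1) --
    no B-tree traversal, and the number of metadata operations per
    append does not depend on how many chunks already exist.
    """

    def __init__(self, chunk_size: int):
        self.chunk_size = chunk_size
        self.addrs = []                   # addrs[i] = file address of chunk i

    def append_chunk(self, file_addr: int) -> None:
        self.addrs.append(file_addr)      # O(1) extend

    def lookup(self, element_index: int) -> int:
        """Map an element index to the file address of its chunk."""
        return self.addrs[element_index // self.chunk_size]  # O(1) lookup

idx = ExtensibleArrayIndex(chunk_size=100)
for n in range(10):
    idx.append_chunk(file_addr=4096 + n * 800)

print(idx.lookup(250))  # element 250 lives in chunk 2 -> address 5696
```

A B-tree index, by contrast, pays O(log n) node visits for the same operations, which is exactly the overhead the RFC's revised chunk indexing removes for the single-unlimited-dimension case.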
41. HDF Performance Framework
• A tool for:
− Testing on multiple platforms
− Testing different versions
− Long-term regression testing
− Assistance in debugging
• New for 1.8:
− API and format versioning
− Improved reporting interfaces
• Future related work
− Quality monitoring of the software, such as code
coverage, memory usage
43. Library Features
• Improved external link support
− External link: a link to an HDF5 object in another file
− Easier specification of the search path for external files
− Adding external link support to h5ls and h5dump
• Time datatype improvements
− Expand the time type to better support native formats
− Adapt tools to display them properly
• Port to OpenVMS (limited support)
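Conceptually, the improved path lookup resolves an external link's target file by trying a list of prefixes in order. A toy sketch of that resolution (the function name and behavior are illustrative, not the HDF5 API itself):

```python
import os
import tempfile

def resolve_external_link(target_file: str, prefixes) -> str:
    """Return the first existing path formed by joining a prefix with
    the link's target file name; fall back to the target as given."""
    for prefix in prefixes:
        candidate = os.path.join(prefix, target_file)
        if os.path.exists(candidate):
            return candidate
    return target_file

# Demo: the linked-to file lives in a "data" directory, not the CWD.
root = tempfile.mkdtemp()
data_dir = os.path.join(root, "data")
os.mkdir(data_dir)
open(os.path.join(data_dir, "ext.h5"), "w").close()

found = resolve_external_link("ext.h5", [os.getcwd(), data_dir])
print(found)  # .../data/ext.h5
```

This is why external links are handy: one master file can refer to objects scattered across many files, and a configurable search path lets those files move without rewriting the links.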
44. Improving performance
• Faster file free-space management while the file is open
− Many transactions can create many holes
− Free-space management recovers unused space
− Up to 38x improvement in experiments
• Direct I/O: file I/O goes directly between the application and storage, bypassing operating system read and write caches
• Disabling automatic metadata cache flushing
− In experiments, direct I/O combined with disabling metadata cache flushing improved I/O speed by about 2x
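The free-space idea can be illustrated with a toy allocator that records the holes left by freed blocks and reuses them (first fit) before growing the end of file. All names here are invented for illustration; HDF5's actual free-space manager is more sophisticated:

```python
class FreeSpaceManager:
    """Toy file-space allocator: tracks holes left by deleted blocks
    and reuses them before extending the end of file (EOF)."""

    def __init__(self):
        self.eof = 0
        self.holes = []            # list of (address, size)

    def allocate(self, size: int) -> int:
        for i, (addr, hole) in enumerate(self.holes):
            if hole >= size:       # reuse a hole instead of growing the file
                if hole == size:
                    self.holes.pop(i)
                else:              # keep the leftover tail of the hole
                    self.holes[i] = (addr + size, hole - size)
                return addr
        addr = self.eof            # no hole fits: extend the file
        self.eof += size
        return addr

    def free(self, addr: int, size: int) -> None:
        self.holes.append((addr, size))

fs = FreeSpaceManager()
a = fs.allocate(512)      # placed at 0
b = fs.allocate(256)      # placed at 512
fs.free(a, 512)           # deleting leaves a hole at 0
c = fs.allocate(128)      # reuses the hole -> 0; the file does not grow
print((a, b, c, fs.eof))  # (0, 512, 0, 768)
```

Without such reuse, every transaction that deletes and reallocates space keeps pushing the EOF outward, which is how heavily edited files accumulate holes.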
47. Three “remote access” projects
• HDF5-OPeNDAP handler
− See talk by Kent Yang: “HDF5 OPeNDAP project
update and demo”
• HDF5-iRODS integration
− See Peter Cao’s talk Thursday: “HDF5 iRODS”
• Accessing HDF5 through SSHFS-FUSE
48. Accessing HDF5 through SSHFS-FUSE
• Access to files on remote NFS systems is limited
• Combine FUSE (Filesystem in Userspace) with SSHFS (Secure Shell File System)
− FUSE gives an application a local view of a remote file system
− SSHFS provides another way to mount a remote file system, letting the local file system access parts of a remote file
− e.g., a “read” operation on the remote file system can be served through SSH
• Subsetting can be done efficiently with SSHFS
− Extracting a dataset (5 MB) from a 96 MB HDF5 file:
• Download whole file + subset locally: 9.85 seconds
• Subset with SSHFS: 0.47 seconds
• Technical report in the works
49. HDF4 Layout Map Project
• Problem
− Long-term readability of HDF data dependent on
long-term availability of HDF software
• Proposed solution
− Create a map of the layout of data objects in an
HDF file, allowing a simple reader to be written to
access the data
• See today’s talk by Folk and Duerr: “Ensuring
Long Term Access to Remotely Sensed HDF4
Data with Layout Maps.”
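The layout-map idea can be sketched in a few lines: if a map records where each object's raw bytes live in the file, a reader needs nothing but seek and read, no HDF library at all. The map format and names below are invented for illustration:

```python
import io

# A toy "layout map": where each object's raw data lives in the file.
# (Names, offsets, and the map format are invented for illustration.)
layout_map = {
    "Temperature": {"offset": 4, "length": 5},
    "Pressure":    {"offset": 9, "length": 3},
}

# A stand-in for an HDF file: 4 header bytes, then the objects' bytes.
container = io.BytesIO(b"HDR!" + b"temp!" + b"prs")

def read_object(f, entry):
    """Extract one object's raw bytes using only seek/read --
    no format-specific software required."""
    f.seek(entry["offset"])
    return f.read(entry["length"])

print(read_object(container, layout_map["Temperature"]))  # b'temp!'
print(read_object(container, layout_map["Pressure"]))     # b'prs'
```

Because such a reader is trivial to rewrite from scratch, the data stays accessible even if the original HDF software is no longer available.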
50. HDF and .NET Framework
• Prototype .NET wrappers for HDF5 1.8.0
− Based on a subset of the HDF5 C routines
• Released in March 2008
• Unsupported
− Considerable interest, but currently no funding to
support or maintain
− Use hdf-forum email list for questions
52. Investigation of HDF
Support in Some Open
Source Software Packages
53. Five open source packages
• PyHDF
− Python interface to HDF4
− http://pysclint.sourceforge.net/pyhdf/
• Geospatial Data Abstraction Library (GDAL)
− Translator library for raster geospatial data formats
− Supports about 100 file formats
− http://gdal.org/
• NCAR Command Language (NCL)
− Interpreted language for data analysis and visualization
− http://ncl.ucar.edu/
• Grid Analysis and Display System (GrADS)
− Interpreted language for data analysis and visualization
− http://iges.org/grads/
• GNU Data Language (GDL)
− Interpreted language for data analysis and visualization
− http://gnudatalanguage.sourceforge.net/
54. Evaluation criteria
• Formats
− HDF4, HDF5, netCDF
− Objects supported in each language
• Installation
− Availability of binaries
− Other requirements
• Adequacy of documentation
• Technical report available soon.
56. Maintenance & Testing with VMWare
• Multiple virtual machines run in parallel
• Only relevant software installed
• Each represents a supported configuration
• Run nightly tests of HDF4, HDF5
• Each is powered on, tested, and cleaned automatically
• Technical report available soon.
57. HDF5 Data Transform Pilot Study
• Tools for Flight Test Data
• Framework to define and apply transformations
to data being read
• Transformations specified in Python
58. Science Data Stewardship
• Goal: migrate data to a single standards-based archive
format.
• Approach: investigate how to store NASA ECS data and
metadata in HDF5 Archival Information Packages (AIP).
• See talk by Yang, Duerr et al: “Using HDF5 Archive
Information Package to preserve HDF-EOS2 data”
60. Acknowledgements
This report is based upon work supported in part
by a Cooperative Agreement with the National
Aeronautics and Space Administration (NASA)
under NASA Awards NNX06AC83A and
NNX08AO77A.
Any opinions, findings, and conclusions or
recommendations expressed in this material are
those of the author(s) and do not necessarily
reflect the views of the National Aeronautics and
Space Administration.
Mike Folk, president and CEO of The HDF Group, comments: “Our close collaboration with NASA’s EOS program has been a model for the kind of partnership we strive to establish with all HDF customers—one where the users depend on and support a quality product and influence future developments. We are pleased and proud to continue our work with NASA in their mission to serve the Earth Science community.”
Why
− Increasing need for support, services, quick response
− Not a good model for a university R&D project
Who
− 11 software engineers and several students: develop and maintain HDF software, work on special projects, manage projects
− 3 tech support staff: helpdesk, documentation, sysadmin
− Management team:
• President
• Director of Technical Services and Operations
• Director of Software Development
• Director of Business Operations
• Managers responsible for tools, applications
Other THG staff include seven full-time software engineers who develop and maintain the HDF software and work on special projects, and three technical support staff who provide helpdesk support, documentation, and system administration. The HDF Group also generally employs students from the University's Computer Science and Engineering departments.
The R&D mission
− Maintain and evolve HDF for high-end science apps
− Maintain HDF4, HDF5, and tools at supercomputing centers and on the TeraGrid
− Support academic science
− Cutting-edge data management research
− Adapt to leading-edge, experimental architectures
− Integrate with new middleware technologies and parallel file systems
The “Support and Sustain” mission
− Maintain and evolve HDF for communities and sponsors
− Provide proprietary consulting, tuning, development
− Sustain the group for the long term; maintain data access over time
Goal: help HDF users who rely on IDL get timely access to improved HDF libraries.
The HDF Group and ITT VIS* collaborate to improve the process of integrating new versions of HDF with IDL.
ITT VIS has provided The HDF Group with IDL software and licenses.
The collaboration also lets us enable IDL clients to access HDF5 files on remote servers via OPeNDAP.
*ITT Visual Information Solutions (makers of IDL and ENVI).
Please mention here that HDF5 maintenance releases are on a half-year basis and HDF4 maintenance releases are on a yearly basis; i.e., the next maintenance releases of HDF5 1.6 and 1.8 will be in May 2009, and the next HDF4 release in November 2009.
Possible performance improvements in 1.10:
− Free-space management (non-persisting; persisting possible, not certain)
− Revised chunking
− Fast append
? From Quincey
Testing not only helps find bugs inside HDF library but also finds bugs in EOS test programs
Approach: Metadata Journaling
When an HDF5 file is opened with Metadata Journaling enabled, a companion Journal file is created.
When an HDF5 API function that modifies metadata is completed, a transaction is recorded in the Journal file.
If the application crashes, a recovery program can replay the journal by applying in order all metadata writes until the end of the last completed transaction written to the journal file.
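The journaling scheme described above can be sketched as a toy write-ahead log: completed transactions are appended to a journal, and recovery replays them in order, stopping at the first incomplete one. The record format and names are invented for illustration, not HDF5's actual journal layout:

```python
# Toy write-ahead journal for metadata: each completed transaction is
# recorded in a companion journal, so after a crash a recovery step
# can rebuild consistent metadata by replaying complete transactions.
journal = []            # stands in for the companion journal file
metadata = {}           # stands in for metadata structures in the HDF5 file

def commit(txn_id, writes):
    """Record a completed transaction: a list of (key, value) writes."""
    journal.append({"txn": txn_id, "writes": writes, "complete": True})

def replay(journal, metadata):
    """Recovery: apply, in order, every write of each COMPLETE
    transaction; stop at the first incomplete one."""
    for record in journal:
        if not record["complete"]:
            break
        for key, value in record["writes"]:
            metadata[key] = value
    return metadata

commit(1, [("root/group_count", 1), ("root/attr", "x")])
commit(2, [("root/group_count", 2)])
# A crash mid-transaction leaves an incomplete record behind:
journal.append({"txn": 3, "writes": [("root/attr", "y")], "complete": False})

print(replay(journal, {}))  # {'root/group_count': 2, 'root/attr': 'x'}
```

Note how transaction 3's partial write never reaches the recovered metadata, which is the consistency guarantee the journaling design is after; raw data gets no such guarantee.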
Serial HDF5 with synchronous write mode
− Finalize user interface definitions and file format
Serial HDF5 with asynchronous write mode
− Improve journal file write speed
More features (need funding)
− Make raw data operations atomic
− Allow "super-transactions" to be created by applications
− Enable journaling for parallel HDF5
Is it limited only to unlimited / chunked datasets? Or is it that way for all, but we're just fixing it for the limited / unchunked cases?
Contrasts with the B-tree index:
− B-tree has O(log n) extend, shrink, and lookup of chunks
− B-tree incurs a ~logarithmic number of metadata I/O operations as chunks are appended
Will also be optimizing chunked dataset indexing for datasets with no unlimited dimensions (with an array index) and with multiple unlimited dimensions (with a v2 B-tree) as part of the project in the next year.
Say why external links are useful.
Direct I/O is a feature of the file system whereby file reads and writes go directly from the applications to the storage device, bypassing the operating system read and write caches. Direct I/O is used by only a few applications that manage their own caches, such as databases.