We consider the challenge of building data management systems that meet an important requirement of today's data-intensive HPC applications: providing high I/O throughput while supporting highly concurrent data accesses. In this context, many applications rely on MPI-IO and require atomic, non-contiguous I/O operations that concurrently access shared data. In most existing implementations, the atomicity requirement is met through locking-based schemes, which have proven inefficient, especially for non-contiguous I/O. We claim that a versioning-enabled storage backend can avoid the expensive synchronization exhibited by locking-based schemes, making concurrent accesses much more efficient. We describe a prototype implementation of this idea on top of ROMIO and report promising experimental results with standard MPI-IO benchmarks specifically designed to evaluate the performance of non-contiguous, overlapped I/O accesses under MPI atomicity guarantees.
1. Efficient Support for MPI-I/O Atomicity Based on Versioning
Viet-Trung Tran1, Bogdan Nicolae2, Gabriel Antoniu2, Luc Bougé1
KerData Research Team
1 ENS Cachan, IRISA, France
2 INRIA, IRISA, Rennes, France
2. Context: Data-Intensive Large-Scale HPC Simulations
Large-scale simulations of natural phenomena
Highly parallel platforms
I/O challenges
High I/O performance
Huge data sizes (~PB)
High concurrency
3. Data Access Pattern
Spatial splitting for parallelization
Ghost cells
Application data model vs. storage model
• Sequence of bytes
Concurrent, overlapping, non-contiguous I/O
Requires atomicity guarantees
5. State of the Art
Locking-based approaches to ensure atomicity
3 levels of implementation:
• Application
• MPI-I/O middleware
• Storage
Parallel I/O stack: Application (VisIt, Tornado simulation) → Data model (HDF5, NetCDF) → MPI-IO middleware → Parallel file systems (PVFS, GPFS, Lustre)
6. Our Approach
Dedicated interface for atomic, non-contiguous I/O
Provide atomicity guarantees at the storage level
• No need to translate the MPI consistency model into the storage consistency model
Shadowing as the key to efficient data access under concurrency
• No locking
• Concurrent overlapped writes are allowed
• Atomicity guarantees
Data striping
7. Building Block: BlobSeer
A KerData project (blobseer.gforge.inria.fr)
Data striping
Versioning-based concurrency control
Distributed metadata management
8. Building Block: BlobSeer (continued)
Distributed metadata management
• Organized as a segment tree: each node covers an extent [offset, size] of the blob; the root (e.g., [0, 8]) covers the whole blob
• Distributed over a DHT
Two-phase I/O: a metadata access phase (walking the trees down to the leaves), then a data access phase (fetching the chunks)
[Figure: metadata segment trees over a blob, nodes labeled [offset, size]: root [0, 8]; inner nodes [0, 4], [4, 4], [0, 2], [2, 2], [4, 2]; leaves [0, 1], [1, 1], [2, 1], [3, 1], [4, 1]]
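To make the structure concrete, here is a minimal, hedged sketch in C of what a metadata tree node could look like under this design; the field names and the flat DHT-key scheme are illustrative assumptions, not BlobSeer's actual format.

    #include <stdint.h>

    /* Illustrative metadata tree node: each node covers the extent
       [offset, offset + size) of the blob within a given snapshot.
       Children and data chunks are referenced by DHT keys, which is
       why I/O is two-phase: resolve keys down the tree (metadata
       access), then fetch the chunks (data access). */
    typedef struct {
        uint64_t version;             /* snapshot this node belongs to */
        uint64_t offset, size;        /* extent covered, e.g. [0, 8] at the root */
        uint64_t left_key, right_key; /* DHT keys of the children; 0 at leaves */
        uint64_t data_key;            /* at leaves: DHT key of the data chunk */
    } meta_node;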
9. Proposal for a Non-contiguous, Versioning-Oriented Access Interface
Non-contiguous write (usage sketched below):
vw = NONCONT_WRITE(id, buffers[], offsets[], sizes[])
Non-contiguous read:
NONCONT_READ(id, v, buffers[], offsets[], sizes[])
Challenges
• Non-contiguous I/O must be atomic
• Must remain efficient under concurrency
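To make the calling convention concrete, here is a minimal usage sketch in C. The function names follow the slide; the argument types, the explicit segment count (which C needs, since array lengths are not implicit), and the blob id value are illustrative assumptions rather than the actual prototype.

    #include <stdint.h>

    /* Assumed prototypes for the interface above (illustrative only). */
    int64_t NONCONT_WRITE(int32_t id, char *buffers[],
                          int64_t offsets[], int64_t sizes[], int n);
    int     NONCONT_READ(int32_t id, int64_t v, char *buffers[],
                         int64_t offsets[], int64_t sizes[], int n);

    int main(void) {
        static char a[4096], b[4096];
        char   *bufs[] = { a, b };
        int64_t offs[] = { 0, 1 << 20 };     /* two disjoint extents */
        int64_t lens[] = { sizeof a, sizeof b };
        int32_t blob_id = 1;                 /* illustrative blob id */

        /* Both segments become visible as one consistent snapshot;
           the returned version number identifies that snapshot. */
        int64_t vw = NONCONT_WRITE(blob_id, bufs, offs, lens, 2);

        /* Readers pin the exact snapshot they want to see. */
        NONCONT_READ(blob_id, vw, bufs, offs, lens, 2);
        return 0;
    }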
10. 1st Challenge: Non-contiguous I/O Must Be Atomic
Shadowing technique
• Isolates each non-contiguous update into one single consistent snapshot
• Done at the metadata level
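The following is a minimal sketch of shadowing as path copying over an in-memory binary tree; the names are illustrative, and in the actual system the nodes live in a DHT rather than behind pointers. What it shows: a write allocates only the nodes on the path from the root down to the modified leaf, shares every untouched subtree with the previous snapshot, and becomes visible atomically when the new root is published.

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct node {
        uint64_t offset, size;     /* extent of the blob this node covers */
        struct node *left, *right; /* children; NULL at leaves */
    } node;

    /* Return the root of a new snapshot in which the leaf covering `off`
       is replaced by `new_leaf`. Only the root-to-leaf path is cloned;
       every other subtree is shared with the old snapshot, which stays
       fully readable. Publishing the new root makes the whole update
       visible at once, i.e., as one consistent snapshot. */
    static node *shadow_write(const node *old, uint64_t off, node *new_leaf) {
        if (old->left == NULL)        /* reached the leaf to replace */
            return new_leaf;
        node *n = malloc(sizeof *n);
        *n = *old;                    /* clone this node on the path */
        if (off < old->left->offset + old->left->size)
            n->left  = shadow_write(old->left,  off, new_leaf);
        else
            n->right = shadow_write(old->right, off, new_leaf);
        return n;
    }

For a non-contiguous update, the same path copy is applied for each written segment before the single new root is published, which is how all segments land in one snapshot.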
11. 2nd Challenge: Efficiency Under Concurrent Accesses
Advantages of shadowing:

                               Our approach   Locking-based approach
Parallel data I/O phases       Yes            No
Parallel metadata I/O phases   ?              No
Overlapping data I/O           Parallel       No
13. Avoid Synchronization for Concurrent Segment Tree Generation
Delegate the generation of the shadowing tree to the client side
Shadowing trees are generated in parallel thanks to predictable metadata node IDs
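A hedged sketch of what a predictable node ID could mean: if the DHT key of a metadata node is a pure function of its coordinates (version, offset, size), every client can compute the keys of its entire shadow tree in isolation, so no coordination with other writers is needed while generating it. The key layout and the FNV-1a hash below are assumptions for illustration.

    #include <stdint.h>

    /* Deterministic DHT key for the metadata node covering
       [offset, offset + size) in snapshot `version` (FNV-1a hash,
       illustrative). Since the key depends only on these coordinates,
       concurrent writers derive the keys of their own trees without
       any synchronization. */
    static uint64_t node_key(uint64_t version, uint64_t offset, uint64_t size) {
        uint64_t h = 1469598103934665603ULL;   /* FNV offset basis */
        uint64_t fields[3] = { version, offset, size };
        for (int i = 0; i < 3; i++)
            for (int b = 0; b < 8; b++) {
                h ^= (fields[i] >> (8 * b)) & 0xFF;
                h *= 1099511628211ULL;         /* FNV prime */
            }
        return h;
    }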
14. Lazy Evaluation During Border Node Calculation
Build the metadata tree in a bottom-up fashion
Optimized for the non-contiguous access pattern
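A hedged sketch of the lazy step, reusing the hypothetical node_key helper from the previous slide. The assumption illustrated: a border node has one child inside the written range and one outside it; the outside child belongs to an earlier snapshot whose version number may not be known while the tree is built bottom-up, so its key is left unresolved and filled in only once the preceding snapshot is known.

    #include <stdint.h>

    typedef struct {
        uint64_t offset, size;        /* extent covered by this inner node */
        uint64_t left_key, right_key; /* DHT keys; 0 = deferred (border side) */
    } inner_node;

    /* From the previous sketch: deterministic DHT key of a node. */
    uint64_t node_key(uint64_t version, uint64_t offset, uint64_t size);

    /* Once the writer learns which snapshot immediately precedes its
       own, resolve every deferred child key against that version. */
    void resolve_border_nodes(inner_node *pending, int n, uint64_t prev_version) {
        for (int i = 0; i < n; i++) {
            uint64_t half = pending[i].size / 2;
            if (pending[i].left_key == 0)
                pending[i].left_key =
                    node_key(prev_version, pending[i].offset, half);
            if (pending[i].right_key == 0)
                pending[i].right_key =
                    node_key(prev_version, pending[i].offset + half, half);
        }
    }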
15. Summary: Overlapping Non-contiguous I/O

                      Our approach                            Locking-based approaches
Data I/O phases       Parallel                                Serialization
Metadata I/O phases   Close to parallel, thanks to:           Serialization
                      1. Arbitrary ordering
                      2. Ordering at the metadata level
                      3. Client-side shadowing in parallel
                      4. Lazy evaluation
16. Leveraging Our Versioning-Oriented Interface in the Parallel I/O Stack
Parallel I/O stack: Application (VisIt, Tornado simulation) → Data model (HDF5, NetCDF) → MPI-IO middleware → Storage optimized for atomic MPI-I/O
Integrating BlobSeer into the MPI-I/O middleware is straightforward
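As a hedged illustration of why the integration is straightforward (the abstract mentions a prototype on top of ROMIO), the sketch below maps one MPI-IO strided write onto the interface from slide 9. flatten_view is a hypothetical helper standing in for ROMIO's datatype-flattening machinery, MAX_SEGMENTS is an illustrative bound, and the explicit count argument is the same C-level addition as in the earlier sketch; none of these names come from ROMIO or the prototype.

    #include <mpi.h>
    #include <stdint.h>

    #define MAX_SEGMENTS 1024   /* illustrative bound on flattened extents */

    /* Hypothetical helpers: flatten_view() stands in for ROMIO's datatype
       flattening; NONCONT_WRITE is the interface proposed on slide 9. */
    int flatten_view(const void *buf, MPI_Datatype filetype, int64_t disp,
                     char *bufs[], int64_t offs[], int64_t lens[]);
    int64_t NONCONT_WRITE(int32_t id, char *bufs[],
                          int64_t offs[], int64_t lens[], int n);

    /* One strided write in MPI atomic mode becomes one NONCONT_WRITE:
       all flattened segments are published as a single snapshot, which
       is exactly the all-or-nothing visibility MPI atomicity requires,
       with no file locking involved. */
    int64_t atomic_strided_write(int32_t blob_id, const void *buf,
                                 MPI_Datatype filetype, int64_t disp) {
        char   *bufs[MAX_SEGMENTS];
        int64_t offs[MAX_SEGMENTS], lens[MAX_SEGMENTS];
        int n = flatten_view(buf, filetype, disp, bufs, offs, lens);
        return NONCONT_WRITE(blob_id, bufs, offs, lens, n);
    }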
17. Experimental Evaluation
• Our machines: Reservation on Grid'5000 platform
– 80 nodes
– Pentium 4 CPU @ 2.6 GHz, 4 GB RAM, Gigabit Ethernet
– Measured bandwidth: 117.5 MB/s for MTU = 1500 B
• 3 sets of experiments:
– Scalability of non-contiguous I/O
– Scalability under concurrency
– MPI-tile-I/O
22. Conclusion
• Experiments show promising results
  – We outperform locking-based approaches (comparison to the Lustre file system)
  – High-throughput non-contiguous I/O under atomicity guarantees
• Key features: shadowing, a dedicated API for atomic non-contiguous I/O
• Future work
  – Exposing the versioning interface to MPI-I/O applications
  – Potential improvements for producer-consumer workflows
  – Pyramid: a large-scale array-oriented active storage system
23. Context
Parallel I/O stack: Application (VisIt, Tornado simulation) → Data model (HDF5, NetCDF) → MPI-IO middleware → Parallel file systems
• Parallel file systems do not provide an atomic non-contiguous I/O interface
24. 2nd Challenge: Efficiency Under Concurrent Accesses
Minimize the ordering overhead
• Ordering is done at the metadata level
• Arbitrary order
Avoid synchronization for concurrent segment tree generation
• Delegate the generation of the shadowing tree to the client side
• Shadowing trees are generated in parallel
Lazy evaluation during border node calculation