We consider the challenge of building data management systems that meet an important requirement of today's data-intensive HPC applications: providing high I/O throughput while supporting highly concurrent data accesses. In this context, many applications rely on MPI-IO and require atomic, non-contiguous I/O operations that concurrently access shared data. In most existing implementations, the atomicity requirement is met through locking-based schemes, which have proven inefficient, especially for non-contiguous I/O. We claim that a versioning-enabled storage backend can avoid the expensive synchronization exhibited by locking-based schemes, making concurrent accesses much more efficient. We describe a prototype implementation of this idea on top of ROMIO and report promising experimental results with standard MPI-IO benchmarks specifically designed to evaluate the performance of non-contiguous, overlapped I/O accesses under MPI atomicity guarantees.
1. Efficient Support for MPI-I/O Atomicity Based on Versioning
Viet-Trung Tran1, Bogdan Nicolae2, Gabriel Antoniu2, Luc Bougé1
KerData Research Team
1 ENS Cachan, IRISA, France
2 INRIA, IRISA, Rennes, France
2. Context: Data-Intensive Large-Scale HPC Simulations
Large-scale simulations of natural phenomena
Highly parallel platforms
I/O challenges
High I/O performance
Huge data sizes (~PB)
High concurrency
3. Data Access Pattern
Spatial splitting for parallelization
Ghost cells
Application data model vs. storage model
• Sequence of bytes
Concurrent, overlapping, non-contiguous I/O
Requires atomicity guarantees
5. State of the Art
Locking-based approaches to ensure atomicity
3 levels of implementation:
• Application
• MPI-I/O middleware
• Storage
Parallel I/O stack: Application (VisIt, Tornado simulation) → Data model (HDF5, NetCDF) → MPI-IO middleware → Parallel file systems (PVFS, GPFS, Lustre)
6. Our Approach
Dedicated interface for atomic, non-contiguous I/O
Provide atomicity guarantees at the storage level
• No need to translate the MPI consistency model into the storage consistency model
Shadowing as the key to efficient data access under concurrency
• No locking
• Concurrent overlapped writes are allowed
• Atomicity guarantees
Data striping
7. Building Block: BlobSeer
A KerData project (blobseer.gforge.inria.fr)
Data striping
Versioning-based concurrency control
Distributed metadata management
8. Building Block: BlobSeer (continued)
Distributed metadata management
• Organized as a segment tree: each node covers an extent [offset, size] of the blob; the root (e.g., [0, 8]) covers the whole blob
• Distributed over a DHT
Two-phase I/O: a metadata access phase (walking the trees down to the leaves), then a data access phase (fetching the chunks)
[Figure: metadata segment trees over a blob, nodes labeled [offset, size]: root [0, 8]; inner nodes [0, 4], [4, 4], [0, 2], [2, 2], [4, 2]; leaves [0, 1], [1, 1], [2, 1], [3, 1], [4, 1]]
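To make the structure concrete, here is a minimal, hedged sketch in C of what a metadata tree node could look like under this design; the field names and the flat DHT-key scheme are illustrative assumptions, not BlobSeer's actual format.

    #include <stdint.h>

    /* Illustrative metadata tree node: each node covers the extent
       [offset, offset + size) of the blob within a given snapshot.
       Children and data chunks are referenced by DHT keys, which is
       why I/O is two-phase: resolve keys down the tree (metadata
       access), then fetch the chunks (data access). */
    typedef struct {
        uint64_t version;             /* snapshot this node belongs to */
        uint64_t offset, size;        /* extent covered, e.g. [0, 8] at the root */
        uint64_t left_key, right_key; /* DHT keys of the children; 0 at leaves */
        uint64_t data_key;            /* at leaves: DHT key of the data chunk */
    } meta_node;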
9. Proposal for a Non-contiguous, Versioning-Oriented Access Interface
Non-contiguous write (usage sketched below):
vw = NONCONT_WRITE(id, buffers[], offsets[], sizes[])
Non-contiguous read:
NONCONT_READ(id, v, buffers[], offsets[], sizes[])
Challenges
• Non-contiguous I/O must be atomic
• Must remain efficient under concurrency
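To make the calling convention concrete, here is a minimal usage sketch in C. The function names follow the slide; the argument types, the explicit segment count (which C needs, since array lengths are not implicit), and the blob id value are illustrative assumptions rather than the actual prototype.

    #include <stdint.h>

    /* Assumed prototypes for the interface above (illustrative only). */
    int64_t NONCONT_WRITE(int32_t id, char *buffers[],
                          int64_t offsets[], int64_t sizes[], int n);
    int     NONCONT_READ(int32_t id, int64_t v, char *buffers[],
                         int64_t offsets[], int64_t sizes[], int n);

    int main(void) {
        static char a[4096], b[4096];
        char   *bufs[] = { a, b };
        int64_t offs[] = { 0, 1 << 20 };     /* two disjoint extents */
        int64_t lens[] = { sizeof a, sizeof b };
        int32_t blob_id = 1;                 /* illustrative blob id */

        /* Both segments become visible as one consistent snapshot;
           the returned version number identifies that snapshot. */
        int64_t vw = NONCONT_WRITE(blob_id, bufs, offs, lens, 2);

        /* Readers pin the exact snapshot they want to see. */
        NONCONT_READ(blob_id, vw, bufs, offs, lens, 2);
        return 0;
    }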
10. 1st Challenge: Non-contiguous I/O Must Be Atomic
Shadowing technique
• Isolates each non-contiguous update into one single consistent snapshot
• Done at the metadata level
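The following is a minimal sketch of shadowing as path copying over an in-memory binary tree; the names are illustrative, and in the actual system the nodes live in a DHT rather than behind pointers. What it shows: a write allocates only the nodes on the path from the root down to the modified leaf, shares every untouched subtree with the previous snapshot, and becomes visible atomically when the new root is published.

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct node {
        uint64_t offset, size;     /* extent of the blob this node covers */
        struct node *left, *right; /* children; NULL at leaves */
    } node;

    /* Return the root of a new snapshot in which the leaf covering `off`
       is replaced by `new_leaf`. Only the root-to-leaf path is cloned;
       every other subtree is shared with the old snapshot, which stays
       fully readable. Publishing the new root makes the whole update
       visible at once, i.e., as one consistent snapshot. */
    static node *shadow_write(const node *old, uint64_t off, node *new_leaf) {
        if (old->left == NULL)        /* reached the leaf to replace */
            return new_leaf;
        node *n = malloc(sizeof *n);
        *n = *old;                    /* clone this node on the path */
        if (off < old->left->offset + old->left->size)
            n->left  = shadow_write(old->left,  off, new_leaf);
        else
            n->right = shadow_write(old->right, off, new_leaf);
        return n;
    }

For a non-contiguous update, the same path copy is applied for each written segment before the single new root is published, which is how all segments land in one snapshot.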
11. 2nd Challenge: Efficiency Under Concurrent Accesses
Advantages of shadowing:

                               Our approach   Locking-based approach
Parallel data I/O phases       Yes            No
Parallel metadata I/O phases   ?              No
Overlapping data I/O           Parallel       No
13. Avoid Synchronization for Concurrent Segment Tree Generation
Delegate the generation of the shadowing tree to the client side
Shadowing trees are generated in parallel thanks to predictable metadata node IDs
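A hedged sketch of what a predictable node ID could mean: if the DHT key of a metadata node is a pure function of its coordinates (version, offset, size), every client can compute the keys of its entire shadow tree in isolation, so no coordination with other writers is needed while generating it. The key layout and the FNV-1a hash below are assumptions for illustration.

    #include <stdint.h>

    /* Deterministic DHT key for the metadata node covering
       [offset, offset + size) in snapshot `version` (FNV-1a hash,
       illustrative). Since the key depends only on these coordinates,
       concurrent writers derive the keys of their own trees without
       any synchronization. */
    static uint64_t node_key(uint64_t version, uint64_t offset, uint64_t size) {
        uint64_t h = 1469598103934665603ULL;   /* FNV offset basis */
        uint64_t fields[3] = { version, offset, size };
        for (int i = 0; i < 3; i++)
            for (int b = 0; b < 8; b++) {
                h ^= (fields[i] >> (8 * b)) & 0xFF;
                h *= 1099511628211ULL;         /* FNV prime */
            }
        return h;
    }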
14. Lazy Evaluation During Border Node Calculation
Build the metadata tree in a bottom-up fashion
Optimized for the non-contiguous access pattern
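A hedged sketch of the lazy step, reusing the hypothetical node_key helper from the previous slide. The assumption illustrated: a border node has one child inside the written range and one outside it; the outside child belongs to an earlier snapshot whose version number may not be known while the tree is built bottom-up, so its key is left unresolved and filled in only once the preceding snapshot is known.

    #include <stdint.h>

    typedef struct {
        uint64_t offset, size;        /* extent covered by this inner node */
        uint64_t left_key, right_key; /* DHT keys; 0 = deferred (border side) */
    } inner_node;

    /* From the previous sketch: deterministic DHT key of a node. */
    uint64_t node_key(uint64_t version, uint64_t offset, uint64_t size);

    /* Once the writer learns which snapshot immediately precedes its
       own, resolve every deferred child key against that version. */
    void resolve_border_nodes(inner_node *pending, int n, uint64_t prev_version) {
        for (int i = 0; i < n; i++) {
            uint64_t half = pending[i].size / 2;
            if (pending[i].left_key == 0)
                pending[i].left_key =
                    node_key(prev_version, pending[i].offset, half);
            if (pending[i].right_key == 0)
                pending[i].right_key =
                    node_key(prev_version, pending[i].offset + half, half);
        }
    }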
15. Summary: Overlapping Non-contiguous I/O

                      Our approach                            Locking-based approaches
Data I/O phases       Parallel                                Serialization
Metadata I/O phases   Close to parallel, thanks to:           Serialization
                      1. Arbitrary ordering
                      2. Ordering at the metadata level
                      3. Client-side shadowing in parallel
                      4. Lazy evaluation
16. Leveraging Our Versioning-Oriented Interface in the Parallel I/O Stack
Parallel I/O stack: Application (VisIt, Tornado simulation) → Data model (HDF5, NetCDF) → MPI-IO middleware → Storage optimized for atomic MPI-I/O
Integrating BlobSeer into the MPI-I/O middleware is straightforward
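As a hedged illustration of why the integration is straightforward (the abstract mentions a prototype on top of ROMIO), the sketch below maps one MPI-IO strided write onto the interface from slide 9. flatten_view is a hypothetical helper standing in for ROMIO's datatype-flattening machinery, MAX_SEGMENTS is an illustrative bound, and the explicit count argument is the same C-level addition as in the earlier sketch; none of these names come from ROMIO or the prototype.

    #include <mpi.h>
    #include <stdint.h>

    #define MAX_SEGMENTS 1024   /* illustrative bound on flattened extents */

    /* Hypothetical helpers: flatten_view() stands in for ROMIO's datatype
       flattening; NONCONT_WRITE is the interface proposed on slide 9. */
    int flatten_view(const void *buf, MPI_Datatype filetype, int64_t disp,
                     char *bufs[], int64_t offs[], int64_t lens[]);
    int64_t NONCONT_WRITE(int32_t id, char *bufs[],
                          int64_t offs[], int64_t lens[], int n);

    /* One strided write in MPI atomic mode becomes one NONCONT_WRITE:
       all flattened segments are published as a single snapshot, which
       is exactly the all-or-nothing visibility MPI atomicity requires,
       with no file locking involved. */
    int64_t atomic_strided_write(int32_t blob_id, const void *buf,
                                 MPI_Datatype filetype, int64_t disp) {
        char   *bufs[MAX_SEGMENTS];
        int64_t offs[MAX_SEGMENTS], lens[MAX_SEGMENTS];
        int n = flatten_view(buf, filetype, disp, bufs, offs, lens);
        return NONCONT_WRITE(blob_id, bufs, offs, lens, n);
    }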
17. Experimental Evaluation
• Our machines: Reservation on Grid'5000 platform
– 80 nodes
– Pentium 4 CPU @ 2.6 GHz, 4 GB RAM, Gigabit Ethernet
– Measured bandwidth: 117.5 MB/s for MTU = 1500 B
• 3 sets of experiments:
– Scalability of non-contiguous I/O
– Scalability under concurrency
– MPI-tile-I/O
22. Conclusion
• Experiments show promising results
  – We outperform locking-based approaches (comparison to the Lustre file system)
  – High-throughput non-contiguous I/O under atomicity guarantees
• Key features: shadowing, a dedicated API for atomic non-contiguous I/O
• Future work
  – Exposing the versioning interface to MPI-I/O applications
  – Potential improvements for producer-consumer workflows
  – Pyramid: a large-scale array-oriented active storage system
23. Context
Parallel I/O stack: Application (VisIt, Tornado simulation) → Data model (HDF5, NetCDF) → MPI-IO middleware → Parallel file systems
• Parallel file systems do not provide an atomic non-contiguous I/O interface
24. 2nd Challenge: Efficiency Under Concurrent Accesses
Minimize the ordering overhead
• Ordering is done at the metadata level
• Arbitrary order
Avoid synchronization for concurrent segment tree generation
• Delegate the generation of the shadowing tree to the client side
• Shadowing trees are generated in parallel
Lazy evaluation during border node calculation