SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats
1. SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats
Speaker: LIN Qian
http://www.comp.nus.edu.sg/~linqian
2. Scientific data analysis today
• Increasingly data-intensive
– Volume approximately doubles each year
• Stored in certain specialized formats
– NetCDF, HDF5, ADIOS ...
• Popularity of MapReduce and its variants
– Free accessibility
– Easy programmability
– Good scalability
– Built-in fault tolerance
5. Scientific data analysis today (cont.)
• “Store-first-analyze-after”
– Reload data into another file system
o E.g., load data from PVFS to HDFS
– Reload data into another data format
o E.g., load NetCDF/HDF5 data into a specific format
• Problems
– Long data migration/transformation time
– Heavy load on network and disks
6. SciMATE
• In-situ scientific data analysis
– MapReduce with AlternaTE API
– Supporting NetCDF, HDF5 and flat-files
o No data reloading!
– Transparent to app developers
• Optimized for
– Access strategies
– Access patterns
9. Integrating a new data format
• Data adaption layer is customizable
– Third-party adapter
– Open for extension but closed for modification
• A new adapter has to implement the generic block loader interface
– Partitioning function and auxiliary functions
– Data access functions
– Data access functions
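
The shape of such an adapter can be sketched in Python. This is only an illustration of the idea on this slide, not SciMATE's actual interface: the class names `BlockLoader` and `FlatFileLoader`, the method names `partition` and `element_count`, and all signatures are assumptions made for the example. Only `full_read` borrows a name that appears elsewhere in the deck.

```python
import os
from abc import ABC, abstractmethod
from array import array

ITEM_SIZE = array("d").itemsize  # bytes per float64 element


class BlockLoader(ABC):
    """Hypothetical stand-in for a generic block loader interface."""

    @abstractmethod
    def element_count(self):
        """Auxiliary function: total number of elements in the dataset."""

    @abstractmethod
    def partition(self, num_blocks):
        """Partitioning function: return a list of (offset, length) blocks."""

    @abstractmethod
    def full_read(self, offset, length):
        """Data access function: read one whole block."""


class FlatFileLoader(BlockLoader):
    """Toy adapter for a flat binary file of float64 values (assumed layout)."""

    def __init__(self, path):
        self.path = path

    def element_count(self):
        return os.path.getsize(self.path) // ITEM_SIZE

    def partition(self, num_blocks):
        # Spread elements as evenly as possible; early blocks take the remainder.
        base, extra = divmod(self.element_count(), num_blocks)
        blocks, offset = [], 0
        for i in range(num_blocks):
            length = base + (1 if i < extra else 0)
            blocks.append((offset, length))
            offset += length
        return blocks

    def full_read(self, offset, length):
        # Seek to the block and read exactly `length` elements in place.
        values = array("d")
        with open(self.path, "rb") as f:
            f.seek(offset * ITEM_SIZE)
            values.fromfile(f, length)
        return list(values)
```

Adding another format in this sketch means writing one more subclass; the abstract base class, and any framework code written against it, stays untouched, which is the "open for extension but closed for modification" point.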
10. Data access strategies and patterns
• full_read()
– too expensive for reading small data subsets
• partial_read()
– Strided pattern
o partial_read_by_block()
– Column pattern
o partial_read_by_column()
– Discrete point pattern
o partial_read_by_list()
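
The three partial-read patterns can be illustrated on a small in-memory grid. The function names mirror the slide, but the parameters (`start`, `count`, `stride`, `columns`, `points`) and the list-of-lists dataset are assumptions for this sketch; the real functions read from files through the format libraries.

```python
def full_read(data):
    """Read everything: returns a copy of the whole dataset."""
    return [row[:] for row in data]


def partial_read_by_block(data, start, count, stride):
    """Strided pattern: per-dimension start index, element count, and stride."""
    (r0, c0), (nr, nc), (sr, sc) = start, count, stride
    return [[data[r0 + i * sr][c0 + j * sc] for j in range(nc)]
            for i in range(nr)]


def partial_read_by_column(data, columns):
    """Column pattern: the selected columns of every row."""
    return [[row[c] for c in columns] for row in data]


def partial_read_by_list(data, points):
    """Discrete point pattern: an explicit list of (row, column) coordinates."""
    return [data[r][c] for r, c in points]


# A 4 x 6 grid where the value at (r, c) is 10 * r + c.
grid = [[10 * r + c for c in range(6)] for r in range(4)]

strided = partial_read_by_block(grid, start=(0, 1), count=(2, 3), stride=(2, 2))
columns = partial_read_by_column(grid, [0, 4])
points = partial_read_by_list(grid, [(0, 0), (3, 5)])

print(strided)  # [[1, 3, 5], [21, 23, 25]]
print(columns)  # [[0, 4], [10, 14], [20, 24], [30, 34]]
print(points)   # [0, 35]
```

Each partial read returns only the requested elements, whereas `full_read()` copies all 24 values even when only two are needed.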
11. Access Pattern Optimization
• Strided pattern
– directly supported by API
• Discrete point pattern
– no optimization
• Column pattern
– fixed-size column read (figure: reads labeled 1 to 5)
– contiguous column read (figure: reads labeled 1 and 2)
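
The difference between the two column strategies can be sketched by counting read requests. This is one plausible reading of the slide, not the authors' implementation: it assumes a fixed-size read fetches a fixed number of requested columns per call (one by default), while a contiguous read merges adjacent columns into a single call. The helper names and the column indices are made up for illustration.

```python
def fixed_size_reads(columns, size=1):
    """Plan one read per group of `size` requested columns."""
    cols = sorted(set(columns))
    return [cols[i:i + size] for i in range(0, len(cols), size)]


def contiguous_reads(columns):
    """Plan one read per run of adjacent columns, as (first_column, width)."""
    runs = []
    for c in sorted(set(columns)):
        if runs and c == runs[-1][1] + 1:
            runs[-1][1] = c          # extend the current run
        else:
            runs.append([c, c])      # start a new run
    return [(first, last - first + 1) for first, last in runs]


wanted = [2, 3, 4, 8, 9]
print(len(fixed_size_reads(wanted)), "reads:", fixed_size_reads(wanted))
print(len(contiguous_reads(wanted)), "reads:", contiguous_reads(wanted))
```

For these five columns the fixed-size plan issues five requests and the contiguous plan issues two, `(2, 3)` and `(8, 2)`. Fewer, larger requests generally mean fewer calls into the underlying I/O library, at the cost of variable-sized buffers.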
12. Evaluation
• System functionality and scalability
– 16 GB datasets
– Data processing times
o k-means, PCA, kNN
o Thread scalability, node scalability
– Data loading times
o k-means, PCA
o Node scalability
• Partial read vs. Full read
• Fixed-size column read vs. Contiguous column read
18. Conclusion and Future Work
• Conclusion
– Avoid bulk data transfers and vast data transformation
– Provide a customizable data format adaption API
– Support optimized read via access strategies & patterns
• Future Work
– Compare with SciHadoop