SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats

Qian Lin

Speaker: LIN Qian
http://www.comp.nus.edu.sg/~linqian

Scientific data analysis today
• Increasingly data-intensive
– Volume approximately doubles each year
• Stored in certain specialized formats
– NetCDF (Network Common Data Form), HDF5 (Hierarchical Data Format), ADIOS ...
• Popularity of MapReduce and its variants
– Free accessibility
– Easy programmability
– Good scalability
– Built-in fault tolerance

Scientific data analysis today (cont.)
• “Store-first-analyze-after”
– Reload data into another file system
o E.g., load data from PVFS to HDFS
– Reload data into another data format
o E.g., load NetCDF/HDF5 data into a specific format
• Problems
– Long data migration/transformation time
– Stress on network and disks

SciMATE
• In-situ scientific data analysis
– MapReduce with AlternaTE API
– Supporting NetCDF, HDF5 and flat-files
o No data reloading!
– Transparent to app developers

• Optimized for
– Access strategies
– Access patterns

System overview
[Figure: components labeled Scientific Data Processing Module and Runtime System]

Integrating a new data format
• Data adaption layer is customizable
– Third-party adapter
– Open for extension but closed for modification
• A new adapter has to implement the generic block loader interface
– Partitioning function and auxiliary functions
– Data access functions
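The slide names the pieces an adapter must supply but not their signatures. The following is a minimal Python sketch of how such an interface could look, not SciMATE's actual API: `BlockLoader`, `FlatFileLoader`, `register_adapter`, and `load_blocks` are all hypothetical names chosen for illustration. The runtime touches only the abstract interface, so a new format is added by registering a subclass rather than by editing the runtime.

```python
from abc import ABC, abstractmethod

class BlockLoader(ABC):
    """Hypothetical stand-in for the generic block loader interface."""

    @abstractmethod
    def num_elements(self):
        """Auxiliary function: total number of elements in the dataset."""

    @abstractmethod
    def read(self, start, count):
        """Data access function: return `count` elements beginning at `start`."""

    def partition(self, num_blocks):
        """Partitioning function: split the dataset into near-equal (start, count) blocks."""
        base, extra = divmod(self.num_elements(), num_blocks)
        blocks, start = [], 0
        for i in range(num_blocks):
            count = base + (1 if i < extra else 0)  # spread the remainder over the first blocks
            blocks.append((start, count))
            start += count
        return blocks

# Registry of adapters keyed by format name (illustrative, not part of the source).
ADAPTERS = {}

def register_adapter(format_name, loader_cls):
    ADAPTERS[format_name] = loader_cls

class FlatFileLoader(BlockLoader):
    """Toy adapter: an in-memory sequence stands in for a flat file."""

    def __init__(self, values):
        self._values = list(values)

    def num_elements(self):
        return len(self._values)

    def read(self, start, count):
        return self._values[start:start + count]

register_adapter("flat", FlatFileLoader)

def load_blocks(format_name, source, num_blocks):
    """Runtime-side code: unchanged no matter how many adapters are registered."""
    loader = ADAPTERS[format_name](source)
    return [loader.read(start, count) for start, count in loader.partition(num_blocks)]

blocks = load_blocks("flat", range(10), 3)
```

With ten elements and three blocks, the partitioning function here produces blocks of 4, 3, and 3 elements. A NetCDF or HDF5 adapter would subclass `BlockLoader` in the same way and leave `load_blocks` untouched.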

Data access strategies and patterns
• full_read()
– too expensive for reading small data subsets
• partial_read()
– Strided pattern
o partial_read_by_block()
– Column pattern
o partial_read_by_column()
– Discrete point pattern
o partial_read_by_list()
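To make the three partial-read patterns concrete, here is a small Python sketch over an in-memory 2-D list. The function names follow the slide, but the signatures, the (start, count, stride) argument convention, and the list-of-lists dataset are assumptions made for illustration. The real calls would operate on NetCDF or HDF5 files.

```python
def full_read(data):
    """Read the entire dataset (costly when only a small subset is needed)."""
    return [row[:] for row in data]

def partial_read_by_block(data, start, count, stride):
    """Strided pattern: per dimension, take `count` indices from `start`, stepping by `stride`."""
    rows = range(start[0], start[0] + count[0] * stride[0], stride[0])
    cols = range(start[1], start[1] + count[1] * stride[1], stride[1])
    return [[data[r][c] for c in cols] for r in rows]

def partial_read_by_column(data, columns):
    """Column pattern: keep only the requested columns of every row."""
    return [[row[c] for c in columns] for row in data]

def partial_read_by_list(data, points):
    """Discrete point pattern: fetch individual (row, column) elements."""
    return [data[r][c] for r, c in points]

# Toy 4 x 6 dataset in which element (r, c) holds 10*r + c.
grid = [[10 * r + c for c in range(6)] for r in range(4)]

strided = partial_read_by_block(grid, start=(0, 1), count=(2, 3), stride=(2, 2))
columns = partial_read_by_column(grid, [0, 4])
points = partial_read_by_list(grid, [(0, 0), (3, 5)])
```

In this example the strided read returns rows 0 and 2 restricted to columns 1, 3, and 5. The column read returns columns 0 and 4 of all four rows, and the point read returns the two corner elements.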

Access Pattern Optimization
• Strided pattern
– directly supported by API
• Discrete point pattern
– no optimization
• Column pattern
– fixed-size column read
– contiguous column read
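The slide does not define the two column strategies. One plausible reading, sketched below in Python, is that a fixed-size column read issues one request per requested column, while a contiguous column read merges adjacent columns into runs and issues one request per run. Both function names and the (start_column, width) request format are assumptions for illustration only.

```python
def fixed_size_reads(columns):
    """Assumed behavior: one read request of width 1 for each requested column."""
    return [(c, 1) for c in sorted(set(columns))]

def contiguous_reads(columns):
    """Assumed behavior: merge adjacent columns into (start_column, width) runs."""
    runs = []
    for c in sorted(set(columns)):
        if runs and c == runs[-1][0] + runs[-1][1]:
            start, width = runs[-1]
            runs[-1] = (start, width + 1)  # column extends the current run
        else:
            runs.append((c, 1))            # gap found, so start a new run
    return runs

wanted = [2, 3, 4, 8, 9]
fixed = fixed_size_reads(wanted)
merged = contiguous_reads(wanted)
```

For the columns 2, 3, 4, 8, 9 this gives five requests under the fixed-size plan and two under the contiguous plan, `(2, 3)` and `(8, 2)`. Under this reading, the contiguous plan's advantage shrinks as the requested columns become more scattered.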

Evaluation
• System functionality and scalability
– 16 GB datasets
– Data processing times
o k-means, PCA, kNN
o thread scalability, node scalability
– Data loading times
o k-means, PCA
o node scalability
• Partial read vs. full read
• Fixed-size column read vs. contiguous column read

Fixed-size column read vs. Contiguous column read
[Figure: panels for NetCDF and HDF5]

Contiguous column read
• NetCDF shows better column non-contiguity tolerance than HDF5.

Conclusion and Future Work
• Conclusion
– Avoid bulk data transfers and vast data transformation
– Provide a customizable data format adaption API
– Support optimized read via access strategies & patterns
• Future Work
– Compare with SciHadoop

