Reverse Time Migration via Resilient
Distributed Datasets: Towards In-Memory
Coherence of Seismic-Reflection Wavefields
using Apache Spark
Ian Lumb
HPCS 2015 - Montreal
http://hpcs.ca
Outline
● The challenges and opportunities of RTM
● Refactoring RTM with Spark/RDDs
o Spark’ing coherence between wavefields
● Summary
http://www.acceleware.com/technical-papers
Zhou 2014
Fig. 7.25
Motivation
● RTM is performance-challenged
o Algorithms research remains topical
▪ GPUs responsible for compelling results
● Revisit RTM as a ‘Big Data problem’
o In-memory analytics has the potential to
▪ Improve performance of data and wavefield
manipulations in concert with computations
▪ Introduce new prospects for imaging conditions
Key Performance Challenges
● RTM modeling kernel is compute intensive
o Stable, non-dispersive solution via FDM requires
▪ Small time steps and small grid intervals
▪ Higher-order approximations of the spatial
derivatives
● RTM wavefields exceed memory capacity
o Multiple-TB source volumes must be stored to disk
e.g., Liu et al., Computers & Geosciences 59 (2013) 17–23
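The two challenges above can be put into rough numbers. The sketch below assumes the standard second-order leapfrog scheme on a uniform grid, whose stability bound is dt <= dx / (v_max * sqrt(ndim)); higher-order spatial stencils tighten this bound further. The grid size, velocity, record length, and the choice to store every time step as 4-byte floats are hypothetical values chosen for illustration, not figures from the talk.

```python
import math


def cfl_time_step(dx, v_max, ndim=3):
    """Largest stable time step (s) for a second-order leapfrog scheme.

    Assumes a uniform grid interval dx (m) and maximum velocity v_max (m/s).
    """
    return dx / (v_max * math.sqrt(ndim))


def wavefield_bytes(nx, ny, nz, nt, bytes_per_sample=4):
    """Bytes needed to keep every time step of one source wavefield."""
    return nx * ny * nz * nt * bytes_per_sample


# Hypothetical survey parameters (assumptions, not from the slides).
dx = 10.0             # grid interval in metres
v_max = 4500.0        # fastest velocity in the model, m/s
record_length = 6.0   # seconds of propagation to model

dt = cfl_time_step(dx, v_max)
nt = math.ceil(record_length / dt)
size = wavefield_bytes(1000, 1000, 1000, nt)

print(f"dt <= {dt * 1e3:.3f} ms, {nt} time steps, {size / 1e12:.1f} TB")
```

Under these assumptions a single source wavefield on a 1000 x 1000 x 1000 grid needs about 4,700 time steps and close to 19 TB if every step is kept, which is in line with the multiple-TB volumes noted above. Production codes often store decimated snapshots, so this is an upper-bound estimate.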
Resilient Distributed Datasets (RDDs)
● Abstraction for in-memory computing
● Fault-tolerant, parallel data structures
o Cluster-ready
● Optionally persistent
● Can be partitioned for optimal placement
● Manipulated via operators
Zaharia et al., NSDI 2012
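To make the properties listed above concrete, here is a toy, single-process sketch of the idea. `ToyRDD` and its `lose_partition` method are hypothetical names invented for this illustration; the class is not Spark's API or implementation, although `parallelize`, `map`, `persist` and `collect` are named after the Spark operators they imitate. It shows a partitioned dataset that is manipulated via operators, optionally persisted in memory, and rebuilt from its recorded lineage when a cached partition is lost. The root data is assumed to be durable.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's implementation)."""

    def __init__(self, num_partitions, source=None, parent=None, fn=None):
        self.num_partitions = num_partitions
        self._source = source    # root data, one list per partition
        self._parent = parent    # lineage: dataset this one derives from
        self._fn = fn            # lineage: operator applied to the parent
        self._persist = False
        self._cache = {}         # partition index -> materialized values

    @classmethod
    def parallelize(cls, data, num_partitions):
        size = -(-len(data) // num_partitions)  # ceiling division
        chunks = [data[i * size:(i + 1) * size] for i in range(num_partitions)]
        return cls(num_partitions, source=chunks)

    def map(self, fn):
        # Lazy: only the operator is recorded, nothing is computed yet.
        return ToyRDD(self.num_partitions, parent=self, fn=fn)

    def persist(self):
        # Ask for computed partitions to be kept in memory.
        self._persist = True
        return self

    def compute(self, index):
        if index in self._cache:
            return self._cache[index]
        if self._parent is None:
            values = list(self._source[index])
        else:
            # Recompute from lineage: apply the operator to the parent.
            values = [self._fn(v) for v in self._parent.compute(index)]
        if self._persist:
            self._cache[index] = values
        return values

    def lose_partition(self, index):
        # Simulate a node failure that drops one cached partition.
        self._cache.pop(index, None)

    def collect(self):
        return [v for i in range(self.num_partitions) for v in self.compute(i)]


samples = ToyRDD.parallelize([0, 1, 2, 3, 4, 5], num_partitions=3)
squared = samples.map(lambda v: v * v).persist()
all_values = squared.collect()   # materializes and caches all partitions
squared.lose_partition(1)        # simulated failure
recovered = squared.compute(1)   # rebuilt from lineage, not from a replica
```

The design point is that recovery needs only the lineage (parent plus operator) for the lost partition, so no replica of the derived data has to be kept.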
RTM via RDDs: Implementation using Spark
● Apache Spark is an implementation of RDDs
● Make use of HDFS or alternative FS
o GPFS, AWS S3, OpenStack Swift, Ceph or Lustre
● Choose appropriate programming model(s)
o Not limited to MapReduce
o Iterative and/or interactive (including streaming)
● Manage Spark workloads
o Built-in (standalone) mode, YARN mode or Mesos
o Univa Universal Resource Broker
after Lumb, insideBIGDATA, http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
RTM via RDDs: Implementation using Spark (2)
● Deployable on bare metal … clouds
o Monitoring/management via Bright Cluster Manager
● Introduces analytics possibilities for RTM
o Program in Java (C/C++ via JNA), Scala or Python
● Uptake is significant - rapidly growing community
● Results are extremely impressive
o Exploit CPUs and/or GPUs
after Lumb, insideBIGDATA, http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
RTM via RDDs: Opportunities
● Apply RDDs to gathers of seismic data
o Partition RDDs optimally for wavefields calculations
● Apply RDDs to source wavefields
o Partition RDDs optimally for cross-correlation of
forward and reverse time wavefields
▪ Significantly reduce/eliminate disk I/O
● Investigate alternate imaging conditions
o Machine-learning and/or graph-analytics algorithms
in addition to cross-correlation
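The cross-correlation of forward and reverse-time wavefields mentioned above is commonly written as the zero-lag imaging condition I(x) = sum over t of S(x, t) * R(x, t), where S is the source wavefield and R the back-propagated receiver wavefield. The sketch below is a minimal, in-memory illustration of that sum. It assumes both wavefields are already aligned on the same time axis and flattened to one list of samples per snapshot; the function name and the tiny example data are hypothetical.

```python
def zero_lag_image(source_snapshots, receiver_snapshots):
    """Zero-lag cross-correlation imaging condition.

    Each argument is a list of snapshots in the same time order; each
    snapshot is a list with one sample per grid point.
    """
    n_points = len(source_snapshots[0])
    image = [0.0] * n_points
    for s_snap, r_snap in zip(source_snapshots, receiver_snapshots):
        for i in range(n_points):
            # Accumulate the product of the two wavefields at this point.
            image[i] += s_snap[i] * r_snap[i]
    return image


# Hypothetical three-point grid and three time steps: the two wavefields
# coincide only at the middle grid point, at the second time step.
source = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
receiver = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
image = zero_lag_image(source, receiver)
```

Because the sum at each grid point is independent of every other grid point, partitioning both wavefields by the same spatial blocks would keep the correlation local to a partition, which is the motivation for partitioning the RDDs "optimally" in the bullets above.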
[Figure labels: Spark Workers; Spark (YARN) Master; Spark or YARN]
http://www.informationweek.com/big-data/big-data-analytics/apache-spark-3-promising-use-cases/a/d-id/1319660
http://ipython.org/notebook.html
Thunder: Initial Impressions
● Written in Spark's Python API (PySpark)
o Makes use of scipy, numpy, and scikit-learn
● IPython Notebook serves as interactive GUI
▪ Runs in a Web browser
▪ Notebooks can include text and graphics
▪ Secure, remote access to an in-cluster IPython
Notebook server
● Includes modular functions for time-series analysis
● Can interface with C/C++ from Python
http://thunder-project.org/
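As an illustration of the kind of time-series function referred to above, the following sketch computes a normalized cross-correlation between two series over a range of integer lags. It is written in plain Python and is not Thunder's implementation or API; the function names and the example series are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    if denom == 0.0:
        return 0.0
    return sum(p * q for p, q in zip(da, db)) / denom


def cross_correlate(x, y, max_lag):
    """Correlation between x[t] and y[t + lag] for lag in -max_lag..max_lag."""
    n = len(x)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:n - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:n + lag]
        result[lag] = pearson(a, b)
    return result


# Hypothetical example: y is x delayed by one sample.
x = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
y = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
correlation = cross_correlate(x, y, max_lag=2)
best_lag = max(correlation, key=correlation.get)
```

Here `best_lag` recovers the one-sample delay between the two series. In a Spark setting the same per-series function would be applied independently to each record of a distributed collection.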
Is there a case for migration?
● In-memory computing via RDDs is promising
o Application to gathers and wavefields
● Spark provides analytics upside
o Imaging conditions other than cross-correlation
● Spark may be applicable to modeling kernels
● Spark can be easily incorporated into pre-existing IT
infrastructures
o Complements existing HPC environments
http://rice2015oghpc.rice.edu/technical-program/
Summary
● Is there a case for migration?
o From: RTM via HPC
o To: RTM via Big Data or (Big Data and HPC)
● Does it make sense to refactor other HPC
problems as ‘Big Data problems’?
Refactoring HPC with Spark/RDDs …
● Could Spark/RDDs replace MPI?
o Spark has primitives for distributed in-memory
parallel computing … including fault tolerance
Acknowledgements
● M. Zaharia et al. for RDDs
● Communities responsible for Spark, Python & Thunder
● M. Lamarca, P. Labropoulos, D. Shestakov & L.
Gibbons at Bright Computing
Questions?
Ian Lumb
ianlumb@yorku.ca
ian.lumb@brightcomputing.com
Resources
● RTM's scientific context
● Spark support in Bright Cluster Manager for
Apache Hadoop


Editor's Notes

  1. From HPCS 2015 abstract: “Ultimately, in Reverse Time Seismic Migration (RTM), the coherence between two wavefields is determined across all depth-common gathers (i.e., source-receiver pairings) of seismic-reflection data. Because coherence between the two wavefields minimizes the impact of artifacts in the imaged section (or volume) arising from complex geological structures (e.g., folds, faults, domes, steeply dipping lithological interfaces), seismic-reflection data processed via RTM most-accurately depicts all reflectors in their actual locations in space and time (e.g., Zhou, Practical Seismic Data Analysis, Cambridge University Press, 2014).”
  2. An actual example illustrating the forward and reverse wavefields plus the migrated image. Source of image indicated.
  3. From the abstract for the HPCS 2015 event: “In the classical approach for RTM, forward modeling involving the three-dimensional wave equation (3D-WEM) results in source wavefields that are computed using the Finite Difference Method (FDM), and then stored to disk. In a subsequent step, and on a per-gather basis, source wavefields are read from disk so that they can be cross-correlated with the backwards-propagated (i.e., time-reversed) wavefields corresponding to the receivers - a step that again requires use of the FDM modeling kernel for the 3D-WEM. The inherent requirement for disk I/O involving multiple TB volumes of seismic-reflection data, during the application of the imaging condition (i.e., the cross-correlation step), results in a performance penalty well known to be highly problematical throughout the petroleum-exploration industry.” Flow chart adaptation based on algorithm detailed by Liu et al., Computers & Geosciences 59 (2013) 17–23.
  4. From https://ianlumb.wordpress.com/2015/04/01/possibilities-for-reverse-time-seismic-migration-rtm-using-apache-spark/: “RTM has a storied history of being performance-challenged. Although the method was originally conceived by geophysicists in the 1980s, it was almost two decades before it became computationally tractable. Considered table stakes in terms of seismic processing by today’s standards, algorithms research for RTM remains highly topical – not just at Rice, York and other universities, but also at the multinational corporations whose very livelihood depends upon the effective and efficient processing of seismic-reflection data. And of particular note are the consistent gains being made since the introduction of GPU programmability via CUDA, as innovative algorithms for RTM can exploit this platform for double-digit speedups.” From the HPCS abstract: “Over the past decade or so, General Purpose Graphics Processing Units (GPGPUs) have been employed to significantly reduce the processing burden of disk I/O in executing RTM. Broadly speaking, in applying RTM’s imaging condition, algorithms have made effective and efficient use of both the memory hierarchy as well as parallel-processing capabilities inherent in GPGPUs. Despite the progress that has been made, particularly in the implementation of algorithms using CUDA for programming GPGPUs, the computational performance of RTM remains an active area of research that continues to engage academics as well as industry.”
  5. JNA = Java Native Access
  6. From the HPCS 2015 abstract: “Recent work has already indicated that seismic reflection data in accepted industry formats can be distributed in memory across a cluster using Apache Spark (Yan et al., “A Spark-based Seismic Data Analytics Cloud”, 2015 Rice Oil & Gas Workshop, Houston, TX, http://rice2015.og-hpc.org/technical-program/).” “... attention here focuses on use of RDDs for facilitating the assessment of coherence between seismic-reflection wavefields in memory. More specifically an algorithm that significantly reduces the impact of disk I/O, in the wavefield manipulations required by RTM, is proposed based on RDDs and subsequently implementation-prototyped using Apache Spark.”
  7. From the HPCS 2015 abstract: “The need to cross-correlate two wavefields in the application of RTM’s imaging condition remains one of two fundamental challenges with use of the method in practice (e.g., Liu et al., Computers & Geosciences 59, 17–23, 2013). In a significant departure from previous approaches, this computational challenge is addressed here through the introduction of Resilient Distributed Datasets (RDDs) for RTM’s precomputed source wavefields. RDDs are a relatively recent abstraction for in-memory computing ideally suited to distributed computing environments like clusters (Zaharia et al., NSDI 2012, http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf). Originally introduced for Big Data Analytics and popularized (e.g., Lumb, “8 Reasons Apache Spark is So Hot”, insideBIGDATA, http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/, 2015) through the open-source implementation known as Apache Spark (https://spark.apache.org/), RDDs also appear promising in recontextualizing RTM’s imaging condition.”
  8. A schematic of the three-tier solution architecture:
     ● Client tier - Interactive analysis is facilitated through use of an IPython Notebook running remotely in a Web browser. Implemented via Spark’s Python API, Thunder provides classes that include the ability to analyse time series.
     ● App-server tier - Spark itself comprises the bulk of this tier - from its core, to analytics apps (including Thunder), finally to interfaces to a number of external data sources.
     ● Worker tier - Spark workers execute tasks generated in interactive analysis/processing sessions involving use of Thunder.
     Overall, workload is managed by the Spark Master (when Spark runs in a standalone mode), or via Hadoop’s YARN (when YARN mode is in effect).
  9. The app-server tier illustrating Apache Spark and its support for various data sources.
  10. A screenshot from Version 7.1 of Bright Cluster Manager for Apache Hadoop. In this screenshot, it is clear that Bright has deployed Apache Spark in tandem with Apache Hadoop - in other words, Spark makes use of HDFS as its file system, and YARN as its resource negotiator for managing workloads. Bright’s capabilities for monitoring and managing Hadoop, Spark and the physical cluster are enabled through the use of roles that include HDFS and its services, ZooKeeper, YARN as well as Spark. Spark support in Version 7.1 of Bright Cluster Manager for Apache Hadoop:
     ● Bright deploys the physical cluster, Hadoop & Spark
       o Includes HDFS, YARN and other data-platform components
     ● Bright monitors and manages the physical cluster, Hadoop and Spark
       o 50 metrics specific to Spark plus 650 for Hadoop that complement 160 metrics for the physical cluster
       o 3 Spark-specific management roles with 15 parameters
     ● Bright manages Spark workloads
       o Standalone mode uses Spark’s built-in capability
       o YARN mode uses Hadoop’s capabilities
     ● Bright manages Spark with or without Hadoop
       o Spark can make use of HDFS
       o Bright Computing is currently investigating HDFS alternatives – e.g., Ceph and Lustre
     ● Bright supports Hadoop and Spark
       o Includes monthly updates to ensure the platform is maintained
       o Technical support plus product documentation
  11. In the client tier, an IPython Notebook serves as the GUI for time-series analysis via Spark-enabled Thunder.
  12. Notes:
     ● Running an IPython Notebook server - enables remote access via a Web browser to a Spark cluster
     ● Interfacing with C/C++ - see, e.g., https://scipy-lectures.github.io/advanced/interfacing_with_c/interfacing_with_c.html
  13. Thunder’s method for cross-correlation within its class for time-series analysis.
  14. This slide was first presented at the 2015 Rice University Oil and Gas HPC Workshop in March 2015.