Dongwook Lee
Applied Mathematics & Statistics
University of California, Santa Cruz
High Performance,
Massively Parallel Computing with
FLASH
AMS 250, June 2, 2015
FLASH Simulation of a 3D Core-collapse Supernova
Courtesy of S. Couch
MIRA, BG/Q, Argonne National Lab
49,152 nodes, 786,432 cores
OMEGA Laser (US)
Parallel Computing in 1917
Richardson (1881-1953) introduced a principle of parallel computation, using human computers to carry out weather predictions
= "High-Performance" Scientific Computation
HPC in 21st Century
‣ Goal: To solve large problems in science, engineering, or business using multiphysics, multiscale simulations on large-scale computing platforms
‣Hardware: Currently petaflop/s (moving toward exaflop/s), heterogeneous architectures, inter-node and intra-node parallelism
‣Software: MPI, OpenMP (fine vs. coarse grained), discretizers, partitioners,
solvers and integrators, grid mesh, grid adaptivity, load balancing, UQ, data
compression, config system, compilers, debuggers, profilers, translators,
performance optimizers, I/O, workflow controllers, visualization systems, etc…
‣Hardware & software: concurrent programming models, co-design, fault monitoring, etc…
Flop/s Rate Increase since 1988
‣ACM Gordon Bell Prize
‣ 1 Gigaflop/s — structural simulation, 1988
‣ 1 Teraflop/s — compressible turbulence, 1998
‣ 1 Petaflop/s — molecular dynamics simulation, 2008
‣A total of 6 orders of magnitude of improvement over those two decades:
‣ computer engineering (only about two orders of magnitude per core)
‣ concurrency (the remaining four orders of magnitude)
High Performance Computing (HPC)
‣The tension between computation & memory (compute capability has grown much faster than memory bandwidth) brings a paradigm shift in numerical algorithms for HPC
‣To enable scientific computing on HPC architectures:
▪ efficient parallel computing (e.g., data parallelism, task parallelism, MPI, multi-threading, GPU accelerators, etc.)
▪ better numerical algorithms for HPC
Astrophysics Application
Gravitationally Confined Detonation of Type Ia SNe:
Calder et al. (2003); Calder & Lamb (2004); Townsley et al. (2007); Plewa (2007); Plewa and Kasen (2007); Jordan et al. (2008); Meakin et al. (2009); Jordan et al. (2012)
https://vimeo.com/40696524
Astrophysics Application
Large-scale FLASH simulations of Buoyancy-driven Turbulent Nuclear Combustion
https://vimeo.com/40691923
In collaboration with research teams at the University of Chicago & Oxford University
Laboratory Astrophysics: HEDP
3D Simulation
Aerodynamics: Supersonic Airflow
Cyclic Relationship
[Diagram: theory, experiment, and scientific computation/simulation linked in a cycle through validation and verification]
Scientific Computing Tasks
Science Problem (IC, BC, ODE/PDE) → Simulator (code, computer) → Results (validation, verification, analysis)
Roles of Scientific Computing
‣Scientific simulations of multiphysics, multiscale phenomena have enabled us to enhance our scientific understanding via
‣ advances in modeling and algorithms
‣ growth of computing resources
Challenges in HPC
‣ Maximizing the scientific outcome of simulations,
especially on high-performance computing (HPC)
resources, requires many trade-offs
‣ Scientists, engineers, & software developers are often challenged to push into previously unexplored regimes in both physics and computing to make the most of large HPC resources
Today’s Topics
‣ I am going to present a case study with the FLASH
code on the following topics:
‣ Code architecture
‣ parallelization (MPI & threading)
‣ optimization (algorithm tweaking & improvements)
‣ FLASH performance on HPC
FLASH Code
‣FLASH is a free, open-source code for astrophysics and HEDP
(http://flash.uchicago.edu)
▪ modular, multi-physics, adaptive mesh refinement (AMR), parallel
(MPI & OpenMP), finite-volume Eulerian compressible code for
solving hydrodynamics and MHD
▪ professionally software engineered and maintained (daily
regression test suite, code verification/validation), inline/online
documentation
▪ FLASH runs on a wide range of platforms, from laptops to peta-scale supercomputers such as IBM BG/P and BG/Q, with good scaling to over a hundred thousand processors
• Over 1.2 million lines of code (Fortran, C, Python)
• 25% are comments
• Extensive code documentation in the user's manual (~500 pages)
• Great scaling to ~100,000 processors
FLASH around the World
[World map of FLASH user groups: US, Canada, UK, Netherlands, Germany, Italy, …]
Example: Dalla Vecchia C., Bower R. G., Theuns T., Balogh M. L., Mazzotta P., Frenk C. S., "Quenching cluster cooling flows with recurrent hot plasma bubbles", Monthly Notices of the Royal Astronomical Society, 355(3), 995, December 2004, doi:10.1111/j.1365-2966.2004.08381.x
Papers using FLASH
year count
2000 10
2001 7
2002 11
2003 21
2004 46
2005 48
2006 68
2007 72
2008 93
2009 98
2010 93
2011 103
2012 122
2013 12
total 804
[Bar chart: number of papers per year vs. year]
downloads > 8500; authors > 1500; papers > 1000
Research Applications
‣Major Research Applications
‣thermonuclear flashes
‣high energy density physics (HEDP): B-field amplification
‣fluid-structure interaction
‣star formation
‣star-star & star-planets interactions
‣cosmology
‣galaxy & galaxy cluster simulations
‣turbulence
‣…
Scientific Simulations using FLASH
[Image montage: cosmological cluster formation; supersonic MHD turbulence; Type Ia SN; RT (Rayleigh-Taylor); CCSN; ram pressure stripping; laser slab; rigid-body structure; accretion torus; LULI/Vulcan experiments on B-field generation/amplification]
History
‣1997: Founding of the ASCI (Accelerated Strategic Computing Initiative) FLASH Center
‣ thermonuclear flashes on neutron star and white dwarf (WD) surfaces
‣2000: first FLASH code release
‣ Fryxell, Olson, Ricker, Timmes, Zingale, Lamb, MacNeice, Rosner, Truran, Tufo (the FLASH paper, ApJS)
‣PPM hydro finite-volume (FVM) code, no MHD
‣no self-gravity
‣AMR using PARAMESH
History
‣2002: Major revision, updated to version 2.3
‣MHD solver based on the 8-wave scheme: directionally split,
FVM (Powell, Roe, Linde, Gombosi, De Zeeuw, 1999)
‣self-gravity/Poisson solver (Ricker)
‣tracer particles
‣cosmology
‣HDF5 I/O
History
‣2008: completely restructured version 3
‣unsplit HD & USM-MHD solvers (Dongwook, JCP 2009; 2013)
‣more flexible grid structures (UG & AMR)
‣decentralized “database” structure
‣reorganized directory structure
‣2011: Version 4 (latest release: ver. 4.2.2, 2014)
‣full 3D AMR for MHD
‣many new physical modules
‣e.g., implicit diffusion solver for thermal conduction, energy
deposition via laser, 3-temperature for HEDP, solid-boundary
interface
Out-of-the-Box Examples
Domain Decomposition
‣ Adaptive Mesh Refinement (w/ Paramesh)
▪conventional parallelism via
MPI (Message Passing Interface)
▪domain decomposition distributed over
multiple processor units
▪distributed memory (cf. shared memory)
[Figures: a single block; uniform grid; oct-tree-based block AMR; patch-based AMR]
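A minimal sketch, in C rather than FLASH's Fortran, of the MPI domain-decomposition idea above: each rank in a Cartesian communicator owns one block of the global grid in distributed memory. The one-block-per-rank layout and all names here are illustrative only, not FLASH/Paramesh code.

/* Minimal sketch (not FLASH source): decomposing a 2-D grid across
 * MPI ranks with a Cartesian communicator. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Let MPI choose a 2-D process grid, e.g. 4 ranks -> 2 x 2. */
    int dims[2] = {0, 0};
    MPI_Dims_create(nprocs, 2, dims);

    int periods[2] = {0, 0};          /* non-periodic boundaries */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

    int coords[2];
    MPI_Cart_coords(cart, rank, 2, coords);

    /* Each rank owns one block of the global grid (distributed memory);
     * ghost cells would be exchanged with neighbors, e.g. via MPI_Sendrecv. */
    printf("rank %d owns block (%d,%d) of a %d x %d decomposition\n",
           rank, coords[0], coords[1], dims[0], dims[1]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with, say, mpirun -n 4, each rank reports its block coordinates; in FLASH the analogous bookkeeping is handled by Paramesh, which assigns many blocks to each rank.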
FLASH Grid Structures
‣Two ways to set up the grid
‣AMR using PARAMESH (MacNeice et al., 2000)
‣block-structured, oct-tree based AMR
‣support for the alternative Chombo (Colella et al.) library is under development
‣UG (uniform grid) without AMR overhead
FLASH AMR Grid Structures
‣Oct-tree structure where each branch of the tree produces leaf blocks
‣FLASH evolves the solution only on the leaf blocks
‣Refinement ratio is a factor of 2 (so each refined block has 2^d children in d dimensions)
‣Mesh can be refined or de-refined adaptively (customizable)
‣Morton space-filling curve (Warren and Salmon, 1993) for load balancing, sketched below
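A minimal sketch of the Morton (Z-order) key computation, assuming 2-D integer block coordinates; Paramesh's actual implementation differs, but the idea is the same: sort blocks by bit-interleaved keys and hand contiguous segments of the resulting curve to successive MPI ranks, so spatially nearby blocks stay together.

/* Minimal sketch (not the Paramesh implementation): interleaving the
 * bits of 2-D block coordinates gives a Morton key. */
#include <stdint.h>
#include <stdio.h>

/* Spread the low 16 bits of x so a zero bit sits between each pair. */
static uint32_t spread_bits(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Morton (Z-order) key for a block at integer coordinates (ix, iy). */
static uint32_t morton2d(uint32_t ix, uint32_t iy)
{
    return spread_bits(ix) | (spread_bits(iy) << 1);
}

int main(void)
{
    /* Blocks listed by increasing key trace a Z-shaped space-filling
     * curve; contiguous pieces of the curve go to successive ranks. */
    for (uint32_t iy = 0; iy < 4; ++iy)
        for (uint32_t ix = 0; ix < 4; ++ix)
            printf("block (%u,%u) -> key %u\n", ix, iy, morton2d(ix, iy));
    return 0;
}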
Tested Platforms & Support
‣FLASH has been tested on a variety of machines, platforms, compilers, operating systems, etc.
[Screenshot: computer alias names and prototype Makefiles, e.g. Linux, Mac OS X, …]
FLASH Directory Structures
‣Most of the code is in Fortran 90
‣highly modular, with a set of coding rules
‣e.g., API, naming conventions, decentralized database, etc.
‣configuration via the setup Python script
Libraries/Software
‣Additional libraries/software required by FLASH:
‣Fortran and C compilers with OpenMP support
‣MPI
‣HYPRE for implicit diffusion
‣HDF5
FLASH Parallelizations
‣Two levels of parallelism (a minimal sketch follows this list):
‣ inter-node parallelism: domain decomposition with MPI
(distributed memory)
‣Paramesh or Chombo
‣ intra-node parallelism with OpenMP (shared memory)
‣thread block list
‣thread within block
‣More parallelization…
‣Parallel I/O with HDF5
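A minimal hybrid MPI+OpenMP sketch in C (not FLASH source) of the two levels combined: MPI ranks for distributed memory across nodes, OpenMP threads for shared memory within a node. MPI_THREAD_FUNNELED is assumed, i.e. only the main thread makes MPI calls.

/* Minimal hybrid MPI + OpenMP sketch. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Request MPI_THREAD_FUNNELED: only the main thread calls MPI. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Inter-node (distributed memory): each rank owns a set of blocks.
     * Intra-node (shared memory): the threads of a rank share them. */
    #pragma omp parallel
    {
        #pragma omp critical
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

Requesting MPI_THREAD_FUNNELED keeps the MPI library's threading requirements minimal while still allowing threaded compute kernels inside each rank.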
FLASH Strong vs. Weak Scaling
BG/P to BG/Q transition
‣Intrepid BG/P
‣4 cores/node, 2GB/node, 40,960 nodes
‣FLASH has been run on Intrepid for the last several years
‣scales to the whole machine
‣MPI-only is sufficient
‣Mira BG/Q
‣4 hardware threads/core, 16 cores/node, 16 GB/node, 49,152 nodes
‣MPI-only approach not suitable for BG/Q
‣OpenMP directives have recently been added to FLASH to take advantage of the additional intra-node parallelism
[BG/Q node: 16 cores/node, 16 GB/node, 4 hardware threads/core]
FLASH Threading
‣Blocks 12-17 are assigned to a single MPI rank
‣6 total blocks
‣5 leaf & 1 parent (#13)
‣Mesh is divided into blocks of
fixed size
‣Oct-tree hierarchy
‣Blocks are assigned to MPI
ranks
FLASH Threading
‣FLASH solvers update the
solution only on local leaf blocks
‣FLASH uses multiple threads
to speed up the solution update
Threading Strategy 1
‣Assign different blocks to
different threads
‣Assuming 2 threads per MPI
rank
‣thread 0 (blue) updates 3 full
blocks — 72 cells
‣thread 1 (yellow) updates 2
full blocks — 48 cells
‣'thread block list': coarse-grained threading (see the sketch below)
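A minimal sketch of the 'thread block list' strategy in C; NBLOCKS, NCELLS, blocks, is_leaf and update_block are hypothetical placeholders, not FLASH symbols. The OpenMP work unit is a whole leaf block.

/* Coarse-grained threading sketch: each thread updates whole blocks. */
#define NBLOCKS 6
#define NCELLS  24                 /* cells per block in this toy example */

static double blocks[NBLOCKS][NCELLS];
static int is_leaf[NBLOCKS] = {1, 0, 1, 1, 1, 1};   /* block 1 is a parent */

static void update_block(double *cells, int ncells)
{
    for (int i = 0; i < ncells; ++i)
        cells[i] += 1.0;           /* stand-in for the real solver update */
}

void advance_coarse(void)
{
    /* Whole blocks are the unit of work; dynamic scheduling helps when
     * the number of leaf blocks does not divide evenly among threads. */
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < NBLOCKS; ++b)
        if (is_leaf[b])
            update_block(blocks[b], NCELLS);
}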
Threading Strategy 2
‣Assign different cells from the
same block to different threads
‣Assuming 2 threads per MPI
rank
‣thread 0 (blue) updates 5
partial blocks — 60 cells
‣thread 1 (yellow) updates 5
partial blocks — 60 cells
‣'thread within block': fine-grained threading (see the sketch below)
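A minimal sketch of the 'thread within block' strategy, using the same hypothetical placeholders as the previous sketch: the block loop stays serial on each MPI rank, and the threads split the cell loop inside every leaf block.

/* Fine-grained threading sketch: threads share the cells of one block. */
#define NBLOCKS 6
#define NCELLS  24

static double blocks[NBLOCKS][NCELLS];
static int is_leaf[NBLOCKS] = {1, 0, 1, 1, 1, 1};

void advance_fine(void)
{
    for (int b = 0; b < NBLOCKS; ++b) {
        if (!is_leaf[b])
            continue;
        /* Cells of a single block are divided among the threads, so even
         * a handful of blocks per rank keeps every thread busy. */
        #pragma omp parallel for
        for (int i = 0; i < NCELLS; ++i)
            blocks[b][i] += 1.0;   /* stand-in for the real solver update */
    }
}

Because the work unit is a cell rather than a block, the threads stay balanced even when a rank holds only a few blocks, which matches the strong-scaling behavior reported on the next slides.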
Strong Scaling of RT Flame
Strong Scaling Results
‣Good strong scaling for both multithreading strategies
‣Only 3 blocks of 16^3 cells on some MPI ranks for the 4096
MPI rank calculation (256 nodes)
‣Coarse-grained threading needs the number of blocks >> the number of threads to avoid load imbalance
‣Finer-grained threading performs slightly better
‣Better load balancing within an MPI rank
‣Performance advantage increases as work becomes more finely
distributed
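For reference, the strong-scaling speedup and parallel efficiency behind results like these are conventionally defined, for a fixed problem size with time to solution T(p) on p cores (or MPI ranks x threads), as

S(p) = T(1) / T(p),    E(p) = S(p) / p = T(1) / (p T(p))

Ideal strong scaling means S(p) = p, i.e. E(p) = 1.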
Further Optimizations
‣For instance, reordering arrays in kernels (a layout sketch follows) reduced the time to solution from 184 sec to 156 sec
[Profile of the affected kernel: before 4.79 sec, after 0.37 sec]
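A minimal, hypothetical illustration in C of the kind of array reordering meant here (FLASH itself is Fortran, where the leftmost index is the contiguous one; C is the reverse): with the innermost-loop index on the contiguous array dimension the kernel streams through memory, while the other layout jumps by large strides and thrashes the cache.

/* Hypothetical sketch, not the FLASH kernel: same update, two layouts. */
#define NVAR 8
#define NX   64
#define NY   64

static double u_bad[NVAR][NY][NX];   /* variable index outermost (strided) */
static double u_good[NY][NX][NVAR];  /* variable index contiguous */

void kernel_strided(void)            /* large-stride access pattern */
{
    for (int j = 0; j < NY; ++j)
        for (int i = 0; i < NX; ++i)
            for (int v = 0; v < NVAR; ++v)
                u_bad[v][j][i] *= 0.5;     /* stand-in for the real update */
}

void kernel_contiguous(void)         /* unit-stride access pattern */
{
    for (int j = 0; j < NY; ++j)
        for (int i = 0; i < NX; ++i)
            for (int v = 0; v < NVAR; ++v)
                u_good[j][i][v] *= 0.5;
}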
Further Optimizations
‣FLASH re-gridding optimization in Paramesh
Conclusion
‣It is challenging to adapt a large code like FLASH, with AMR and multiple physics modules, to the next generation of many-core architectures
‣ FLASH's successful approach has been to add OpenMP directives, allowing a large, highly capable piece of software to run efficiently on the BG/Q platform
‣coarse-grained and fine-grained threading strategies have been explored and tested
‣fine-grained threading performs better
‣Extra optimizations can further speed up the code