LLNL-PRES-668437
This work was performed under the auspices of the U.S. Department
of Energy by Lawrence Livermore National Laboratory under Contract
DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
High Performance of Finite-Volume Methods
through Increased Arithmetic Intensity
SIAM CSE 2015
J. Loffeld and J.A.F. Hittinger
3/17/2015
Lawrence Livermore National Laboratory LLNL-PRES-668437
To get a high flop rate, you need high arithmetic intensity
[Figure: roofline plot of performance (GFlop/s, 16 to 1024) against arithmetic intensity (flop/byte, 1/2 to 32). Labels: machine peak, machine balance, no FMA, no AVX, greater concurrency. Kernels marked: low-order PDE stencils, FFTs, dense matrix multiply.]
Increasing order of FV methods improves AI
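The bound behind this slide can be sketched in a few lines of Python. This is the standard roofline estimate, attainable rate = min(peak, AI × bandwidth), not code from the talk. The function name is hypothetical, and the peak (205 Gflop/s) and machine balance (4.8 flop/byte) are the Vulcan figures quoted later in the deck.

```python
def attainable_gflops(ai, peak_gflops, balance):
    """Roofline bound for a kernel with arithmetic intensity `ai` (flop/byte).

    `balance` is the machine balance in flop/byte, so the implied memory
    bandwidth is peak_gflops / balance (GB/s).
    """
    bandwidth_gbs = peak_gflops / balance
    return min(peak_gflops, ai * bandwidth_gbs)


if __name__ == "__main__":
    peak, balance = 205.0, 4.8  # figures quoted for Vulcan in this deck
    for ai in (0.5, 1, 2, 4, 8, 16, 32):
        rate = attainable_gflops(ai, peak, balance)
        print(f"AI = {ai:>4} flop/byte -> at most {rate:6.1f} Gflop/s")
```

Below the machine balance the bound grows linearly with AI; above it the kernel is limited by peak flops, which is why raising AI matters for stencil codes.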
Higher-order finite-volume methods require high-order flux approximations
▪ Update formula for conservation laws:
• We are considering AI for one time step
▪ Approximating the flux averages gives FV method
• High-order approximations give high-order method
High-order flux approximations use more flops
[Figure: flux update stencil]
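As an illustration of a conservative update, here is a minimal 1D sketch in the standard form u_i^{n+1} = u_i^n - (dt/h)(F_{i+1/2} - F_{i-1/2}). The slide's own formula is not recoverable from the transcript, so this form, the forward-Euler time step, the periodic grid, and the first-order upwind flux are all assumptions made for the example, and the function names are hypothetical.

```python
import numpy as np


def fv_step(u, face_flux, dt, h):
    """One forward-Euler step of a 1D conservative finite-volume update.

    u         : cell averages on a periodic grid, shape (n,)
    face_flux : callable returning the flux at the left face of every cell
    """
    f_left = face_flux(u)           # F_{i-1/2} for each cell i
    f_right = np.roll(f_left, -1)   # F_{i+1/2} is the left face of cell i+1
    return u - dt / h * (f_right - f_left)


def upwind_flux(u, a=1.0):
    """First-order upwind flux for u_t + a u_x = 0 with a > 0 (illustrative only)."""
    return a * np.roll(u, 1)        # value from the cell to the left of the face


if __name__ == "__main__":
    n, h = 64, 1.0 / 64
    x = (np.arange(n) + 0.5) * h
    u0 = np.sin(2 * np.pi * x)
    u1 = fv_step(u0, upwind_flux, dt=0.5 * h, h=h)
    print("change in total mass:", abs(u1.sum() - u0.sum()))
```

Because each face flux is added to one cell and subtracted from its neighbor, the sum of cell averages is preserved up to rounding. A higher-order method changes only how `face_flux` is computed, which is where the extra flops come from.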
High-order flux approximations include more neighbor information
Eighth-order central flux:
Incorporating information from neighbors gives high flop count
▪ Derived upwind and central high-order schemes for 5th through 8th order
▪ [McCorquodale, et al. CAMCoS (2011)], [Colella, et al. J. Comput. Phys. (2011)]
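To make the growth in neighbor count concrete, the sketch below derives centered weights for interpolating a face value from cell averages in 1D. The deck's eighth-order flux formula is not recoverable from the transcript, and its multi-dimensional form may include further terms, so this is an assumption-laden stand-in: uniform unit cells, a face at x = 0, and exactness for polynomials up to degree order - 1. The function name is hypothetical.

```python
from fractions import Fraction


def face_weights(order):
    """Weights w_j such that sum_j w_j * <u>_j approximates u at a cell face.

    Uses `order` cells centred on the face (order must be even), unit cell
    width, face at x = 0, cell j covering [j - 1, j] for j = -k+1 .. k.
    The weights are fixed by requiring exactness for 1, x, ..., x^(order-1).
    """
    if order % 2:
        raise ValueError("centred stencil needs an even order")
    k = order // 2
    cells = range(-k + 1, k + 1)
    # Row m: cell averages of x^m; right-hand side is u(0), i.e. 1 only for m = 0.
    rows = []
    for m in range(order):
        avg = [Fraction(j ** (m + 1) - (j - 1) ** (m + 1), m + 1) for j in cells]
        rows.append(avg + [Fraction(1 if m == 0 else 0)])
    # Gauss-Jordan elimination in exact rational arithmetic.
    n = order
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [x / rows[col][col] for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]


if __name__ == "__main__":
    for order in (4, 6, 8):
        print(order, [str(w) for w in face_weights(order)])
```

For order 4 this gives (-1, 7, 7, -1)/12; for order 8 it gives (-3, 29, -139, 533, 533, -139, 29, -3)/840, so each face value draws on eight cell averages instead of four.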
AI is most easily calculated in the limit of infinite cache size
▪ Assume unlimited cache
• Useful for later refinement
• Target AI
▪ Load and store data only once per cell
▪ Temporaries between stencils absorbed by cache
▪ Re-use of data allows high AI
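Under the infinite-cache assumption, the AI of a step reduces to flops per cell divided by the bytes of one load and one store per cell. A minimal sketch follows, assuming 8-byte double-precision values; `flops_per_cell` is a hypothetical input that would come from the flop-count formulas, not a number from the deck.

```python
def ideal_ai(flops_per_cell, n_components, bytes_per_value=8):
    """AI when every cell value is loaded once and stored once.

    flops_per_cell : total flops spent updating one cell (all components)
    n_components   : number of solution components stored per cell
    """
    bytes_per_cell = 2 * n_components * bytes_per_value  # one load + one store
    return flops_per_cell / bytes_per_cell


if __name__ == "__main__":
    # Hypothetical example: 800 flops per cell for a 5-component system.
    print(ideal_ai(800, 5), "flop/byte")
```

Since the denominator is fixed by the data layout, any extra flops spent on a higher-order flux raise this target AI directly.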
Theoretical maximum AI reaches target for sixth and eighth order
[Figure: plot with the modern machine balance marked]
We would see these results in practice if machines had infinite cache space
Flops for example step:
$c_8 \left( \frac{D-1}{2} + 1 \right) D \, (N+1)(N+2)^{D-1}$
▪ Formulas parameterized by
• Dimension
• Number of components
• Domain size
• Flops cost of flux function
In reality, machines have finite-size caches
▪ Overhead from re-fetching halo cells
• Sixth order halo width is 4
• Eighth order halo width is 6
• Halo cells limit minimum block size
Each block stores
values per component
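The cost of halos can be sketched with a simple geometric estimate. The slide's own expression for the per-block storage is missing from the transcript, so the model below is an assumption: a cubic block padded by `halo` cells on both sides of every dimension. The overhead factors quoted elsewhere in the deck come from the authors' own accounting and need not match this estimate. Function names are hypothetical.

```python
def block_values(block, halo, dim=3):
    """Values per component held by a cubic block padded by `halo` cells per side."""
    return (block + 2 * halo) ** dim


def refetch_overhead(block, halo, dim=3):
    """Ratio of values read (block plus halo) to values updated (block only)."""
    return block_values(block, halo, dim) / block ** dim


if __name__ == "__main__":
    for block in (8, 16, 32, 64):
        print(block, round(refetch_overhead(block, halo=4), 3))
```

The ratio falls toward 1 as the block grows, which is the trade-off the slide describes: wider halos push the minimum useful block size up, and larger blocks need more cache.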
To verify the predictions, we used the hardware counters on Vulcan
▪ Vulcan: IBM Blue Gene/Q
• 32 MB L2 cache (last level)
• Cache line is 128 bytes
▪ BGPM for hardware counters
• Flop counts are highly accurate
• DRAM transfers are overcounted
- Turn off prefetching
- Overhead from API
- Aliasing error from large cache line
- Random noise
[Figure labels: machine peak 205 Gflop/s, machine balance 4.8 flop/byte]
Measured AI with ND cache blocking compares well to theory
[Figure: theoretical and measured AI for fourth-, sixth-, and eighth-order schemes, with the modern machine balance marked]
• Higher order methods have wider stencils
• Blocks need wide halos
• Less efficient cache reuse
Because of halo, 3D blocking requires too much cache space
▪ Need block length about 32 to keep overhead modest
• For eighth-order, 1.55x
• For sixth-order, 1.34x
▪ Each block requires
▪ For 5-component system (e.g. Euler), need 5 MB cache per 32-wide block
• Current processors have ~2 to 2.5 MB/core
On 128^3 size domain
However, vertical iteration of rectangular cache blocks can improve cache usage
▪ Successively evaluate blocks in columns
• No re-fetching of halo in z direction
▪ Storage per block:
▪ For 8 × 32^2 blocks in 128^3 size domain:

Order   Overhead   Size     AI
6       1.21x      1.5 MB   13.6
8       1.33x      2.1 MB   21.8

High AI with realistic cache size
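The column-wise traversal can be sketched as a loop nest that finishes every z-block of one (x, y) column before moving to the next column. This shows only the iteration order; the stencil kernel is omitted, and whether the shared z-halo planes actually stay in cache depends on the hardware. The function name and arguments are hypothetical.

```python
def column_sweep(n, bxy, bz):
    """Yield (x0, y0, z0) block origins, finishing each vertical column first.

    n   : cells per side of a cubic domain
    bxy : block edge length in x and y
    bz  : block thickness in z
    """
    for x0 in range(0, n, bxy):
        for y0 in range(0, n, bxy):
            # Consecutive z-blocks in one column share their z-halo planes,
            # so those planes can stay in cache between iterations.
            for z0 in range(0, n, bz):
                yield (x0, y0, z0)


if __name__ == "__main__":
    blocks = list(column_sweep(128, 32, 8))
    print(len(blocks), "blocks; first column starts with", blocks[:3])
```

With the 8 × 32^2 blocks on a 128^3 domain from the slide, this visits 16 thin blocks per column across 16 columns, so only the x and y halos are re-fetched between columns.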
Summary:
▪ Derived high-order finite-volume schemes
▪ Conducted AI analysis that shows high AI can be obtained with realistic cache sizes
[Figure labels: machine peak, machine balance]
Current and future work:
▪ AI is an important metric for on-node utilization, but it does not equal performance
• Latency, concurrency, cache blocking
• [Olschanowsky et al. SC (2014)] for 4th order
▪ Need to consider ways to reduce halo width to further reduce overhead
▪ Include nonlinear limiting in the flux AI analysis
• Will further increase ops without increasing data transfers

Mais conteúdo relacionado

Mais procurados

It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...eXascale Infolab
 
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
 SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK... SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...Chester Chen
 
Large-Margin Multiple Kernel Learning for Discriminative Features Selection a...
Large-Margin Multiple Kernel Learning for Discriminative Features Selection a...Large-Margin Multiple Kernel Learning for Discriminative Features Selection a...
Large-Margin Multiple Kernel Learning for Discriminative Features Selection a...babak hosseini
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasFlink Forward
 
The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...
The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...
The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...Spark Summit
 
Faceting optimizations for Solr
Faceting optimizations for SolrFaceting optimizations for Solr
Faceting optimizations for SolrToke Eskildsen
 
EventVisualization
EventVisualizationEventVisualization
EventVisualizationHenoch Wong
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Development and Applications of Distributed IoT Sensors for Intermittent Conn...
Development and Applications of Distributed IoT Sensors for Intermittent Conn...Development and Applications of Distributed IoT Sensors for Intermittent Conn...
Development and Applications of Distributed IoT Sensors for Intermittent Conn...InfluxData
 
Efficient register renaming and recovery for high-performance processors.
Efficient register renaming and recovery for high-performance processors.Efficient register renaming and recovery for high-performance processors.
Efficient register renaming and recovery for high-performance processors.Jinto George
 
Finding OOMS in Legacy Systems with the Syslog Telegraf Plugin
Finding OOMS in Legacy Systems with the Syslog Telegraf PluginFinding OOMS in Legacy Systems with the Syslog Telegraf Plugin
Finding OOMS in Legacy Systems with the Syslog Telegraf PluginInfluxData
 
Register renaming technique
Register renaming techniqueRegister renaming technique
Register renaming techniqueJinto George
 
First Flink Bay Area meetup
First Flink Bay Area meetupFirst Flink Bay Area meetup
First Flink Bay Area meetupKostas Tzoumas
 
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...Flink Forward
 
SignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series DatabaseSignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series DatabaseDataStax Academy
 
DISTRIBUTED PERFORMANCE ANALYSIS USING INFLUXDB AND THE LINUX EBPF VIRTUAL MA...
DISTRIBUTED PERFORMANCE ANALYSIS USING INFLUXDB AND THE LINUX EBPF VIRTUAL MA...DISTRIBUTED PERFORMANCE ANALYSIS USING INFLUXDB AND THE LINUX EBPF VIRTUAL MA...
DISTRIBUTED PERFORMANCE ANALYSIS USING INFLUXDB AND THE LINUX EBPF VIRTUAL MA...InfluxData
 
eBPF Powered Distributed Kubernetes Performance Analysis - Lorenzo Fontana, I...
eBPF Powered Distributed Kubernetes Performance Analysis - Lorenzo Fontana, I...eBPF Powered Distributed Kubernetes Performance Analysis - Lorenzo Fontana, I...
eBPF Powered Distributed Kubernetes Performance Analysis - Lorenzo Fontana, I...InfluxData
 
Reorder buf
Reorder bufReorder buf
Reorder bufAarsh Ps
 

Mais procurados (20)

It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
 
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
 SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK... SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
 
Large-Margin Multiple Kernel Learning for Discriminative Features Selection a...
Large-Margin Multiple Kernel Learning for Discriminative Features Selection a...Large-Margin Multiple Kernel Learning for Discriminative Features Selection a...
Large-Margin Multiple Kernel Learning for Discriminative Features Selection a...
 
TiReX: Tiled Regular eXpression matching architecture
TiReX: Tiled Regular eXpression matching architectureTiReX: Tiled Regular eXpression matching architecture
TiReX: Tiled Regular eXpression matching architecture
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
 
The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...
The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...
The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...
 
Faceting optimizations for Solr
Faceting optimizations for SolrFaceting optimizations for Solr
Faceting optimizations for Solr
 
EventVisualization
EventVisualizationEventVisualization
EventVisualization
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Development and Applications of Distributed IoT Sensors for Intermittent Conn...
Development and Applications of Distributed IoT Sensors for Intermittent Conn...Development and Applications of Distributed IoT Sensors for Intermittent Conn...
Development and Applications of Distributed IoT Sensors for Intermittent Conn...
 
Efficient register renaming and recovery for high-performance processors.
Efficient register renaming and recovery for high-performance processors.Efficient register renaming and recovery for high-performance processors.
Efficient register renaming and recovery for high-performance processors.
 
Finding OOMS in Legacy Systems with the Syslog Telegraf Plugin
Finding OOMS in Legacy Systems with the Syslog Telegraf PluginFinding OOMS in Legacy Systems with the Syslog Telegraf Plugin
Finding OOMS in Legacy Systems with the Syslog Telegraf Plugin
 
Register renaming technique
Register renaming techniqueRegister renaming technique
Register renaming technique
 
First Flink Bay Area meetup
First Flink Bay Area meetupFirst Flink Bay Area meetup
First Flink Bay Area meetup
 
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
 
SignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series DatabaseSignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series Database
 
CNS_poster12
CNS_poster12CNS_poster12
CNS_poster12
 
DISTRIBUTED PERFORMANCE ANALYSIS USING INFLUXDB AND THE LINUX EBPF VIRTUAL MA...
DISTRIBUTED PERFORMANCE ANALYSIS USING INFLUXDB AND THE LINUX EBPF VIRTUAL MA...DISTRIBUTED PERFORMANCE ANALYSIS USING INFLUXDB AND THE LINUX EBPF VIRTUAL MA...
DISTRIBUTED PERFORMANCE ANALYSIS USING INFLUXDB AND THE LINUX EBPF VIRTUAL MA...
 
eBPF Powered Distributed Kubernetes Performance Analysis - Lorenzo Fontana, I...
eBPF Powered Distributed Kubernetes Performance Analysis - Lorenzo Fontana, I...eBPF Powered Distributed Kubernetes Performance Analysis - Lorenzo Fontana, I...
eBPF Powered Distributed Kubernetes Performance Analysis - Lorenzo Fontana, I...
 
Reorder buf
Reorder bufReorder buf
Reorder buf
 

Destaque

З досвіду роботи вчителя української мови та літератури
З досвіду роботи вчителя української мови та літературиЗ досвіду роботи вчителя української мови та літератури
З досвіду роботи вчителя української мови та літературиnjhujdbwz
 
RogerResume2015v1
RogerResume2015v1RogerResume2015v1
RogerResume2015v1Roger Walls
 
обдаровані учні
обдаровані учніобдаровані учні
обдаровані учніnjhujdbwz
 
Леся Українка
Леся УкраїнкаЛеся Українка
Леся Українкаnjhujdbwz
 
Микола Вороний 10кл.
Микола Вороний 10кл.Микола Вороний 10кл.
Микола Вороний 10кл.njhujdbwz
 
Торговицька ЗШ І-ІІІ ст ім. Є.Ф. Маланюка
Торговицька ЗШ І-ІІІ ст ім. Є.Ф. МаланюкаТорговицька ЗШ І-ІІІ ст ім. Є.Ф. Маланюка
Торговицька ЗШ І-ІІІ ст ім. Є.Ф. Маланюкаnjhujdbwz
 
Animaciã³n tecnologia final
Animaciã³n tecnologia finalAnimaciã³n tecnologia final
Animaciã³n tecnologia finalSilvia Carmona
 
Resume_bibhu_prasad_dash
Resume_bibhu_prasad_dash Resume_bibhu_prasad_dash
Resume_bibhu_prasad_dash Bibhu Dash
 
POSmedia_kampan roku 2012
POSmedia_kampan roku 2012POSmedia_kampan roku 2012
POSmedia_kampan roku 2012Marie Machatova
 
The top 5 hearing aid myths exposed
The top 5 hearing aid myths exposed   The top 5 hearing aid myths exposed
The top 5 hearing aid myths exposed Bright Audiology
 
М та ТМ_лабораторні роботи
М та ТМ_лабораторні роботиМ та ТМ_лабораторні роботи
М та ТМ_лабораторні роботиmelnyk_olja
 
16158 євген маланюк
16158 євген маланюк16158 євген маланюк
16158 євген маланюкnjhujdbwz
 
З досвіду роботи вчителя української мови та літератури
З досвіду роботи вчителя української мови та літературиЗ досвіду роботи вчителя української мови та літератури
З досвіду роботи вчителя української мови та літературиnjhujdbwz
 

Destaque (17)

З досвіду роботи вчителя української мови та літератури
З досвіду роботи вчителя української мови та літературиЗ досвіду роботи вчителя української мови та літератури
З досвіду роботи вчителя української мови та літератури
 
RogerResume2015v1
RogerResume2015v1RogerResume2015v1
RogerResume2015v1
 
обдаровані учні
обдаровані учніобдаровані учні
обдаровані учні
 
Леся Українка
Леся УкраїнкаЛеся Українка
Леся Українка
 
Микола Вороний 10кл.
Микола Вороний 10кл.Микола Вороний 10кл.
Микола Вороний 10кл.
 
Торговицька ЗШ І-ІІІ ст ім. Є.Ф. Маланюка
Торговицька ЗШ І-ІІІ ст ім. Є.Ф. МаланюкаТорговицька ЗШ І-ІІІ ст ім. Є.Ф. Маланюка
Торговицька ЗШ І-ІІІ ст ім. Є.Ф. Маланюка
 
Animaciã³n tecnologia final
Animaciã³n tecnologia finalAnimaciã³n tecnologia final
Animaciã³n tecnologia final
 
Gaurav Resume
Gaurav ResumeGaurav Resume
Gaurav Resume
 
Resume_bibhu_prasad_dash
Resume_bibhu_prasad_dash Resume_bibhu_prasad_dash
Resume_bibhu_prasad_dash
 
POSmedia_kampan roku 2012
POSmedia_kampan roku 2012POSmedia_kampan roku 2012
POSmedia_kampan roku 2012
 
Company preso
Company presoCompany preso
Company preso
 
The top 5 hearing aid myths exposed
The top 5 hearing aid myths exposed   The top 5 hearing aid myths exposed
The top 5 hearing aid myths exposed
 
М та ТМ_лабораторні роботи
М та ТМ_лабораторні роботиМ та ТМ_лабораторні роботи
М та ТМ_лабораторні роботи
 
16158 євген маланюк
16158 євген маланюк16158 євген маланюк
16158 євген маланюк
 
З досвіду роботи вчителя української мови та літератури
З досвіду роботи вчителя української мови та літературиЗ досвіду роботи вчителя української мови та літератури
З досвіду роботи вчителя української мови та літератури
 
Kate Williams - CV
Kate Williams - CVKate Williams - CV
Kate Williams - CV
 
Poster Breakdown
Poster BreakdownPoster Breakdown
Poster Breakdown
 

Semelhante a Loffeld_SIAMCSE15

Limitations of memory system performance
Limitations of memory system performanceLimitations of memory system performance
Limitations of memory system performanceSyed Zaid Irshad
 
참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의DzH QWuynh
 
Project Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxProject Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxAkshitAgiwal1
 
Analysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsAnalysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsJames McGalliard
 
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...Fred de Villamil
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsHPCC Systems
 
On the feasibility of 40 Gbps network data capture and retention with general...
On the feasibility of 40 Gbps network data capture and retention with general...On the feasibility of 40 Gbps network data capture and retention with general...
On the feasibility of 40 Gbps network data capture and retention with general...Jorge E. López de Vergara Méndez
 
Multiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
Multiscale Dataflow Computing: Competitive Advantage at the Exascale FrontierMultiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
Multiscale Dataflow Computing: Competitive Advantage at the Exascale Frontierinside-BigData.com
 
Linac Coherent Light Source (LCLS) Data Transfer Requirements
Linac Coherent Light Source (LCLS) Data Transfer RequirementsLinac Coherent Light Source (LCLS) Data Transfer Requirements
Linac Coherent Light Source (LCLS) Data Transfer Requirementsinside-BigData.com
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceinside-BigData.com
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloadsinside-BigData.com
 

Semelhante a Loffeld_SIAMCSE15 (20)

Limitations of memory system performance
Limitations of memory system performanceLimitations of memory system performance
Limitations of memory system performance
 
Chap2 slides
Chap2 slidesChap2 slides
Chap2 slides
 
참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의
 
Chap2 slides
Chap2 slidesChap2 slides
Chap2 slides
 
Project Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxProject Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptx
 
Analysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsAnalysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific Applications
 
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
 
supercomputer
supercomputersupercomputer
supercomputer
 
Cray xt3
Cray xt3Cray xt3
Cray xt3
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
 
On the feasibility of 40 Gbps network data capture and retention with general...
On the feasibility of 40 Gbps network data capture and retention with general...On the feasibility of 40 Gbps network data capture and retention with general...
On the feasibility of 40 Gbps network data capture and retention with general...
 
Corralling Big Data at TACC
Corralling Big Data at TACCCorralling Big Data at TACC
Corralling Big Data at TACC
 
Multiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
Multiscale Dataflow Computing: Competitive Advantage at the Exascale FrontierMultiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
Multiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
 
Linac Coherent Light Source (LCLS) Data Transfer Requirements
Linac Coherent Light Source (LCLS) Data Transfer RequirementsLinac Coherent Light Source (LCLS) Data Transfer Requirements
Linac Coherent Light Source (LCLS) Data Transfer Requirements
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental science
 
The Google file system
The Google file systemThe Google file system
The Google file system
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloads
 
Tacc Infinite Memory Engine
Tacc Infinite Memory EngineTacc Infinite Memory Engine
Tacc Infinite Memory Engine
 
PraveenBOUT++
PraveenBOUT++PraveenBOUT++
PraveenBOUT++
 
Super Computers
Super ComputersSuper Computers
Super Computers
 

Mais de Karen Pao

LupoPasini_SIAMCSE15
LupoPasini_SIAMCSE15LupoPasini_SIAMCSE15
LupoPasini_SIAMCSE15Karen Pao
 
Druinsky_SIAMCSE15
Druinsky_SIAMCSE15Druinsky_SIAMCSE15
Druinsky_SIAMCSE15Karen Pao
 
Barker_SIAMCSE15
Barker_SIAMCSE15Barker_SIAMCSE15
Barker_SIAMCSE15Karen Pao
 
Myers_SIAMCSE15
Myers_SIAMCSE15Myers_SIAMCSE15
Myers_SIAMCSE15Karen Pao
 
Adams_SIAMCSE15
Adams_SIAMCSE15Adams_SIAMCSE15
Adams_SIAMCSE15Karen Pao
 
Austin_SIAMCSE15
Austin_SIAMCSE15Austin_SIAMCSE15
Austin_SIAMCSE15Karen Pao
 
Slattery_SIAMCSE15
Slattery_SIAMCSE15Slattery_SIAMCSE15
Slattery_SIAMCSE15Karen Pao
 
Dubey_SIAMCSE15
Dubey_SIAMCSE15Dubey_SIAMCSE15
Dubey_SIAMCSE15Karen Pao
 

Mais de Karen Pao (8)

LupoPasini_SIAMCSE15
LupoPasini_SIAMCSE15LupoPasini_SIAMCSE15
LupoPasini_SIAMCSE15
 
Druinsky_SIAMCSE15
Druinsky_SIAMCSE15Druinsky_SIAMCSE15
Druinsky_SIAMCSE15
 
Barker_SIAMCSE15
Barker_SIAMCSE15Barker_SIAMCSE15
Barker_SIAMCSE15
 
Myers_SIAMCSE15
Myers_SIAMCSE15Myers_SIAMCSE15
Myers_SIAMCSE15
 
Adams_SIAMCSE15
Adams_SIAMCSE15Adams_SIAMCSE15
Adams_SIAMCSE15
 
Austin_SIAMCSE15
Austin_SIAMCSE15Austin_SIAMCSE15
Austin_SIAMCSE15
 
Slattery_SIAMCSE15
Slattery_SIAMCSE15Slattery_SIAMCSE15
Slattery_SIAMCSE15
 
Dubey_SIAMCSE15
Dubey_SIAMCSE15Dubey_SIAMCSE15
Dubey_SIAMCSE15
 

Último

Microteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringMicroteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringPrajakta Shinde
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingNetHelix
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxJorenAcuavera1
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》rnrncn29
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensorsonawaneprad
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptxpallavirawat456
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPirithiRaju
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxMurugaveni B
 
Quarter 4_Grade 8_Digestive System Structure and Functions
Quarter 4_Grade 8_Digestive System Structure and FunctionsQuarter 4_Grade 8_Digestive System Structure and Functions
Quarter 4_Grade 8_Digestive System Structure and FunctionsCharlene Llagas
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
bonjourmadame.tumblr.com bhaskar's girls
bonjourmadame.tumblr.com bhaskar's girlsbonjourmadame.tumblr.com bhaskar's girls
bonjourmadame.tumblr.com bhaskar's girlshansessene
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...Universidade Federal de Sergipe - UFS
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologycaarthichand2003
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationColumbia Weather Systems
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPirithiRaju
 
PROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalPROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalMAESTRELLAMesa2
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubaikojalkojal131
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayupadhyaymani499
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 

Último (20)

Microteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringMicroteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical Engineering
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing

Loffeld_SIAMCSE15

  • 1. High Performance of Finite-Volume Methods through Increased Arithmetic Intensity
      • J. Loffeld and J.A.F. Hittinger, SIAM CSE 2015, 3/17/2015
      • LLNL-PRES-668437
      • This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
  • 2. To get a high flop rate, you need high arithmetic intensity
      • [Figure: roofline plot of Performance (GFlop/s, 16 to 1024) against Arithmetic Intensity (flop/byte, 1/2 to 32), annotated with Machine peak, Machine balance, No FMA, No AVX and Greater concurrency; kernels shown: Low-order PDE Stencils, FFTs, Dense Matrix Multiply]
      • Increasing order of FV methods improves AI
  • 3. Higher-order finite-volume methods require high-order flux approximations
      • Update formula for conservation laws:
          • We are considering AI for one time step
      • Approximating the flux averages gives FV method
          • High-order approximations give high-order method
      • High-order flux approximations use more flops
      • [Figure: flux update stencil]
  • 4. High-order flux approximations include more neighbor information
      • Eighth-order central flux:
      • Incorporating information from neighbors gives high flop count
      • Derived upwind and central high-order schemes for 5th through 8th order
      • [McCorquodale, et al. CAMCS (2011)], [Colella, et al. J. Comput. Phys. (2011)]
  • 5. AI is most easily calculated in the limit of infinite cache size
      • Assume unlimited cache
          • Useful for later refinement
          • Target AI
      • Load and store data only once per cell
      • Temporaries between stencils absorbed by cache
      • Re-use of data allows high AI
  • 6. Theoretical maximum AI reaches target for sixth and eighth order
      • [Figure: theoretical maximum AI, with a line marking Modern machine balance]
      • We would see these results in practice if machines had infinite cache space
      • Flops for example step: 𝑐 8 𝐷 − 1 2 + 1 𝐷(𝑁 + 1)(𝑁 + 2) 𝐷−1
      • Formulas parameterized by
          • Dimension
          • Number of components
          • Domain size
          • Flops cost of flux function
  • 7. In reality, machines have finite-size caches
      • Overhead from re-fetching halo cells
          • Sixth order halo width is 4
          • Eighth order halo width is 6
          • Halo cells limit minimum block size
      • Each block stores values per component
  • 8. To verify the predictions, we used the hardware counters on Vulcan
      • Vulcan: IBM Blue Gene/Q
          • 32 MB L2 cache (last level)
          • Cache line is 128 bytes
      • BGPM for hardware counters
          • Flops counts are highly accurate
          • Overcount DRAM transfers
              - Turn off prefetching
              - Overhead from API
              - Get aliasing error from large cache line
              - Random noise
      • Machine peak: 205 Gflop/s
      • Machine balance: 4.8 flop/byte
  • 9. Measured AI with ND cache blocking compares well to theory
      • [Figure: theoretical and measured AI for fourth, sixth and eighth order, with a line marking Modern machine balance]
      • Higher order methods have wider stencils
      • Blocks need wide halos
      • Less efficient cache reuse
  • 10. Because of halo, 3D blocking requires too much cache space
      • Need block length about 32 to keep overhead modest (on 128^3 size domain)
          • For eighth-order, 1.55x
          • For sixth-order, 1.34x
      • Each block requires
      • For 5-component system (e.g. Euler), need 5 MB cache per 32-wide block
          • Current processors have ~2 to 2.5 MB/core
  • 11. However, vertical iteration of rectangular cache blocks can improve cache usage
      • Successively evaluate blocks in columns
          • No re-fetching of halo in z direction
      • Storage per block:
      • For 8 × 32^2 blocks in 128^3 size domain:

          Order   Overhead   Size     AI
          6       1.21x      1.5 MB   13.6
          8       1.33x      2.1 MB   21.8

      • High AI with realistic cache size
  • 12. Summary:
      • Derived high-order finite-volume schemes
      • Conducted AI analysis that shows high AI can be obtained with realistic cache sizes
      • [Figure: plot annotated with Machine peak and Machine balance]
      Current and future work:
      • AI is an important metric for on-node utilization, but it does not equal performance
          • Latency, concurrency, cache blocking
          • [Olschanowski et al. SC (2014)] for 4th order
      • Need to consider ways to reduce halo width to further reduce overhead
      • Include nonlinear limiting in the flux AI analysis
          • Will further increase ops without increasing data transfers
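
The halo overhead factors quoted on slides 10 and 11 can be explored with a small counting model. The sketch below is an assumption, not the formula from the slides (which did not survive in this transcript): it reads the quoted halo widths (4 for sixth order, 6 for eighth) as per-side widths, assumes halo cells that fall outside the domain are not fetched, and averages one load stream (with halo re-fetch) against one store stream (without). The function names are hypothetical.

```python
def cells_fetched_1d(domain, block, halo):
    """Cells loaded along one axis when every block also loads `halo`
    cells on each side, clipped at the domain boundary (assumption)."""
    total = 0
    for start in range(0, domain, block):
        lo = max(0, start - halo)
        hi = min(domain, start + block + halo)
        total += hi - lo
    return total


def halo_overhead(domain, block, halo, blocked_dims):
    """Memory traffic relative to touching every cell exactly once.

    blocked_dims is the number of axes along which halos are re-fetched:
    3 for plain 3D blocking, 2 when blocks are swept in columns so the
    z-direction halo is reused.
    """
    load_ratio = (cells_fetched_1d(domain, block, halo) / domain) ** blocked_dims
    store_ratio = 1.0  # assumption: each cell is stored once, with no halo
    return 0.5 * (load_ratio + store_ratio)


if __name__ == "__main__":
    for order, halo in [(6, 4), (8, 6)]:
        full_3d = halo_overhead(128, 32, halo, blocked_dims=3)
        columns = halo_overhead(128, 32, halo, blocked_dims=2)
        print(f"order {order}: 3D blocking {full_3d:.2f}x, column sweep {columns:.2f}x")
```

Under these assumptions, a 128^3 domain with 32-wide blocks gives about 1.34 and 1.55 for 3D blocking, in line with slide 10, and about 1.21 and 1.32 for the column sweep, against the 1.21x and 1.33x quoted on slide 11. The model is therefore close, but it should not be read as the authors' exact accounting.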

Editor's Notes

  1. Kernels are memory bound, so cores are 90% idle. The roofline model explains this phenomenon: it relates performance to AI and CPU features.
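
The roofline relation mentioned in the note can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the standard roofline bound, attainable performance = min(peak, AI × memory bandwidth), uses the Vulcan figures quoted on slide 8 (205 Gflop/s peak, 4.8 flop/byte machine balance), and derives the bandwidth as peak divided by balance. The names are hypothetical, and the 0.5 flop/byte entry is an illustrative low value rather than a measurement from the slides.

```python
PEAK_GFLOPS = 205.0            # machine peak quoted on slide 8
BALANCE_FLOP_PER_BYTE = 4.8    # machine balance quoted on slide 8

# Assumption: machine balance = peak / bandwidth, so bandwidth = peak / balance.
BANDWIDTH_GBYTES_PER_S = PEAK_GFLOPS / BALANCE_FLOP_PER_BYTE


def attainable_gflops(ai, peak=PEAK_GFLOPS, bandwidth=BANDWIDTH_GBYTES_PER_S):
    """Roofline bound: bandwidth-limited at low AI, peak-limited at high AI."""
    return min(peak, ai * bandwidth)


def is_memory_bound(ai, balance=BALANCE_FLOP_PER_BYTE):
    """A kernel is memory bound when its AI is below the machine balance."""
    return ai < balance


if __name__ == "__main__":
    cases = [
        ("illustrative low-AI kernel", 0.5),
        ("sixth order (slide 11)", 13.6),
        ("eighth order (slide 11)", 21.8),
    ]
    for label, ai in cases:
        bound = attainable_gflops(ai)
        kind = "memory bound" if is_memory_bound(ai) else "compute bound"
        print(f"{label}: AI {ai} -> at most {bound:.1f} Gflop/s ({kind})")
```

With these numbers, a kernel at 0.5 flop/byte is capped at roughly a tenth of peak, which is consistent with the note's remark that cores are mostly idle, while the sixth- and eighth-order AI values from slide 11 sit above the machine balance.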