High Performance of Finite-Volume Methods through Increased Arithmetic Intensity
SIAM CSE 2015
J. Loffeld and J.A.F. Hittinger
3/17/2015
LLNL-PRES-668437
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Lawrence Livermore National Laboratory LLNL-PRES-668437
To get a high flop rate, you need high arithmetic intensity
[Figure: roofline plot of performance (GFlop/s, axis from 16 to 1024) versus arithmetic intensity (flop/byte, axis from 1/2 to 32). Marked: machine peak, machine balance, and lower ceilings for no FMA and no AVX. Annotations: low-order PDE stencils, FFTs, dense matrix multiply; greater concurrency.]
Increasing the order of FV methods improves AI
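The roofline bound behind this slide can be sketched in a few lines. This is a minimal illustration, assuming the simple model min(peak, AI × bandwidth); the function name `attainable_gflops` is hypothetical, and the default peak (205 Gflop/s) and machine balance (4.8 flop/byte) are the Vulcan figures quoted later in the deck.

```python
def attainable_gflops(ai, peak_gflops=205.0, balance=4.8):
    """Roofline upper bound on performance for a given arithmetic intensity.

    ai          -- arithmetic intensity in flop/byte
    peak_gflops -- machine peak in Gflop/s (assumed default: Vulcan node)
    balance     -- machine balance in flop/byte (assumed default: Vulcan node)
    """
    if ai < 0:
        raise ValueError("arithmetic intensity must be non-negative")
    # Memory bandwidth (GB/s) implied by the peak and the machine balance.
    bandwidth = peak_gflops / balance
    # Below the balance point the kernel is bandwidth bound, above it compute bound.
    return min(peak_gflops, ai * bandwidth)
```

Under this model a kernel at 1 flop/byte is limited to roughly 43 Gflop/s, while a kernel at or above 4.8 flop/byte can in principle reach the peak.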
Higher-order finite-volume methods require high-order flux approximations
• Update formula for conservation laws:
  • We are considering AI for one time step
• Approximating the flux averages gives FV method
  • High-order approximations give high-order method
High-order flux approximations use more flops
[Figure: flux update stencil]
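As a reminder of what a finite-volume update looks like, here is a minimal 1D sketch. It is not the scheme from the talk: it assumes a periodic grid, a single forward-Euler step, and a first-order upwind flux for linear advection. The names `fv_step` and `upwind_flux` are hypothetical.

```python
def fv_step(u, dt, dx, face_flux):
    """One forward-Euler finite-volume update on a periodic 1D grid.

    u[i] is the cell average in cell i. face_flux(u, i) returns the flux
    through the face between cell i and cell i+1 (indices wrap around).
    """
    n = len(u)
    # flux[i] approximates F at face i+1/2; flux[i-1] is face i-1/2.
    flux = [face_flux(u, i) for i in range(n)]
    # Each cell changes only by the difference of its two face fluxes,
    # so the sum of the cell averages is conserved.
    return [u[i] - dt / dx * (flux[i] - flux[i - 1]) for i in range(n)]


def upwind_flux(u, i, speed=1.0):
    """First-order upwind flux for linear advection with positive speed."""
    return speed * u[i]
```

A higher-order method keeps the same update and replaces `upwind_flux` with a wider stencil, which is where the additional flops per face come from.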
High-order flux approximations include more neighbor information
Eighth-order central flux:
Incorporating information from neighbors gives high flop count
• Derived upwind and central high-order schemes for 5th through 8th order
• [McCorquodale, et al. CAMCoS (2011)], [Colella, et al. J. Comput. Phys. (2011)]
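The growth of the stencil with order can be illustrated by deriving central face-value weights from cell averages. This is a sketch under stated assumptions (1D, uniform grid, face value reconstructed from an even number of neighboring cell averages, exact for polynomials of degree below the order); it is not presented as the authors' derivation, and `central_face_weights` is a hypothetical name.

```python
from fractions import Fraction


def central_face_weights(order):
    """Weights w_j so that sum_j w_j * (average of u over cell j) ~ u at a face.

    Uses `order` cells, order/2 on each side of the face, and requires
    exactness for every polynomial of degree < order.
    """
    if order < 2 or order % 2:
        raise ValueError("order must be a positive even integer")
    k = order // 2
    cells = range(-k + 1, k + 1)  # cell j covers [j-1, j]; the face sits at x = 0
    rows = []
    for m in range(order):
        # Average of x**m over cell j, and the exact face value of x**m at 0.
        row = [Fraction(j ** (m + 1) - (j - 1) ** (m + 1), m + 1) for j in cells]
        rows.append(row + [Fraction(1 if m == 0 else 0)])
    # Gauss-Jordan elimination in exact rational arithmetic.
    for col in range(order):
        pivot = next(r for r in range(col, order) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [x / pivot_value for x in rows[col]]
        for r in range(order):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]
```

Under these assumptions `central_face_weights(4)` gives -1/12, 7/12, 7/12, -1/12, and the eighth-order case uses eight cell averages per face, so both the flop count and the halo grow with the order.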
AI is most easily calculated in the limit of infinite cache size
• Assume unlimited cache
  • Useful for later refinement
  • Target AI
• Load and store data only once per cell
• Temporaries between stencils absorbed by cache
• Re-use of data allows high AI
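The infinite-cache assumption reduces AI to a ratio of flops per cell to two memory transfers per value. A minimal sketch, assuming double-precision values (8 bytes) and exactly one load and one store per cell per component; the function names and any flop counts passed in are hypothetical, not figures from the talk.

```python
def ideal_ai(flops_per_cell, components=1, bytes_per_value=8):
    """AI in the infinite-cache limit: each value is loaded once and stored once."""
    bytes_per_cell = 2 * components * bytes_per_value  # one load + one store
    return flops_per_cell / bytes_per_cell


def flops_to_reach(target_ai, components=1, bytes_per_value=8):
    """Flops per cell needed for the ideal AI to equal target_ai."""
    return target_ai * 2 * components * bytes_per_value
```

For example, under these assumptions a single-component scheme needs about 77 flops per cell to match a machine balance of 4.8 flop/byte.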
Theoretical maximum AI reaches target for sixth and eighth order
[Figure: plot with modern machine balance marked]
We would see these results in practice if machines had infinite cache space
Flops for example step: $c \left( 8\,\frac{D-1}{2} + 1 \right) D\,(N+1)(N+2)^{D-1}$
• Formulas parameterized by
  • Dimension
  • Number of components
  • Domain size
  • Flops cost of flux function
In reality, machines have finite-size caches
• Overhead from re-fetching halo cells
  • Sixth order halo width is 4
  • Eighth order halo width is 6
  • Halo cells limit minimum block size
Each block stores
values per component
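To see why the halo limits the minimum block size, one can count how much of a padded block is useful interior. This is a sketch under assumptions: cubic blocks and a halo of the quoted width on every side. It is a hypothetical count for illustration, not the storage formula from the slide, and the function names are made up.

```python
def padded_block_cells(block, halo, dims=3):
    """Cells in a cubic block of edge `block` padded by `halo` on every side."""
    return (block + 2 * halo) ** dims


def useful_fraction(block, halo, dims=3):
    """Share of the padded block that is interior (non-halo) cells."""
    return block ** dims / padded_block_cells(block, halo, dims)
```

With a halo of 6, a block of edge 8 is only about 6% interior under this count, while a block of edge 32 is about 38% interior.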
To verify the predictions, we used the hardware counters on Vulcan
• Vulcan – IBM Blue Gene/Q
  • 32 MB L2 cache (last level)
  • Cache line is 128 bytes
• BGPM for hardware counters
  • Flops counts are highly accurate
  • Overcount DRAM transfers
    - Turn off prefetching
    - Overhead from API
    - Get aliasing error from large cache line
    - Random noise
[Figure labels: machine peak 205 Gflop/s; machine balance 4.8 flop/byte]
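Turning counter readings into a measured AI is a short calculation. The sketch below assumes the counters report floating-point operations and DRAM traffic in whole 128-byte cache lines; the function and parameter names are hypothetical and do not correspond to BGPM calls.

```python
def line_traffic(nbytes, line_bytes=128):
    """Bytes actually moved when nbytes of aligned data are touched, in whole lines."""
    return -(-nbytes // line_bytes) * line_bytes  # ceiling division


def measured_ai(flops, lines_read, lines_written, line_bytes=128):
    """Arithmetic intensity from a flop count and DRAM cache-line counts."""
    traffic = (lines_read + lines_written) * line_bytes
    if traffic == 0:
        raise ValueError("no DRAM traffic recorded")
    return flops / traffic
```

Because transfers happen in whole lines, touching 352 bytes (44 doubles) moves at least 384 bytes, which is one way a large cache line can inflate the measured traffic.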
Measured AI with ND cache blocking compares well to theory
• Higher order methods have wider stencils
• Blocks need wide halos
• Less efficient cache reuse
[Figure: theoretical and measured AI for fourth-, sixth-, and eighth-order methods, with modern machine balance marked]
Because of halo, 3D blocking requires too much cache space
• Need block length about 32 to keep overhead modest (on 128^3 size domain)
  • For eighth-order, 1.55x
  • For sixth-order, 1.34x
• Each block requires
• For 5-component system (e.g. Euler), need 5 MB cache per 32-wide block
  • Current processors have ~2 to 2.5 MB/core
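The quoted overheads can be reproduced with a simple traffic model. This is one reading that happens to match the numbers, offered as an assumption rather than the authors' stated formula: every block re-reads its halo on each side that borders another block, no halo is read past the domain boundary, each cell is stored once, and overhead is total traffic divided by the ideal one load plus one store. The function names are hypothetical.

```python
def axis_fetch_ratio(domain, block, halo):
    """Cells read along one axis, relative to the domain length."""
    if domain % block:
        raise ValueError("block must divide domain")
    nblocks = domain // block
    # Each of the (nblocks - 1) internal block boundaries is crossed from both sides.
    return (domain + 2 * halo * (nblocks - 1)) / domain


def traffic_overhead(domain, block, halo, dims=3):
    """Total (loads + stores) relative to one load and one store per cell."""
    loads = axis_fetch_ratio(domain, block, halo) ** dims
    stores = 1.0
    return (loads + stores) / 2.0
```

With a 128^3 domain and blocks of edge 32, this model gives about 1.55 for a halo of 6 (eighth order) and about 1.34 for a halo of 4 (sixth order).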
However, vertical iteration of rectangular cache blocks can improve cache usage
• Successively evaluate blocks in columns
  • No re-fetching of halo in z direction
• Storage per block:
• For 8 × 32^2 blocks in 128^3 size domain:

Order  Overhead  Size    AI
6      1.21x     1.5 MB  13.6
8      1.33x     2.1 MB  21.8

High AI with realistic cache size
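The effect of sweeping blocks in columns can be sketched with a simple traffic model. The assumptions are labeled: halos are re-read only in x and y (across internal block boundaries, none past the domain edge), the z halo stays in cache during the sweep, and each cell is stored once. This is an illustrative model, not the authors' formula; `column_overhead` is a hypothetical name.

```python
def column_overhead(domain, block_xy, halo):
    """Traffic overhead when blocks are swept in z so that z-halos stay in cache."""
    if domain % block_xy:
        raise ValueError("block_xy must divide domain")
    nblocks = domain // block_xy
    per_axis = (domain + 2 * halo * (nblocks - 1)) / domain
    loads = per_axis ** 2       # halo re-fetched in x and y only
    return (loads + 1.0) / 2.0  # stores happen once per cell
```

For a 128^3 domain with 32-wide columns this model gives about 1.21 for a halo of 4 and about 1.32 for a halo of 6, close to the 1.21x and 1.33x in the table.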
Summary:
• Derived high-order finite-volume schemes
• Conducted AI analysis that shows high AI can be obtained with realistic cache sizes
[Figure: plot marking machine peak and machine balance]
Current and future work:
• AI is an important metric for on-node utilization, but it does not equal performance
  • Latency, concurrency, cache blocking
  • [Olschanowsky et al. SC (2014)] for 4th order
• Need to consider ways to reduce halo width to further reduce overhead
• Include nonlinear limiting in the flux AI analysis
  • Will further increase ops without increasing data transfers

  • 2. To get high flops rate, you need high arithmetic intensity. [Figure: performance (GFlop/s) versus arithmetic intensity (flop/byte); labels: Machine peak, Machine balance, No FMA, No AVX, Low-order PDE stencils, FFTs, Dense matrix multiply, Greater concurrency.] Increasing order of FV methods improves AI.
  • 3. Higher-order finite-volume methods require high-order flux approximations. Update formula for conservation laws: [equation] (we are considering AI for one time step). Approximating the flux averages gives FV method; high-order approximations give high-order method. High-order flux approximations use more flops. [Figure: flux update stencil.]
  • 4. High-order flux approximations include more neighbor information. Eighth-order central flux: [equation]. Incorporating information from neighbors gives high flop count. Derived upwind and central high-order schemes for 5th through 8th order. [McCorquodale, et al. CAMCS (2011)], [Colella, et al. J. Comput. Phys. (2011)]
  • 5. AI is most easily calculated in the limit of infinite cache size. Assume unlimited cache (useful for later refinement; target AI). Load and store data only once per cell. Temporaries between stencils absorbed by cache. Re-use of data allows high AI.
  • 6. Theoretical maximum AI reaches target for sixth and eighth order. [Plot reference line: modern machine balance.] We would see these results in practice if machines had infinite cache space. Flops for example step: 𝑐 8 𝐷 − 1 2 + 1 𝐷(𝑁 + 1)(𝑁 + 2) 𝐷−1. Formulas parameterized by dimension, number of components, domain size, and flops cost of flux function.
  • 7. In reality, machines have finite-size caches. Overhead from re-fetching halo cells: sixth order halo width is 4; eighth order halo width is 6; halo cells limit minimum block size. Each block stores [equation] values per component.
  • 8. To verify the predictions, we used the hardware counters on Vulcan. Vulcan, an IBM Blue Gene/Q: 32 MB L2 cache (last level); cache line is 128 bytes. BGPM for hardware counters: flops counts are highly accurate; overcount DRAM transfers (turn off prefetching; overhead from API; get aliasing error from large cache line; random noise). Machine peak: 205 Gflop/s. Machine balance: 4.8 flop/byte.
  • 9. Measured AI with ND cache blocking compares well to theory. Higher order methods have wider stencils; blocks need wide halos; less efficient cache reuse. [Plot: theoretical and measured curves for fourth, sixth, and eighth order, with modern machine balance as reference.]
  • 10. Because of halo, 3D blocking requires too much cache space. Need block length about 32 to keep overhead modest: for eighth-order, 1.55x; for sixth-order, 1.34x. Each block requires [equation]. For 5-component system (e.g. Euler), need 5 MB cache per 32-wide block; current processors have ~2 to 2.5 MB/core. (On 128^3 size domain.)
  • 11. However, vertical iteration of rectangular cache blocks can improve cache usage. Successively evaluate blocks in columns: no re-fetching of halo in z direction. Storage per block: [equation]. For 8 × 32^2 blocks in 128^3 size domain:
        Order  Overhead  Size    AI
        6      1.21x     1.5 MB  13.6
        8      1.33x     2.1 MB  21.8
    High AI with realistic cache size.
  • 12. Summary: derived high-order finite-volume schemes; conducted AI analysis that shows high AI can be obtained with realistic cache sizes. [Figure labels: Machine peak, Machine balance.] Current and future work: AI is an important metric for on-node utilization, but it does not equal performance (latency, concurrency, cache blocking; [Olschanowski et al. SC (2014)] for 4th order). Need to consider ways to reduce halo width to further reduce overhead. Include nonlinear limiting in the flux AI analysis; this will further increase ops without increasing data transfers.

Editor's Notes

  1. Kernels are memory bound. Cores are 90% idle. Roofline model explains phenomenon. Relates performance to AI and CPU features.
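
The roofline relation mentioned in this note can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the simplest two-ceiling roofline (a flat compute ceiling plus one bandwidth slope), takes the machine peak and machine balance quoted on slide 8 and the AI values from the slide 11 table, and adds a hypothetical low-order kernel at 0.5 flop/byte purely for contrast. The function name `attainable_gflops` is made up for this sketch.

```python
def attainable_gflops(ai, peak_gflops, balance):
    """Roofline bound on performance for a kernel with arithmetic intensity `ai`.

    Assumes a two-ceiling roofline: a flat compute ceiling at `peak_gflops`
    and a memory slope equal to the bandwidth implied by the machine balance.
    """
    bandwidth_gb_s = peak_gflops / balance  # machine balance = peak / bandwidth
    return min(peak_gflops, ai * bandwidth_gb_s)


PEAK = 205.0    # Gflop/s, machine peak quoted on slide 8
BALANCE = 4.8   # flop/byte, machine balance quoted on slide 8

# 13.6 and 21.8 come from the slide 11 table; 0.5 is a hypothetical
# low-order stencil value chosen only for illustration.
for label, ai in [("hypothetical low-order", 0.5),
                  ("sixth order", 13.6),
                  ("eighth order", 21.8)]:
    bound = attainable_gflops(ai, PEAK, BALANCE)
    regime = "memory bound" if ai < BALANCE else "compute bound"
    print(f"{label}: AI={ai} flop/byte -> at most {bound:.1f} Gflop/s ({regime})")
```

Under these assumptions, any kernel whose AI is below the machine balance is capped by memory bandwidth (the hypothetical 0.5 flop/byte kernel at roughly a tenth of peak), while the sixth- and eighth-order AI values sit above the balance point and are capped only by the compute ceiling. As slide 12 notes, this is an upper bound and does not by itself predict achieved performance.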