SlideShare uma empresa Scribd logo
1 de 35
spcl.inf.ethz.ch
@spcl_eth
Data-Centric Parallel Programming
Torsten Hoefler, Keynote at AsHES @ IPDPS’19, Rio, Brazil
Alexandros Ziogas, Tal Ben-Nun, Guillermo Indalecio, Timo Schneider, Mathieu Luisier, and Johannes de Fine Licht
and the whole DAPP team @ SPCL
https://eurompi19.inf.ethz.ch
spcl.inf.ethz.ch
@spcl_eth
2
Changing hardware constraints and the physics of computing
[1]: Marc Horowitz, Computing’s Energy Problem (and what we can do about it), ISSC 2014, plenary
[2]: Moore: Landauer Limit Demonstrated, IEEE Spectrum 2012
130nm
90nm
65nm
45nm
32nm
22nm
14nm
10nm
0.9 V [1]
32-bit FP ADD: 0.9 pJ
32-bit FP MUL: 3.2 pJ
2x32 bit from L1 (8 kiB): 10 pJ
2x32 bit from L2 (1 MiB): 100 pJ
2x32 bit from DRAM: 1.3 nJ
…
Three Ls of modern computing:
How to address locality challenges on standard architectures and programming?
D. Unat et al.: “Trends in Data Locality Abstractions for HPC Systems”
IEEE Transactions on Parallel and Distributed Systems (TPDS). Vol 28, Nr. 10, IEEE, Oct. 2017
spcl.inf.ethz.ch
@spcl_eth
3
Control in Load-store vs. Dataflow
Memory
Cache
RegistersControl
x=a+b
ld a, r1
ALU
ald b, r2 badd r1, r2
ba
x
bast r1, x Memory
+
c d y
y=(a+b)*(c+d)
a b
+
x
a b c d
a+b c+d
y
Turing Award 1977 (Backus): "Surely there must be a less primitive
way of making big changes in the store than pushing vast numbers
of words back and forth through the von Neumann bottleneck."
Load-store (“von Neumann”)
Energy per instruction: 70pJ
Source: Mark Horowitz, ISSC’14
Energy per operation: 1-3pJ
Static Dataflow (“non von Neumann”)
Very Low High
spcl.inf.ethz.ch
@spcl_eth
Still Lower
Control Locality
4
Single Instruction Multiple Data/Threads (SIMD - Vector CPU, SIMT - GPU)
Memory
Cache
RegistersControl
ALUALU
ALUALU
ALUALU
ALUALU
ALUALU
45nm, 0.9 V [1]
Random Access SRAM:
8 kiB: 10 pJ
32 kiB: 20 pJ
1 MiB: 100 pJ
Memory
+
c d ya b
+
x
a b c d
45nm, 0.9 V [1]
Single R/W registers:
32 bit: 0.1 pJ
[1]: Marc Horowitz, Computing’s Energy Problem (and what we can do about it), ISSC 2014, plenary
High Performance Computing really
became a data management challenge
spcl.inf.ethz.ch
@spcl_eth
5
Data movement will dominate everything!
Source: Fatollahi-Fard et al.
 “In future microprocessors, the energy expended for data movement will have a critical effect on
achievable performance.”
 “… movement consumes almost 58 watts with hardly any energy budget left for computation.”
 “…the cost of data movement starts to dominate.”
 “…data movement over these networks must be limited to conserve energy…”
 the phrase “data movement” appears 18 times on 11 pages (usually in concerning contexts)!
 “Efficient data orchestration will increasingly be critical, evolving to more efficient memory
hierarchies and new types of interconnect tailored for locality and that depend on
sophisticated software to place computation and data so as to minimize data movement.”
Source: NVIDIA
Source: Kogge, Shalf
spcl.inf.ethz.ch
@spcl_eth
 Well, to a good approximation how we programmed yesterday
 Or last year?
 Or four decades ago?
 Control-centric programming
 Worry about operation counts (flop/s is the metric, isn’t it?)
 Data movement is at best implicit (or invisible/ignored)
 Legion [1] is taking a good direction towards data-centric
 Tasking relies on data placement but not really dependencies (not visible to tool-chain)
 But it is still control-centric in the tasks – not (performance) portable between devices!
 Let’s go a step further towards an explicitly data-centric viewpoint
 For performance engineers at least!
6
“Sophisticated software”: How do we program today?
Backus ‘77: “The assignment statement
is the von Neumann bottleneck of programming
languages and keeps us thinking in word-at-a-time
terms in much the same way the computer’s
bottleneck does.”
[1]: Bauer et al.: “Legion: expressing locality and independence with logical regions”, SC12, 2012
spcl.inf.ethz.ch
@spcl_eth
7
Performance Portability with DataCentric (DaCe) Parallel Programming
Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
SystemDomain Scientist Performance Engineer
High-Level Program
Data-Centric Intermediate
Representation (SDFG, §3)
𝜕𝑢
𝜕𝑡
− 𝛼𝛻2 𝑢 = 0
Problem Formulation
FPGA Modules
CPU Binary
Runtime
Hardware
Information
Graph Transformations
(API, Interactive, §4)
SDFG Compiler
Transformed
Dataflow
Performance
Results
Thin Runtime
Infrastructure
GPU Binary
Python /
NumPy
𝑳 𝑹
*
*
*
*
*
*
TensorFlow
DSLs
MATLAB
SDFG Builder API
spcl.inf.ethz.ch
@spcl_eth
8
DAPP – Data Centric Programming Concepts
• Store volatile (buffers, queues, RAM) and
nonvolatile (files, I/O) information
• Can be sources or sinks of data
• Stateless functions that perform computations at
any granularity
• Data access only through ports
• Data flowing between containers and tasklets/ports
• Implemented as access, copies, streaming, …
• Map scopes provide parallelism
• States constrain parallelism outside of datatflow
Data Containers Computation
Data Movement / Dependencies Parallelism and States
spcl.inf.ethz.ch
@spcl_eth
9
A first example in DaCe Python
spcl.inf.ethz.ch
@spcl_eth
DIODE User Interface
10
Source Code Transformations
SDFG
(malleable)
SDFGGenerated Code Performance
Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
spcl.inf.ethz.ch
@spcl_eth
11
Performance for matrix multiplication on x86
SDFG
Naïve
Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
spcl.inf.ethz.ch
@spcl_eth
12
Performance for matrix multiplication on x86
SDFG
MapReduceFusionNaïve
Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
spcl.inf.ethz.ch
@spcl_eth
13
Performance for matrix multiplication on x86
SDFG
LoopReorder
MapReduceFusionNaïve
Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
spcl.inf.ethz.ch
@spcl_eth
14
Performance for matrix multiplication on x86
SDFG
BlockTiling
LoopReorder
MapReduceFusionNaïve
Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
spcl.inf.ethz.ch
@spcl_eth
15
Performance for matrix multiplication on x86
RegisterTiling
BlockTiling
LoopReorder
MapReduceFusionNaïve
Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
spcl.inf.ethz.ch
@spcl_eth
16
Performance for matrix multiplication on x86
LocalStorage
RegisterTiling
BlockTiling
LoopReorder
MapReduceFusionNaïve
Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
spcl.inf.ethz.ch
@spcl_eth
17
Performance for matrix multiplication on x86
PromoteTransient
LocalStorage
RegisterTiling
BlockTiling
LoopReorder
MapReduceFusionNaïve
Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
spcl.inf.ethz.ch
@spcl_eth
Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
18
Performance for matrix multiplication on x86
Intel MKL
OpenBLAS
25% difference
DAPP
With more tuning: 98.6% of MKLBut do we really care about MMM on x86 CPUs?
spcl.inf.ethz.ch
@spcl_eth
Hardware Mapping: Load/Store Architectures
 Recursive code generation (C++, CUDA)
 Control flow: Construct detection and gotos
 Parallelism
 Multi-core CPU: OpenMP, atomics, and threads
 GPU: CUDA kernels and streams
 Connected components run concurrently
 Memory and interaction with accelerators
 Array-array edges create intra-/inter-device copies
 Memory access validation on compilation
 Automatic CPU SDFG to GPU transformation
 Tasklet code immutable
19
spcl.inf.ethz.ch
@spcl_eth
Hardware Mapping: Pipelined Architectures
 Module generation with HDL and HLS
 Integration with Xilinx SDAccel
 Nested SDFGs become FPGA state machines
 Parallelism
 Exploiting temporal locality: Pipelines
 Exploiting spatial locality: Vectorization, replication
 Replication
 Enables parametric systolic array generation
 Memory access
 Burst memory access, vectorization
 Streams for inter-PE communication
20
spcl.inf.ethz.ch
@spcl_eth
Performance (Portability) Evaluation
 Three platforms:
 Intel Xeon E5-2650 v4 CPU (2.20 GHz, no HT)
 Tesla P100 GPU
 Xilinx VCU1525 hosting an XCVU9P FPGA
 Compilers and frameworks:
 Compilers:
GCC 8.2.0
Clang 6.0
icc 18.0.3
 Polyhedral optimizing compilers:
Polly 6.0
Pluto 0.11.4
PPCG 0.8
 GPU and FPGA compilers:
CUDA nvcc 9.2
Xilinx SDAccel 2018.2
 Frameworks and optimized libraries:
HPX
Halide
Intel MKL
NVIDIA CUBLAS, CUSPARSE, CUTLASS
NVIDIA CUB
21
spcl.inf.ethz.ch
@spcl_eth
Performance Evaluation: Fundamental Kernels (CPU)
 Database Query: roughly 50% of a 67,108,864 column
 Matrix Multiplication (MM): 2048x2048x2048
 Histogram: 8192x8192
 Jacobi stencil: 2048x2048 for T=1024
 Sparse Matrix-Vector Multiplication (SpMV): 8192x8192 CSR matrix (nnz=33,554,432)
22
99.9% of MKL8.12x faster 98.6% of MKL 2.5x faster 82.7% of Halide
spcl.inf.ethz.ch
@spcl_eth
Performance Evaluation: Fundamental Kernels (GPU, FPGA)
23
GPU
FPGA 309,000x
19.5x of Spatial
90% of CUTLASS
spcl.inf.ethz.ch
@spcl_eth
Performance Evaluation: Polybench (CPU)
 Polyhedral benchmark with 30 applications
 Without any transformations, achieves 1.43x (geometric mean) over
general-purpose compilers
24
spcl.inf.ethz.ch
@spcl_eth
Performance Evaluation: Polybench (GPU, FPGA)
 Automatically transformed from CPU code
25
GPU
(1.12x geomean speedup)
FPGA
The first full set of placed-and-routed Polybench
11.8x
spcl.inf.ethz.ch
@spcl_eth
Case Study: Parallel Breadth-First Search
 Compared with Galois and Gluon
 State-of-the-art graph processing frameworks on CPU
 Graphs:
 Road maps: USA, OSM-Europe
 Social networks: Twitter, LiveJournal
 Synthetic: Kronecker Graphs
26
Performance portability – fine, but who cares about microbenchmarks?
spcl.inf.ethz.ch
@spcl_eth
27
Remember the promise of DAPP – on to a real application!
Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
SystemDomain Scientist Performance Engineer
High-Level Program
Data-Centric Intermediate
Representation (SDFG, §3)
𝜕𝑢
𝜕𝑡
− 𝛼𝛻2 𝑢 = 0
Problem Formulation
FPGA Modules
CPU Binary
Runtime
Hardware
Information
Graph Transformations
(API, Interactive, §4)
SDFG Compiler
Transformed
Dataflow
Performance
Results
Thin Runtime
Infrastructure
GPU Binary
Python /
NumPy
𝑳 𝑹
*
*
*
*
*
*
TensorFlow
DSLs
MATLAB
SDFG Builder API
spcl.inf.ethz.ch
@spcl_eth
28
Next-Generation Transistors need to be cooler – addressing self-heating
spcl.inf.ethz.ch
@spcl_eth
 OMEN Code (Luisier et al., Gordon Bell award finalist 2011 and 2015)
 90k SLOC, C, C++, CUDA, MPI, OpenMP, …
29
Quantum Transport Simulations with OMEN
Electrons 𝑮 𝑬, 𝒌 𝒛 Phonons 𝑫 𝝎, 𝒒 𝒛
GF
SSE
SSE
Σ 𝐺 𝐸 + ℏ𝜔, 𝑘 𝑧 − 𝑞 𝑧 𝐷 𝜔, 𝑞 𝑧 𝐸, 𝑘 𝑧
Π 𝐺 𝐸, 𝑘 𝑧 𝐺 𝐸 + ℏ𝜔, 𝑘 𝑧 + 𝑞 𝑧 𝜔, 𝑞 𝑧
𝐸 ⋅ 𝑆 − 𝐻 − Σ 𝑅
⋅ 𝐺 𝑅
= 𝐼
𝐺<
= 𝐺 𝑅
⋅ Σ<
⋅ 𝐺 𝐴
𝜔2
− Φ − Π 𝑅
⋅ 𝐷 𝑅
= 𝐼
𝐷<
= 𝐷 𝑅
⋅ Π<
⋅ 𝐷 𝐴
NEGF
spcl.inf.ethz.ch
@spcl_eth
30
All of OMEN (90k SLOC) in a single SDFG – (collapsed) tasklets contain more SDFGs
𝐻
𝑘 𝑧, 𝐸
RGF
Σ≷
convergence
𝐺≷
Φ
𝑞 𝑧, 𝜔
RGF
Π≷
𝐷≷
𝑏
𝛻𝐻
𝑘 𝑧, 𝐸, 𝑞 𝑧, 𝜔, 𝑎, 𝑏
SSE
Π≷
G≷
Σ≷
D≷
Not 𝑏
𝑏
GF
SSE
𝑖++𝑖=0 𝑞 𝑧, 𝜔𝑘 𝑧, 𝐸
𝐻[0:𝑁𝑘 𝑧
] Φ[0:𝑁𝑞 𝑧
]Σ≷
[0:𝑁𝑘 𝑧
,0:𝑁𝐸]
𝐼𝑒 𝐼 𝜙
Π≷
[0:𝑁𝑞 𝑧
,
1:𝑁 𝜔]
𝐻[𝑘 𝑧] Φ[𝑞 𝑧]Σ≷
[𝑘 𝑧,E] Π≷
[𝑞 𝑧,𝜔]
𝐺≷
[𝑘 𝑧,E] 𝐷≷
[𝑞 𝑧,𝜔]𝐼Φ (CR: Sum)
𝐼Φ (CR: Sum)𝐼e (CR: Sum)
𝐼e (CR: Sum)
𝐷≷
[0:N 𝑞 𝑧
,
1:N 𝜔]
G≷
[0:𝑁𝑘 𝑧
,0:𝑁𝐸]
𝛻𝐻 G≷
D≷
Π≷
(CR: Sum)Σ≷
(CR: Sum)
Σ≷
[…]
(CR: Sum)
Π≷
[…]
(CR: Sum)
𝛻𝐻[…] G≷
[…] D≷
[…]
𝑘 𝑧, 𝐸, 𝑞 𝑧, 𝜔, 𝑎, 𝑏
𝐼e 𝐼Φ
𝑏
spcl.inf.ethz.ch
@spcl_eth
31
Zooming into SSE (large share of the runtime)
DaCe
Transform
Between 100-250x less communication at scale! (from PB to TB)
spcl.inf.ethz.ch
@spcl_eth
32
Additional interesting performance insights
Python is slow! Ok, we knew that – but compiled can be fast!
Piz Daint single node (P100)
cuBLAS can be very inefficient (well, unless you floptimize)
Basic operation in SSE (many very small MMMs)
5k atoms
Piz Daint Summit
spcl.inf.ethz.ch
@spcl_eth
33
10,240 atoms on 27,360 V100 GPUs (full-scale Summit)
- 56 Pflop/s with I/O (28% peak)
Already ~100x speedup on 25%
of Summit – the original OMEN
does not scale further!
Communication time reduced
by 417x on Piz Daint!
Volume on full-scale Summit
from 12 PB/iter  87 TB/iter
spcl.inf.ethz.ch
@spcl_eth
34
An example of fine-grained data-centric optimization (i.e., how to vectorize)
Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
spcl.inf.ethz.ch
@spcl_eth
35
Overview and wrap-up
This project has received funding from the European Research Council (ERC) under grant agreement "DAPP (PI: T. Hoefler)".

Mais conteúdo relacionado

Mais procurados

Big_Data_Heterogeneous_Programming IEEE_Big_Data 2015
Big_Data_Heterogeneous_Programming IEEE_Big_Data 2015Big_Data_Heterogeneous_Programming IEEE_Big_Data 2015
Big_Data_Heterogeneous_Programming IEEE_Big_Data 2015Junli Gu
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2Junli Gu
 
Solving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as PokerSolving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as PokerKarel Ha
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabVijay Srinivas Agneeswaran, Ph.D
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Intel® Software
 
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures Intel® Software
 
Towards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and BenchmarkingTowards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and BenchmarkingSaliya Ekanayake
 
Xian He Sun Data-Centric Into
Xian He Sun Data-Centric IntoXian He Sun Data-Centric Into
Xian He Sun Data-Centric IntoSciCompIIT
 
Java Thread and Process Performance for Parallel Machine Learning on Multicor...
Java Thread and Process Performance for Parallel Machine Learning on Multicor...Java Thread and Process Performance for Parallel Machine Learning on Multicor...
Java Thread and Process Performance for Parallel Machine Learning on Multicor...Saliya Ekanayake
 
Advanced spark deep learning
Advanced spark deep learningAdvanced spark deep learning
Advanced spark deep learningAdam Gibson
 
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...Ilham Amezzane
 
Accelerate Machine Learning Software on Intel Architecture
Accelerate Machine Learning Software on Intel Architecture Accelerate Machine Learning Software on Intel Architecture
Accelerate Machine Learning Software on Intel Architecture Intel® Software
 
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...inside-BigData.com
 
PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...
PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...
PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...LEGATO project
 
Spine net learning scale permuted backbone for recognition and localization
Spine net learning scale permuted backbone for recognition and localizationSpine net learning scale permuted backbone for recognition and localization
Spine net learning scale permuted backbone for recognition and localizationDevansh16
 
High Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersHigh Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersSaliya Ekanayake
 
Scalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data StreamsScalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data StreamsAntonio Severien
 
Intro to Machine Learning for GPUs
Intro to Machine Learning for GPUsIntro to Machine Learning for GPUs
Intro to Machine Learning for GPUsSri Ambati
 
FPGA-accelerated High-Performance Computing – Close to Breakthrough or Pipedr...
FPGA-accelerated High-Performance Computing – Close to Breakthrough or Pipedr...FPGA-accelerated High-Performance Computing – Close to Breakthrough or Pipedr...
FPGA-accelerated High-Performance Computing – Close to Breakthrough or Pipedr...Christian Plessl
 

Mais procurados (20)

Big_Data_Heterogeneous_Programming IEEE_Big_Data 2015
Big_Data_Heterogeneous_Programming IEEE_Big_Data 2015Big_Data_Heterogeneous_Programming IEEE_Big_Data 2015
Big_Data_Heterogeneous_Programming IEEE_Big_Data 2015
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2
 
Solving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as PokerSolving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as Poker
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
 
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures
 
cug2011-praveen
cug2011-praveencug2011-praveen
cug2011-praveen
 
Towards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and BenchmarkingTowards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and Benchmarking
 
Xian He Sun Data-Centric Into
Xian He Sun Data-Centric IntoXian He Sun Data-Centric Into
Xian He Sun Data-Centric Into
 
Java Thread and Process Performance for Parallel Machine Learning on Multicor...
Java Thread and Process Performance for Parallel Machine Learning on Multicor...Java Thread and Process Performance for Parallel Machine Learning on Multicor...
Java Thread and Process Performance for Parallel Machine Learning on Multicor...
 
Advanced spark deep learning
Advanced spark deep learningAdvanced spark deep learning
Advanced spark deep learning
 
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...
 
Accelerate Machine Learning Software on Intel Architecture
Accelerate Machine Learning Software on Intel Architecture Accelerate Machine Learning Software on Intel Architecture
Accelerate Machine Learning Software on Intel Architecture
 
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
 
PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...
PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...
PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...
 
Spine net learning scale permuted backbone for recognition and localization
Spine net learning scale permuted backbone for recognition and localizationSpine net learning scale permuted backbone for recognition and localization
Spine net learning scale permuted backbone for recognition and localization
 
High Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersHigh Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC Clusters
 
Scalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data StreamsScalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data Streams
 
Intro to Machine Learning for GPUs
Intro to Machine Learning for GPUsIntro to Machine Learning for GPUs
Intro to Machine Learning for GPUs
 
FPGA-accelerated High-Performance Computing – Close to Breakthrough or Pipedr...
FPGA-accelerated High-Performance Computing – Close to Breakthrough or Pipedr...FPGA-accelerated High-Performance Computing – Close to Breakthrough or Pipedr...
FPGA-accelerated High-Performance Computing – Close to Breakthrough or Pipedr...
 

Semelhante a Data-Centric Parallel Programming

Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale SupercomputerSagar Dolas
 
Ling liu part 02:big graph processing
Ling liu part 02:big graph processingLing liu part 02:big graph processing
Ling liu part 02:big graph processingjins0618
 
Parallelism Processor Design
Parallelism Processor DesignParallelism Processor Design
Parallelism Processor DesignSri Prasanna
 
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...jsvetter
 
IEEE CloudCom 2014参加報告
IEEE CloudCom 2014参加報告IEEE CloudCom 2014参加報告
IEEE CloudCom 2014参加報告Ryousei Takano
 
A Platform for Accelerating Machine Learning Applications
 A Platform for Accelerating Machine Learning Applications A Platform for Accelerating Machine Learning Applications
A Platform for Accelerating Machine Learning ApplicationsNVIDIA Taiwan
 
Real time machine learning proposers day v3
Real time machine learning proposers day v3Real time machine learning proposers day v3
Real time machine learning proposers day v3mustafa sarac
 
ParaForming - Patterns and Refactoring for Parallel Programming
ParaForming - Patterns and Refactoring for Parallel ProgrammingParaForming - Patterns and Refactoring for Parallel Programming
ParaForming - Patterns and Refactoring for Parallel Programmingkhstandrews
 
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Matej Misik
 
The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...NECST Lab @ Politecnico di Milano
 
ArnoCandelScalabledatascienceanddeeplearningwithh2o_gotochg
ArnoCandelScalabledatascienceanddeeplearningwithh2o_gotochgArnoCandelScalabledatascienceanddeeplearningwithh2o_gotochg
ArnoCandelScalabledatascienceanddeeplearningwithh2o_gotochgSri Ambati
 
Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012Masoud Nikravesh
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapePaco Nathan
 
Introduction to Software Defined Visualization (SDVis)
Introduction to Software Defined Visualization (SDVis)Introduction to Software Defined Visualization (SDVis)
Introduction to Software Defined Visualization (SDVis)Intel® Software
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersIntel® Software
 
Accelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningAccelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningDataWorks Summit
 

Semelhante a Data-Centric Parallel Programming (20)

Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
Ling liu part 02:big graph processing
Ling liu part 02:big graph processingLing liu part 02:big graph processing
Ling liu part 02:big graph processing
 
NWU and HPC
NWU and HPCNWU and HPC
NWU and HPC
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
Parallelism Processor Design
Parallelism Processor DesignParallelism Processor Design
Parallelism Processor Design
 
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
 
IEEE CloudCom 2014参加報告
IEEE CloudCom 2014参加報告IEEE CloudCom 2014参加報告
IEEE CloudCom 2014参加報告
 
A Platform for Accelerating Machine Learning Applications
 A Platform for Accelerating Machine Learning Applications A Platform for Accelerating Machine Learning Applications
A Platform for Accelerating Machine Learning Applications
 
Real time machine learning proposers day v3
Real time machine learning proposers day v3Real time machine learning proposers day v3
Real time machine learning proposers day v3
 
ParaForming - Patterns and Refactoring for Parallel Programming
ParaForming - Patterns and Refactoring for Parallel ProgrammingParaForming - Patterns and Refactoring for Parallel Programming
ParaForming - Patterns and Refactoring for Parallel Programming
 
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
 
The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...
 
ArnoCandelScalabledatascienceanddeeplearningwithh2o_gotochg
ArnoCandelScalabledatascienceanddeeplearningwithh2o_gotochgArnoCandelScalabledatascienceanddeeplearningwithh2o_gotochg
ArnoCandelScalabledatascienceanddeeplearningwithh2o_gotochg
 
Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscape
 
Par com
Par comPar com
Par com
 
Introduction to Software Defined Visualization (SDVis)
Introduction to Software Defined Visualization (SDVis)Introduction to Software Defined Visualization (SDVis)
Introduction to Software Defined Visualization (SDVis)
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing Clusters
 
Accelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningAccelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learning
 

Mais de inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 

Mais de inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Último

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 

Último (20)

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 

Data-Centric Parallel Programming

  • 1. spcl.inf.ethz.ch @spcl_eth Data-Centric Parallel Programming Torsten Hoefler, Keynote at AsHES @ IPDPS’19, Rio, Brazil Alexandros Ziogas, Tal Ben-Nun, Guillermo Indalecio, Timo Schneider, Mathieu Luisier, and Johannes de Fine Licht and the whole DAPP team @ SPCL https://eurompi19.inf.ethz.ch
  • 2. spcl.inf.ethz.ch @spcl_eth 2 Changing hardware constraints and the physics of computing [1]: Marc Horowitz, Computing’s Energy Problem (and what we can do about it), ISSC 2014, plenary [2]: Moore: Landauer Limit Demonstrated, IEEE Spectrum 2012 130nm 90nm 65nm 45nm 32nm 22nm 14nm 10nm 0.9 V [1] 32-bit FP ADD: 0.9 pJ 32-bit FP MUL: 3.2 pJ 2x32 bit from L1 (8 kiB): 10 pJ 2x32 bit from L2 (1 MiB): 100 pJ 2x32 bit from DRAM: 1.3 nJ … Three Ls of modern computing: How to address locality challenges on standard architectures and programming? D. Unat et al.: “Trends in Data Locality Abstractions for HPC Systems” IEEE Transactions on Parallel and Distributed Systems (TPDS). Vol 28, Nr. 10, IEEE, Oct. 2017
  • 3. spcl.inf.ethz.ch @spcl_eth 3 Control in Load-store vs. Dataflow Memory Cache RegistersControl x=a+b ld a, r1 ALU ald b, r2 badd r1, r2 ba x bast r1, x Memory + c d y y=(a+b)*(c+d) a b + x a b c d a+b c+d y Turing Award 1977 (Backus): "Surely there must be a less primitive way of making big changes in the store than pushing vast numbers of words back and forth through the von Neumann bottleneck." Load-store (“von Neumann”) Energy per instruction: 70pJ Source: Mark Horowitz, ISSC’14 Energy per operation: 1-3pJ Static Dataflow (“non von Neumann”) Very Low High
  • 4. spcl.inf.ethz.ch @spcl_eth Still Lower Control Locality 4 Single Instruction Multiple Data/Threads (SIMD - Vector CPU, SIMT - GPU) Memory Cache RegistersControl ALUALU ALUALU ALUALU ALUALU ALUALU 45nm, 0.9 V [1] Random Access SRAM: 8 kiB: 10 pJ 32 kiB: 20 pJ 1 MiB: 100 pJ Memory + c d ya b + x a b c d 45nm, 0.9 V [1] Single R/W registers: 32 bit: 0.1 pJ [1]: Marc Horowitz, Computing’s Energy Problem (and what we can do about it), ISSC 2014, plenary High Performance Computing really became a data management challenge
  • 5. spcl.inf.ethz.ch @spcl_eth 5 Data movement will dominate everything! Source: Fatollahi-Fard et al.  “In future microprocessors, the energy expended for data movement will have a critical effect on achievable performance.”  “… movement consumes almost 58 watts with hardly any energy budget left for computation.”  “…the cost of data movement starts to dominate.”  “…data movement over these networks must be limited to conserve energy…”  the phrase “data movement” appears 18 times on 11 pages (usually in concerning contexts)!  “Efficient data orchestration will increasingly be critical, evolving to more efficient memory hierarchies and new types of interconnect tailored for locality and that depend on sophisticated software to place computation and data so as to minimize data movement.” Source: NVIDIA Source: Kogge, Shalf
  • 6. spcl.inf.ethz.ch @spcl_eth  Well, to a good approximation how we programmed yesterday  Or last year?  Or four decades ago?  Control-centric programming  Worry about operation counts (flop/s is the metric, isn’t it?)  Data movement is at best implicit (or invisible/ignored)  Legion [1] is taking a good direction towards data-centric  Tasking relies on data placement but not really dependencies (not visible to tool-chain)  But it is still control-centric in the tasks – not (performance) portable between devices!  Let’s go a step further towards an explicitly data-centric viewpoint  For performance engineers at least! 6 “Sophisticated software”: How do we program today? Backus ‘77: “The assignment statement is the von Neumann bottleneck of programming languages and keeps us thinking in word-at-a-time terms in much the same way the computer’s bottleneck does.” [1]: Bauer et al.: “Legion: expressing locality and independence with logical regions”, SC12, 2012
  • 7. spcl.inf.ethz.ch @spcl_eth 7 Performance Portability with DataCentric (DaCe) Parallel Programming Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs SystemDomain Scientist Performance Engineer High-Level Program Data-Centric Intermediate Representation (SDFG, §3) 𝜕𝑢 𝜕𝑡 − 𝛼𝛻2 𝑢 = 0 Problem Formulation FPGA Modules CPU Binary Runtime Hardware Information Graph Transformations (API, Interactive, §4) SDFG Compiler Transformed Dataflow Performance Results Thin Runtime Infrastructure GPU Binary Python / NumPy 𝑳 𝑹 * * * * * * TensorFlow DSLs MATLAB SDFG Builder API
  • 8. spcl.inf.ethz.ch @spcl_eth 8 DAPP – Data Centric Programming Concepts • Store volatile (buffers, queues, RAM) and nonvolatile (files, I/O) information • Can be sources or sinks of data • Stateless functions that perform computations at any granularity • Data access only through ports • Data flowing between containers and tasklets/ports • Implemented as access, copies, streaming, … • Map scopes provide parallelism • States constrain parallelism outside of datatflow Data Containers Computation Data Movement / Dependencies Parallelism and States
  • 10. spcl.inf.ethz.ch @spcl_eth DIODE User Interface 10 Source Code Transformations SDFG (malleable) SDFGGenerated Code Performance Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
  • 11. spcl.inf.ethz.ch @spcl_eth 11 Performance for matrix multiplication on x86 SDFG Naïve Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
  • 12. spcl.inf.ethz.ch @spcl_eth 12 Performance for matrix multiplication on x86 SDFG MapReduceFusionNaïve Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
  • 13. spcl.inf.ethz.ch @spcl_eth 13 Performance for matrix multiplication on x86 SDFG LoopReorder MapReduceFusionNaïve Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
  • 14. spcl.inf.ethz.ch @spcl_eth 14 Performance for matrix multiplication on x86 SDFG BlockTiling LoopReorder MapReduceFusionNaïve Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
  • 15. spcl.inf.ethz.ch @spcl_eth 15 Performance for matrix multiplication on x86 RegisterTiling BlockTiling LoopReorder MapReduceFusionNaïve Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
  • 16. spcl.inf.ethz.ch @spcl_eth 16 Performance for matrix multiplication on x86 LocalStorage RegisterTiling BlockTiling LoopReorder MapReduceFusionNaïve Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
  • 17. spcl.inf.ethz.ch @spcl_eth 17 Performance for matrix multiplication on x86 PromoteTransient LocalStorage RegisterTiling BlockTiling LoopReorder MapReduceFusionNaïve Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
  • 18. spcl.inf.ethz.ch @spcl_eth Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs 18 Performance for matrix multiplication on x86 Intel MKL OpenBLAS 25% difference DAPP With more tuning: 98.6% of MKLBut do we really care about MMM on x86 CPUs?
  • 19. spcl.inf.ethz.ch @spcl_eth Hardware Mapping: Load/Store Architectures  Recursive code generation (C++, CUDA)  Control flow: Construct detection and gotos  Parallelism  Multi-core CPU: OpenMP, atomics, and threads  GPU: CUDA kernels and streams  Connected components run concurrently  Memory and interaction with accelerators  Array-array edges create intra-/inter-device copies  Memory access validation on compilation  Automatic CPU SDFG to GPU transformation  Tasklet code immutable 19
  • 20. spcl.inf.ethz.ch @spcl_eth Hardware Mapping: Pipelined Architectures  Module generation with HDL and HLS  Integration with Xilinx SDAccel  Nested SDFGs become FPGA state machines  Parallelism  Exploiting temporal locality: Pipelines  Exploiting spatial locality: Vectorization, replication  Replication  Enables parametric systolic array generation  Memory access  Burst memory access, vectorization  Streams for inter-PE communication 20
  • 21. spcl.inf.ethz.ch @spcl_eth Performance (Portability) Evaluation  Three platforms:  Intel Xeon E5-2650 v4 CPU (2.20 GHz, no HT)  Tesla P100 GPU  Xilinx VCU1525 hosting an XCVU9P FPGA  Compilers and frameworks:  Compilers: GCC 8.2.0 Clang 6.0 icc 18.0.3  Polyhedral optimizing compilers: Polly 6.0 Pluto 0.11.4 PPCG 0.8  GPU and FPGA compilers: CUDA nvcc 9.2 Xilinx SDAccel 2018.2  Frameworks and optimized libraries: HPX Halide Intel MKL NVIDIA CUBLAS, CUSPARSE, CUTLASS NVIDIA CUB 21
  • 22. spcl.inf.ethz.ch @spcl_eth Performance Evaluation: Fundamental Kernels (CPU)  Database Query: roughly 50% of a 67,108,864 column  Matrix Multiplication (MM): 2048x2048x2048  Histogram: 8192x8192  Jacobi stencil: 2048x2048 for T=1024  Sparse Matrix-Vector Multiplication (SpMV): 8192x8192 CSR matrix (nnz=33,554,432) 22 99.9% of MKL8.12x faster 98.6% of MKL 2.5x faster 82.7% of Halide
  • 23. spcl.inf.ethz.ch @spcl_eth Performance Evaluation: Fundamental Kernels (GPU, FPGA) 23 GPU FPGA 309,000x 19.5x of Spatial 90% of CUTLASS
  • 24. spcl.inf.ethz.ch @spcl_eth Performance Evaluation: Polybench (CPU)  Polyhedral benchmark with 30 applications  Without any transformations, achieves 1.43x (geometric mean) over general-purpose compilers 24
  • 25. spcl.inf.ethz.ch @spcl_eth Performance Evaluation: Polybench (GPU, FPGA)  Automatically transformed from CPU code 25 GPU (1.12x geomean speedup) FPGA The first full set of placed-and-routed Polybench 11.8x
  • 26. spcl.inf.ethz.ch @spcl_eth Case Study: Parallel Breadth-First Search  Compared with Galois and Gluon  State-of-the-art graph processing frameworks on CPU  Graphs:  Road maps: USA, OSM-Europe  Social networks: Twitter, LiveJournal  Synthetic: Kronecker Graphs 26 Performance portability – fine, but who cares about microbenchmarks?
  • 27. spcl.inf.ethz.ch @spcl_eth 27 Remember the promise of DAPP – on to a real application! Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs SystemDomain Scientist Performance Engineer High-Level Program Data-Centric Intermediate Representation (SDFG, §3) 𝜕𝑢 𝜕𝑡 − 𝛼𝛻2 𝑢 = 0 Problem Formulation FPGA Modules CPU Binary Runtime Hardware Information Graph Transformations (API, Interactive, §4) SDFG Compiler Transformed Dataflow Performance Results Thin Runtime Infrastructure GPU Binary Python / NumPy 𝑳 𝑹 * * * * * * TensorFlow DSLs MATLAB SDFG Builder API
  • 28. spcl.inf.ethz.ch @spcl_eth 28 Next-Generation Transistors need to be cooler – addressing self-heating
  • 29. spcl.inf.ethz.ch @spcl_eth  OMEN Code (Luisier et al., Gordon Bell award finalist 2011 and 2015)  90k SLOC, C, C++, CUDA, MPI, OpenMP, … 29 Quantum Transport Simulations with OMEN Electrons 𝑮 𝑬, 𝒌 𝒛 Phonons 𝑫 𝝎, 𝒒 𝒛 GF SSE SSE Σ 𝐺 𝐸 + ℏ𝜔, 𝑘 𝑧 − 𝑞 𝑧 𝐷 𝜔, 𝑞 𝑧 𝐸, 𝑘 𝑧 Π 𝐺 𝐸, 𝑘 𝑧 𝐺 𝐸 + ℏ𝜔, 𝑘 𝑧 + 𝑞 𝑧 𝜔, 𝑞 𝑧 𝐸 ⋅ 𝑆 − 𝐻 − Σ 𝑅 ⋅ 𝐺 𝑅 = 𝐼 𝐺< = 𝐺 𝑅 ⋅ Σ< ⋅ 𝐺 𝐴 𝜔2 − Φ − Π 𝑅 ⋅ 𝐷 𝑅 = 𝐼 𝐷< = 𝐷 𝑅 ⋅ Π< ⋅ 𝐷 𝐴 NEGF
  • 30. spcl.inf.ethz.ch @spcl_eth 30 All of OMEN (90k SLOC) in a single SDFG – (collapsed) tasklets contain more SDFGs 𝐻 𝑘 𝑧, 𝐸 RGF Σ≷ convergence 𝐺≷ Φ 𝑞 𝑧, 𝜔 RGF Π≷ 𝐷≷ 𝑏 𝛻𝐻 𝑘 𝑧, 𝐸, 𝑞 𝑧, 𝜔, 𝑎, 𝑏 SSE Π≷ G≷ Σ≷ D≷ Not 𝑏 𝑏 GF SSE 𝑖++𝑖=0 𝑞 𝑧, 𝜔𝑘 𝑧, 𝐸 𝐻[0:𝑁𝑘 𝑧 ] Φ[0:𝑁𝑞 𝑧 ]Σ≷ [0:𝑁𝑘 𝑧 ,0:𝑁𝐸] 𝐼𝑒 𝐼 𝜙 Π≷ [0:𝑁𝑞 𝑧 , 1:𝑁 𝜔] 𝐻[𝑘 𝑧] Φ[𝑞 𝑧]Σ≷ [𝑘 𝑧,E] Π≷ [𝑞 𝑧,𝜔] 𝐺≷ [𝑘 𝑧,E] 𝐷≷ [𝑞 𝑧,𝜔]𝐼Φ (CR: Sum) 𝐼Φ (CR: Sum)𝐼e (CR: Sum) 𝐼e (CR: Sum) 𝐷≷ [0:N 𝑞 𝑧 , 1:N 𝜔] G≷ [0:𝑁𝑘 𝑧 ,0:𝑁𝐸] 𝛻𝐻 G≷ D≷ Π≷ (CR: Sum)Σ≷ (CR: Sum) Σ≷ […] (CR: Sum) Π≷ […] (CR: Sum) 𝛻𝐻[…] G≷ […] D≷ […] 𝑘 𝑧, 𝐸, 𝑞 𝑧, 𝜔, 𝑎, 𝑏 𝐼e 𝐼Φ 𝑏
  • 31. spcl.inf.ethz.ch @spcl_eth 31 Zooming into SSE (large share of the runtime) DaCe Transform Between 100-250x less communication at scale! (from PB to TB)
  • 32. spcl.inf.ethz.ch @spcl_eth 32 Additional interesting performance insights Python is slow! Ok, we knew that – but compiled can be fast! Piz Daint single node (P100) cuBLAS can be very inefficient (well, unless you floptimize) Basic operation in SSE (many very small MMMs) 5k atoms Piz Daint Summit
  • 33. spcl.inf.ethz.ch @spcl_eth 33 10,240 atoms on 27,360 V100 GPUs (full-scale Summit) - 56 Pflop/s with I/O (28% peak) Already ~100x speedup on 25% of Summit – the original OMEN does not scale further! Communication time reduced by 417x on Piz Daint! Volume on full-scale Summit from 12 PB/iter  87 TB/iter
  • 34. spcl.inf.ethz.ch @spcl_eth 34 An example of fine-grained data-centric optimization (i.e., how to vectorize) Preprint (arXiv): Ben-Nun, de Fine Licht, Ziogas, TH: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
  • 35. spcl.inf.ethz.ch @spcl_eth 35 Overview and wrap-up This project has received funding from the European Research Council (ERC) under grant agreement "DAPP (PI: T. Hoefler)".

Notas do Editor

  1. JOHANNES: you said something about that scaling stops at 0nm which didn’t make a lot of sense. It’s like you also said, the scaling stops when you hit the size of individual molecules  Mark Horowitz talk: https://www.youtube.com/watch?v=7gGnRGko3go 45nm, 0.9V  32bit FADD: 0.9pJ, FMUL: 3.7pJ, 2 operand (64bit) load from L1 (8kiB): 10pJ, from L2 (1MiB): 100pJ, from DRAM: 1.3-2.6nJ [1] FP computation approaches Landauer limit (3e-21 J) [2] Assuming all bits are switched in 32 bit FP, ~0.2aJ (atto Joules = 10e-6pJ) Electrical data movement suffers from resistance Heat loss: ~ resistancy * length / cross-section-area  process shrinkage benefits computation energy but increases communication energy
  2. - Add explanation for Control Locality GPUs and CPUs reduce control overhead with SIMD/SIMT – if it’s 10x wide than we’re again op-dominated BUT memory problem remains, so we need large inefficient (due to addressing) caches, not so in Dataflow
  3. GPUs and CPUs reduce control overhead with SIMD/SIMT – if it’s 10x wide than we’re again op-dominated BUT memory problem remains, so we need large inefficient (due to addressing) caches, not so in Dataflow SIMD/SIMT is also less flexible --- but of course more efficient due to better silicon optimization
  4. IDE Plugin?
  5. Vector types for all architectures Weakly connected components
  6. Weakly connected components
  7. Thorough
  8. Keep in mind that PPCG is a specialized tool that does this transformation only for polyhedral applications.
  9. Highly irregular Around 3 transformations IIRC We have more applications in deep learning and more
  10. - Needs better communication models!!!
  11. - Needs better communication models!!!