www.anl.gov
Argonne Leadership Computing Facility
IWOCL, Apr. 28, 2020
Preparing to Program Aurora at
Exascale
Hal Finkel, et al.
Scientific
Supercomputing
Computing for large, tightly-coupled problems.
Lots of computational
capability paired with
lots of high-performance memory. High computational density paired with a
high-throughput low-latency network.
What is (traditional) supercomputing?
https://www.alcf.anl.gov/files/alcfscibro2015.pdf
Many Scientific Domains
http://crd.lbl.gov/assets/pubs_presos/CDS/ATG/WassermanSOTON.pdf
Common Algorithm Classes in HPC
http://crd.lbl.gov/assets/pubs_presos/CDS/ATG/WassermanSOTON.pdf
Common Algorithm Classes in HPC
Upcoming Hardware
http://www.nextplatform.com/2015/11/30/inside-future-knights-landing-xeon-phi-systems/
https://forum.beyond3d.com/threads/nvidia-pascal-speculation-thread.55552/page-4
“Many Core” CPUs
GPUs
All of our upcoming systems use GPUs!
Toward The Future of Supercomputing
9
(https://science.osti.gov/-/media/ascr/ascac/pdf/meetings/201909/20190923_ASCAC-Helland-Barbara-Helland.pdf)
Upcoming Systems
10
Aurora: A High-level View
 Intel-Cray machine arriving at Argonne in 2021
 Sustained Performance > 1 Exaflop
 Intel Xeon processors and Intel Xe GPUs
 2 Xeons (Sapphire Rapids)
 6 GPUs (Ponte Vecchio [PVC])
 Greater than 10 PB of total memory
 Cray Slingshot fabric and Shasta platform
 Filesystem
 Distributed Asynchronous Object Store (DAOS)
 ≥ 230 PB of storage capacity
 Bandwidth of > 25 TB/s
 Lustre
 150 PB of storage capacity
 Bandwidth of ~1 TB/s
11
Aurora Compute Node
2 Intel Xeon (Sapphire Rapids) processors
6 Xe Architecture based GPUs (Ponte Vecchio)
 All-to-all connection
 Low latency and high bandwidth
8 Slingshot Fabric endpoints
Unified Memory Architecture
across CPUs and GPUs
Unified Memory and
GPU ↔ GPU connectivity…
Important implications for the
programming model!
Programming Models
(for Aurora)
13
Three Pillars
Simulation: Directives; Parallel Runtimes; Solver Libraries; HPC Languages
Data: Big Data Stack; Statistical Libraries; Productivity Languages; Databases
Learning: DL Frameworks; Linear Algebra Libraries; Statistical Libraries; Productivity Languages
Shared foundation beneath all three pillars:
Math Libraries, C++ Standard Library, libc
I/O, Messaging
Scheduler
Linux Kernel, POSIX
Compilers, Performance Tools, Debuggers
Containers, Visualization
14
MPI on Aurora
• Intel MPI & Cray MPI
• MPI 3.0 standard compliant
• The MPI library will be thread safe
• Allows applications to use MPI from individual threads
• Efficient MPI_THREAD_MULTIPLE (locking optimizations)
• Asynchronous progress in all types of nonblocking communication
• Nonblocking send-receive and collectives
• One-sided operations
• Hardware- and topology-optimized collective implementations
• Supports the MPI tools interface
• Control variables
[Stack diagram, repeated for each MPI: MPICH → CH4 → OFI (libfabric, Slingshot provider) → Hardware]
A minimal thread-support sketch follows below.
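To make the thread-safety bullets concrete, here is a minimal sketch (not from the original slides) of requesting MPI_THREAD_MULTIPLE so that individual threads may call MPI:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  int provided = 0;
  // Request full multi-threaded support; the library reports the
  // thread level it actually provides.
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  if (provided < MPI_THREAD_MULTIPLE)
    std::printf("MPI_THREAD_MULTIPLE unavailable (provided=%d)\n", provided);
  // With MPI_THREAD_MULTIPLE, individual threads (e.g., in an OpenMP
  // parallel region) may issue nonblocking MPI calls concurrently.
  MPI_Finalize();
  return 0;
}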
15
Intel Fortran for Aurora
Fortran 2008
OpenMP 5
A significant amount of the code run on present-day
machines is written in Fortran.
Most new code development seems to have shifted to
other languages (mainly C++).
16
oneAPI
Industry specification from Intel
(https://www.oneapi.com/spec/)
 Language and libraries to target programming across
diverse architectures (DPC++, APIs, low-level interface)
Intel oneAPI products and toolkits
(https://software.intel.com/ONEAPI)
 Implementations of the oneAPI specification and
analysis and debug tools to help programming
17
Intel MKL – Math Kernel Library
Highly tuned algorithms
 FFT
 Linear algebra (BLAS, LAPACK)
 Sparse solvers
 Statistical functions
 Vector math
 Random number generators
Optimized for every Intel platform
oneAPI MKL (oneMKL)
 https://software.intel.com/en-us/oneapi/mkl
oneAPI beta includes
DPC++ support
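As a hedged sketch (not from the slides) of what the DPC++ support looks like, the oneMKL specification defines buffer-based BLAS entry points such as gemm; the header and namespace below follow the specification and may differ across beta releases:

#include <CL/sycl.hpp>
#include <oneapi/mkl.hpp>
#include <cstdint>

// C = A * B for n x n column-major matrices held in SYCL buffers.
void gemm_example(sycl::queue &q,
                  sycl::buffer<float, 1> &a,
                  sycl::buffer<float, 1> &b,
                  sycl::buffer<float, 1> &c,
                  std::int64_t n) {
  auto nt = oneapi::mkl::transpose::nontrans;
  oneapi::mkl::blas::gemm(q, nt, nt, n, n, n,
                          1.0f, a, n,   // alpha, A, lda
                          b, n,         // B, ldb
                          0.0f, c, n);  // beta, C, ldc
}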
18
AI and Analytics
Libraries to support AI and Analytics
 oneAPI Deep Neural Network Library (oneDNN)
 High-performance primitives to accelerate deep learning frameworks
 Powers TensorFlow, PyTorch, MXNet, Intel Caffe, and more
 Running on Gen9 today (via OpenCL)
 oneAPI Data Analytics Library (oneDAL)
 Classical machine learning algorithms
 Easy-to-use one-line daal4py Python interfaces
 Powers Scikit-Learn
 Apache Spark MLlib
19
Heterogeneous System Programming Models
Applications will be using a variety of programming models for Exascale:
 CUDA
 OpenCL
 HIP
 OpenACC
 OpenMP
 DPC++/SYCL
 Kokkos
 Raja
Not all systems will support all models.
Libraries may help you abstract some programming models.
20
OpenMP 5
OpenMP 5 constructs will provide a directives-based programming model for Intel GPUs
Available for C, C++, and Fortran
A portable model expected to be supported on a variety of platforms (Aurora, Frontier,
Perlmutter, …)
Optimized for Aurora
For Aurora, OpenACC codes could be converted into OpenMP
 ALCF staff will assist with conversion, training, and best practices
 Automated translation possible through the clacc conversion tool (for C/C++)
https://www.openmp.org/
21
OpenMP 4.5/5: for Aurora
The OpenMP 4.5/5 specifications have significant updates to allow for improved support of
accelerator devices (* denotes OpenMP 5).

Offloading code to run on an accelerator:
#pragma omp target [clause[[,] clause],…]
structured-block
#pragma omp declare target
declarations-definition-seq
#pragma omp declare variant* (variant-func-id) clause new-line
function definition or declaration

Distributing iterations of the loop to threads:
#pragma omp teams [clause[[,] clause],…]
structured-block
#pragma omp distribute [clause[[,] clause],…]
for-loops
#pragma omp loop* [clause[[,] clause],…]
for-loops

Controlling data transfer between devices:
map([map-type:] list)
map-type := alloc | tofrom | from | to | …
#pragma omp target data [clause[[,] clause],…]
structured-block
#pragma omp target update [clause[[,] clause],…]

Environment variables:
• Control the default device through OMP_DEFAULT_DEVICE
• Control offload with OMP_TARGET_OFFLOAD

Runtime support routines:
• void omp_set_default_device(int dev_num)
• int omp_get_default_device(void)
• int omp_get_num_devices(void)
• int omp_get_num_teams(void)

A short sketch combining the data-transfer directives follows below.
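A small illustrative sketch (not from the original slides) combining the directives above: the target data region keeps the array resident on the device across two kernels, and target update refreshes the host copy in between.

void scale_twice(double *a, int n, double s) {
  #pragma omp target data map(tofrom: a[0:n])
  {
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; ++i)
      a[i] *= s;

    // Copy intermediate results back to the host without ending the region.
    #pragma omp target update from(a[0:n])

    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; ++i)
      a[i] *= s;
  }
}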
22
DPC++ (Data Parallel C++) and SYCL
SYCL
 Khronos standard specification
 SYCL is a C++ based abstraction layer (standard C++11)
 Builds on OpenCL concepts (but single-source)
 SYCL is designed to be as close to standard C++ as
possible
Current implementations of SYCL:
 ComputeCpp™ (www.codeplay.com)
 Intel SYCL (github.com/intel/llvm)
 triSYCL (github.com/triSYCL/triSYCL)
 hipSYCL (github.com/illuhad/hipSYCL)
 Runs on today’s CPUs and NVIDIA, AMD, and Intel GPUs
SYCL 1.2.1 or later
C++11 or later
23
DPC++ (Data Parallel C++) and SYCL
SYCL
 Khronos standard specification
 SYCL is a C++ based abstraction layer (standard C++11)
 Builds on OpenCL concepts (but single-source)
 SYCL is designed to be as close to standard C++ as
possible
Current implementations of SYCL:
 ComputeCpp™ (www.codeplay.com)
 Intel SYCL (github.com/intel/llvm)
 triSYCL (github.com/triSYCL/triSYCL)
 hipSYCL (github.com/illuhad/hipSYCL)
 Runs on today’s CPUs and NVIDIA, AMD, and Intel GPUs
DPC++
 Part of the Intel oneAPI specification
 Intel extension of SYCL to support new innovative features
 Incorporates the SYCL 1.2.1 specification and Unified Shared
Memory
 Adds language or runtime extensions as needed to meet
user needs
Intel DPC++ = SYCL 1.2.1 or later + C++11 or later + extensions:

Extension: Description
• Unified Shared Memory (USM): defines pointer-based memory accesses and management interfaces.
• In-order queues: defines simple in-order semantics for queues, to simplify common coding patterns.
• Reduction: provides a reduction abstraction for the ND-range form of parallel_for.
• Optional lambda name: removes the requirement to manually name lambdas that define kernels.
• Subgroups: defines a grouping of work-items within a work-group.
• Data flow pipes: enables efficient first-in, first-out (FIFO) communication (FPGA-only).

https://spec.oneapi.com/oneAPI/Elements/dpcpp/dpcpp_root.html#extensions-table
(A USM sketch follows below.)
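A minimal sketch of the USM extension listed above, assuming a DPC++ compiler with the 2020-era malloc_shared/free interfaces (exact namespaces varied across beta releases):

#include <CL/sycl.hpp>
using namespace sycl;

int main() {
  queue q;                               // default device
  constexpr size_t N = 1024;
  int *data = malloc_shared<int>(N, q);  // visible to both host and device

  // Pointer-based access in the kernel: no buffers or accessors needed.
  q.parallel_for(range<1>{N}, [=](id<1> i) {
    data[i] = static_cast<int>(i[0]);
  }).wait();

  int last = data[N - 1];                // host reads through the same pointer
  free(data, q);
  return last == N - 1 ? 0 : 1;
}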
24
OpenMP 5
[Diagram: host-to-device data transfer and execution control]
extern void init(float*, float*, int);
extern void output(float*, int);
void vec_mult(float*p, float*v1, float*v2, int N)
{
  int i;
  init(v1, v2, N);
  #pragma omp target teams distribute parallel for simd map(to: v1[0:N], v2[0:N]) map(from: p[0:N])
  for (i=0; i<N; i++)
  {
      p[i] = v1[i]*v2[i];
  }
  output(p, N);
}
Creates teams of threads on the target device.
Distributes iterations to the threads, where each thread uses SIMD parallelism.
Controlling data transfer.
25
void foo( int *A ) {
default_selector selector; // Selectors determine which device to dispatch to.
{
queue myQueue(selector); // Create queue to submit work to, based on selector
// Wrap data in a sycl::buffer
buffer<cl_int, 1> bufferA(A, 1024);
myQueue.submit([&](handler& cgh) {
//Create an accessor for the sycl buffer.
auto writeResult = bufferA.get_access<access::mode::discard_write>(cgh);
// Kernel
cgh.parallel_for<class hello_world>(range<1>{1024}, [=](id<1> idx) {
writeResult[idx] = idx[0];
}); // End of the kernel function
}); // End of the queue commands
} // End of scope, wait for the queued work to stop.
...
}
Get a device
SYCL buffer using host pointer
Kernel
Queue out of scope
Data accessor
SYCL Examples
[Diagram: host-to-device data transfer and execution control]
Performance Portability
A performance-portable application...
1) Is Portable
2) Runs on diverse architectures with reasonable performance
Performance Portability
Science Problem
Choose Algorithms
Optimize Algorithms
Knowledge of
System Architecture
and Tools
Run high-performance code!
Implement and Test Algorithms
The Development Workflow?
Science Problem
Choose Algorithms
Optimize Algorithms
Knowledge of
System Architecture
and Tools
Run high-performance code!
Implement and Test Algorithms
The Development Workflow?
Science Problem
Choose Algorithms
For the Target Architectures
Optimize Algorithms Knowledge of
System Architecture
and Tools
Run high-performance code!
Implement and Test Algorithms
Trade-offs between:
● Basis functions
● Resolution
● Lagrangian vs. Eulerian representations
● Renormalization and regularization schemes
● Solver techniques
● Evolved vs. computed degrees of freedom
● And more…
Cannot be made by a compiler!
Real Workflow...
Does this mean that performance portability is impossible?
No, but it does mean that performance-portable
applications tend to be highly parameterizable.
Performance Portability is Possible!
http://llvm-hpc2-workshop.github.io/slides/Tian.pdf
In 2015, many codes used OpenMP
directly to express parallelism.
A minority of applications use
abstraction libraries
(TBB and Thrust on this chart)
On the Usage of Abstract Models
But this is changing…
● We're seeing even greater adoption of OpenMP, but…
● Many applications are not using OpenMP directly.
Abstraction libraries are gaining in popularity:
● Well-established libraries such as TBB and Thrust.
● RAJA (https://github.com/LLNL/RAJA)
● Kokkos (https://github.com/kokkos)
These make use of C++ lambdas, and can use OpenMP and/or other compiler directives
under the hood, but will probably use DPC++/HIP/CUDA.
On the Usage of Abstract Models
And starting with C++17, the standard library has parallel algorithms
too...
// For example:
std::sort(std::execution::par_unseq, vec.begin(), vec.end()); // parallel and vectorized
On the Usage of Abstract Models
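A self-contained version of the one-liner above; it assumes a standard library with C++17 parallel algorithms (e.g., libstdc++ built against TBB, or MSVC):

#include <algorithm>
#include <execution>
#include <random>
#include <vector>

int main() {
  std::vector<double> vec(1 << 20);
  std::mt19937 gen(42);
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  for (auto &x : vec) x = dist(gen);

  // Parallel and vectorized sort, as on the slide.
  std::sort(std::execution::par_unseq, vec.begin(), vec.end());
}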
Why can't programmers just write the code optimally?
● Because what is optimal is different on different architectures.
● Because programmers use abstraction layers and may not be able to write the optimal code directly:
in library1:
void foo() {
std::for_each(std::execution::par_unseq, vec1.begin(), vec1.end(), ...);
}
in library2:
void bar() {
std::for_each(std::execution::par_unseq, vec2.begin(), vec2.end(), ...);
}
foo(); bar();
Compiler Optimizations for Parallel Code...
void foo(double * restrict a, double * restrict b, etc.) {
#pragma omp parallel for
for (i = 0; i < n; ++i) {
a[i] = e[i]*(b[i]*c[i] + d[i]) + f[i];
m[i] = q[i]*(n[i]*o[i] + p[i]) + r[i];
}
}
void foo(double * restrict a, double * restrict b, etc.) {
#pragma omp parallel for
for (i = 0; i < n; ++i) {
a[i] = e[i]*(b[i]*c[i] + d[i]) + f[i];
}
#pragma omp parallel for
for (i = 0; i < n; ++i) {
m[i] = q[i]*(n[i]*o[i] + p[i]) + r[i];
}
}
Split the loop? Or should we fuse instead?
Compiler Optimizations for Parallel Code...
void foo(double * restrict a, double * restrict b, etc.) {
#pragma omp parallel for
for (i = 0; i < n; ++i) {
a[i] = e[i]*(b[i]*c[i] + d[i]) + f[i];
}
#pragma omp parallel for
for (i = 0; i < n; ++i) {
m[i] = q[i]*(n[i]*o[i] + p[i]) + r[i];
}
}
void foo(double * restrict a, double * restrict b, etc.) {
#pragma omp parallel
{
#pragma omp for
for (i = 0; i < n; ++i) {
a[i] = e[i]*(b[i]*c[i] + d[i]) + f[i];
}
#pragma omp for
for (i = 0; i < n; ++i) {
m[i] = q[i]*(n[i]*o[i] + p[i]) + r[i];
}
}
}
(we might want to fuse
the parallel regions)
Compiler Optimizations for Parallel Code...
Rodinia - hotspot3D
./3D 512 8 100 ../data/hotspot3D/power_512x8 ../data/hotspot3D/temp_512x8
Intel Core i9, 10 cores, 20 threads, 51 runs, with and without:
● aa => alias attribute propagation
● ap => argument privatization
Compiler Understanding Parallelism: It Can Help
(Work by Johannes Doerfert; see our IWOMP 2018 paper.)
[Chart: runtimes of the base version vs. versions with aa/ap enabled]
The compiler understands parallelism well enough
to get better pointer-aliasing results.
It is really hard for compilers to change memory layouts and generally determine
what memory is needed where. The Kokkos C++ library has memory placement and layout policies:
View<const double ***, Layout, Space, MemoryTraits<RandomAccess>> name(...);
https://trilinos.org/oldsite/events/trilinos_user_group_2013/presentations/2013-11-TUG-Kokkos-Tutorial.pdf
Constant random-access data might be put into
texture memory on a GPU, for example.
Using the right memory layout
and placement helps a lot!
Memory Layout and Placement
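A hedged sketch (not from the slides) of the Kokkos policies above; names follow the public Kokkos API, and the const RandomAccess view may map to texture/read-only caches on GPUs, as the slide notes:

#include <Kokkos_Core.hpp>

int main(int argc, char *argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Layout and memory space default to the execution space's preferences.
    Kokkos::View<double ***> a("a", 64, 64, 64);

    // Const view with the RandomAccess trait over the same data.
    Kokkos::View<const double ***,
                 Kokkos::MemoryTraits<Kokkos::RandomAccess>> a_ro = a;

    Kokkos::parallel_for("init", a.extent(0), KOKKOS_LAMBDA(const int i) {
      a(i, 0, 0) = static_cast<double>(i);
    });

    double sum = 0.0;
    Kokkos::parallel_reduce("sum", a.extent(0),
        KOKKOS_LAMBDA(const int i, double &acc) { acc += a_ro(i, 0, 0); },
        sum);
  }
  Kokkos::finalize();
  return 0;
}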
As you might imagine, nothing is perfect yet...
So Where Does This Leave Us?
Language:
• OpenMP: simple directives have yielded to complicated directives.
• DPC++: modern C++; simple cases will become simpler over time.
• Kokkos / RAJA: modern C++.

Default Execution Model:
• OpenMP: fork-join.
• DPC++: work queue (probably better for expressing scalable parallelism).
• Kokkos / RAJA: fork-join.

Compiler Optimization Potential:
• OpenMP: high.
• DPC++: low (dynamic work queue obscures structure).
• Kokkos / RAJA: medium (greatly depends on underlying backend).
As you might imagine, nothing is perfect yet...
So Where Does This Leave Us?
Integrate With Highly-Parameterized Code:
• OpenMP: low / medium.
• DPC++: high.
• Kokkos / RAJA: high.

Helps With Data Layout:
• OpenMP: no.
• DPC++: no (not yet).
• Kokkos / RAJA: yes.

Good Accelerator-to-Accelerator Transfer / Dispatch:
• OpenMP: no (not yet).
• DPC++: no (not yet).
• Kokkos / RAJA: no (not yet).
Conclusions
● Future supercomputers will continue to advance scientific progress in a variety of domains.
● Applications will rely on high-performance libraries as well as parallel-programming models.
● DPC++/SYCL will be a critical programming model on future HPC platforms.
● We will continue to understand the extent to which compiler optimizations assist the
development of portably-performant applications vs. the ability to explicitly parameterize and
dynamically compose the implementations of algorithms.
● Parallel programming models will continue to evolve: support for data layouts and less-host-
centric models will be explored.
44
Acknowledgements
Argonne Leadership Computing Facility and Computational Science
Division Staff
This research was supported by the Exascale Computing Project (17-SC-
20-SC), a collaborative effort of two U.S. Department of Energy
organizations (Office of Science and the National Nuclear Security
Administration) responsible for the planning and preparation of a capable
exascale ecosystem, including software, applications, hardware,
advanced system engineering and early testbed platforms, in support of
the nation’s exascale computing imperative.
This research used resources of the Argonne Leadership Computing
Facility, which is a DOE Office of Science User Facility supported under
Contract DE-AC02-06CH11357.
Thank You

Mais conteúdo relacionado

Mais procurados

Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialGanesan Narayanasamy
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
InfiniBand In-Network Computing Technology and Roadmap
InfiniBand In-Network Computing Technology and RoadmapInfiniBand In-Network Computing Technology and Roadmap
InfiniBand In-Network Computing Technology and Roadmapinside-BigData.com
 
IBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWERIBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWERinside-BigData.com
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...Ganesan Narayanasamy
 
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPCExceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPCinside-BigData.com
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformGanesan Narayanasamy
 
Covid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power SystemsCovid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power SystemsGanesan Narayanasamy
 
Programming Models for Exascale Systems
Programming Models for Exascale SystemsProgramming Models for Exascale Systems
Programming Models for Exascale Systemsinside-BigData.com
 
Trends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient PerformanceTrends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient Performanceinside-BigData.com
 
Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors
Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors
Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors Michelle Holley
 
BXI: Bull eXascale Interconnect
BXI: Bull eXascale InterconnectBXI: Bull eXascale Interconnect
BXI: Bull eXascale Interconnectinside-BigData.com
 
Enabling Multi-access Edge Computing (MEC) Platform-as-a-Service for Enterprises
Enabling Multi-access Edge Computing (MEC) Platform-as-a-Service for EnterprisesEnabling Multi-access Edge Computing (MEC) Platform-as-a-Service for Enterprises
Enabling Multi-access Edge Computing (MEC) Platform-as-a-Service for EnterprisesMichelle Holley
 
Real time machine learning proposers day v3
Real time machine learning proposers day v3Real time machine learning proposers day v3
Real time machine learning proposers day v3mustafa sarac
 
Mellanox Announces HDR 200 Gb/s InfiniBand Solutions
Mellanox Announces HDR 200 Gb/s InfiniBand SolutionsMellanox Announces HDR 200 Gb/s InfiniBand Solutions
Mellanox Announces HDR 200 Gb/s InfiniBand Solutionsinside-BigData.com
 

Mais procurados (20)

Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
 
IBM HPC Transformation with AI
IBM HPC Transformation with AI IBM HPC Transformation with AI
IBM HPC Transformation with AI
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
InfiniBand In-Network Computing Technology and Roadmap
InfiniBand In-Network Computing Technology and RoadmapInfiniBand In-Network Computing Technology and Roadmap
InfiniBand In-Network Computing Technology and Roadmap
 
IBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWERIBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWER
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPCExceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
 
DOME 64-bit μDataCenter
DOME 64-bit μDataCenterDOME 64-bit μDataCenter
DOME 64-bit μDataCenter
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
 
Covid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power SystemsCovid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power Systems
 
Programming Models for Exascale Systems
Programming Models for Exascale SystemsProgramming Models for Exascale Systems
Programming Models for Exascale Systems
 
Trends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient PerformanceTrends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient Performance
 
Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors
Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors
Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors
 
BXI: Bull eXascale Interconnect
BXI: Bull eXascale InterconnectBXI: Bull eXascale Interconnect
BXI: Bull eXascale Interconnect
 
Enabling Multi-access Edge Computing (MEC) Platform-as-a-Service for Enterprises
Enabling Multi-access Edge Computing (MEC) Platform-as-a-Service for EnterprisesEnabling Multi-access Edge Computing (MEC) Platform-as-a-Service for Enterprises
Enabling Multi-access Edge Computing (MEC) Platform-as-a-Service for Enterprises
 
Real time machine learning proposers day v3
Real time machine learning proposers day v3Real time machine learning proposers day v3
Real time machine learning proposers day v3
 
Mellanox Announces HDR 200 Gb/s InfiniBand Solutions
Mellanox Announces HDR 200 Gb/s InfiniBand SolutionsMellanox Announces HDR 200 Gb/s InfiniBand Solutions
Mellanox Announces HDR 200 Gb/s InfiniBand Solutions
 

Semelhante a Preparing to program Aurora at Exascale - Early experiences and future directions

Virtual platform
Virtual platformVirtual platform
Virtual platformsean chen
 
Nt1310 Unit 5 Algorithm
Nt1310 Unit 5 AlgorithmNt1310 Unit 5 Algorithm
Nt1310 Unit 5 AlgorithmAngie Lee
 
6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_finalYutaka Kawai
 
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...Pradeep Singh
 
Introduction to FPGA, VHDL
Introduction to FPGA, VHDL  Introduction to FPGA, VHDL
Introduction to FPGA, VHDL Amr Rashed
 
HiPEAC Computing Systems Week 2022_Mario Porrmann presentation
HiPEAC Computing Systems Week 2022_Mario Porrmann presentationHiPEAC Computing Systems Week 2022_Mario Porrmann presentation
HiPEAC Computing Systems Week 2022_Mario Porrmann presentationVEDLIoT Project
 
Track A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBMTrack A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBMchiportal
 
ProjectVault[VivekKumar_CS-C_6Sem_MIT].pptx
ProjectVault[VivekKumar_CS-C_6Sem_MIT].pptxProjectVault[VivekKumar_CS-C_6Sem_MIT].pptx
ProjectVault[VivekKumar_CS-C_6Sem_MIT].pptxVivek Kumar
 
00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2Yutaka Kawai
 
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systemsdairsie
 
NFV and SDN: 4G LTE and 5G Wireless Networks on Intel(r) Architecture
NFV and SDN: 4G LTE and 5G Wireless Networks on Intel(r) ArchitectureNFV and SDN: 4G LTE and 5G Wireless Networks on Intel(r) Architecture
NFV and SDN: 4G LTE and 5G Wireless Networks on Intel(r) ArchitectureMichelle Holley
 
ONNC - 0.9.1 release
ONNC - 0.9.1 releaseONNC - 0.9.1 release
ONNC - 0.9.1 releaseLuba Tang
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesJeff Larkin
 
Fletcher Framework for Programming FPGA
Fletcher Framework for Programming FPGAFletcher Framework for Programming FPGA
Fletcher Framework for Programming FPGAGanesan Narayanasamy
 
SDN, OpenFlow, NFV, and Virtual Network
SDN, OpenFlow, NFV, and Virtual NetworkSDN, OpenFlow, NFV, and Virtual Network
SDN, OpenFlow, NFV, and Virtual NetworkTim4PreStartup
 
NIOS II Processor.ppt
NIOS II Processor.pptNIOS II Processor.ppt
NIOS II Processor.pptAtef46
 
OSGi: Best Tool In Your Embedded Systems Toolbox
OSGi: Best Tool In Your Embedded Systems ToolboxOSGi: Best Tool In Your Embedded Systems Toolbox
OSGi: Best Tool In Your Embedded Systems ToolboxBrett Hackleman
 

Semelhante a Preparing to program Aurora at Exascale - Early experiences and future directions (20)

Virtual platform
Virtual platformVirtual platform
Virtual platform
 
Nt1310 Unit 5 Algorithm
Nt1310 Unit 5 AlgorithmNt1310 Unit 5 Algorithm
Nt1310 Unit 5 Algorithm
 
6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final
 
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
 
Introduction to FPGA, VHDL
Introduction to FPGA, VHDL  Introduction to FPGA, VHDL
Introduction to FPGA, VHDL
 
HiPEAC Computing Systems Week 2022_Mario Porrmann presentation
HiPEAC Computing Systems Week 2022_Mario Porrmann presentationHiPEAC Computing Systems Week 2022_Mario Porrmann presentation
HiPEAC Computing Systems Week 2022_Mario Porrmann presentation
 
Track A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBMTrack A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBM
 
Multicore
MulticoreMulticore
Multicore
 
ProjectVault[VivekKumar_CS-C_6Sem_MIT].pptx
ProjectVault[VivekKumar_CS-C_6Sem_MIT].pptxProjectVault[VivekKumar_CS-C_6Sem_MIT].pptx
ProjectVault[VivekKumar_CS-C_6Sem_MIT].pptx
 
00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2
 
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
 
A Common Backend for Hardware Acceleration of DSLs on FPGA
A Common Backend for Hardware Acceleration of DSLs on FPGAA Common Backend for Hardware Acceleration of DSLs on FPGA
A Common Backend for Hardware Acceleration of DSLs on FPGA
 
High speed I/O
High speed I/OHigh speed I/O
High speed I/O
 
NFV and SDN: 4G LTE and 5G Wireless Networks on Intel(r) Architecture
NFV and SDN: 4G LTE and 5G Wireless Networks on Intel(r) ArchitectureNFV and SDN: 4G LTE and 5G Wireless Networks on Intel(r) Architecture
NFV and SDN: 4G LTE and 5G Wireless Networks on Intel(r) Architecture
 
ONNC - 0.9.1 release
ONNC - 0.9.1 releaseONNC - 0.9.1 release
ONNC - 0.9.1 release
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best Practices
 
Fletcher Framework for Programming FPGA
Fletcher Framework for Programming FPGAFletcher Framework for Programming FPGA
Fletcher Framework for Programming FPGA
 
SDN, OpenFlow, NFV, and Virtual Network
SDN, OpenFlow, NFV, and Virtual NetworkSDN, OpenFlow, NFV, and Virtual Network
SDN, OpenFlow, NFV, and Virtual Network
 
NIOS II Processor.ppt
NIOS II Processor.pptNIOS II Processor.ppt
NIOS II Processor.ppt
 
OSGi: Best Tool In Your Embedded Systems Toolbox
OSGi: Best Tool In Your Embedded Systems ToolboxOSGi: Best Tool In Your Embedded Systems Toolbox
OSGi: Best Tool In Your Embedded Systems Toolbox
 

Mais de inside-BigData.com

Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...inside-BigData.com
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolversinside-BigData.com
 
Scientific Applications and Heterogeneous Architectures
Scientific Applications and Heterogeneous ArchitecturesScientific Applications and Heterogeneous Architectures
Scientific Applications and Heterogeneous Architecturesinside-BigData.com
 
SW/HW co-design for near-term quantum computing
SW/HW co-design for near-term quantum computingSW/HW co-design for near-term quantum computing
SW/HW co-design for near-term quantum computinginside-BigData.com
 
Deep Learning State of the Art (2020)
Deep Learning State of the Art (2020)Deep Learning State of the Art (2020)
Deep Learning State of the Art (2020)inside-BigData.com
 
DGX SuperPOD: Instant Infrastructure for AI Leadership
DGX SuperPOD: Instant Infrastructure for AI LeadershipDGX SuperPOD: Instant Infrastructure for AI Leadership
DGX SuperPOD: Instant Infrastructure for AI Leadershipinside-BigData.com
 

Mais de inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
 
Data Parallel Deep Learning
Data Parallel Deep LearningData Parallel Deep Learning
Data Parallel Deep Learning
 
Making Supernovae with Jets
Making Supernovae with JetsMaking Supernovae with Jets
Making Supernovae with Jets
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 
Scientific Applications and Heterogeneous Architectures
Scientific Applications and Heterogeneous ArchitecturesScientific Applications and Heterogeneous Architectures
Scientific Applications and Heterogeneous Architectures
 
SW/HW co-design for near-term quantum computing
SW/HW co-design for near-term quantum computingSW/HW co-design for near-term quantum computing
SW/HW co-design for near-term quantum computing
 
FPGAs and Machine Learning
FPGAs and Machine LearningFPGAs and Machine Learning
FPGAs and Machine Learning
 
Deep Learning State of the Art (2020)
Deep Learning State of the Art (2020)Deep Learning State of the Art (2020)
Deep Learning State of the Art (2020)
 
DGX SuperPOD: Instant Infrastructure for AI Leadership
DGX SuperPOD: Instant Infrastructure for AI LeadershipDGX SuperPOD: Instant Infrastructure for AI Leadership
DGX SuperPOD: Instant Infrastructure for AI Leadership
 

Último

My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Último (20)

My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

Preparing to program Aurora at Exascale - Early experiences and future directions

  • 1. www.anl.gov Argonne Leadership Computing Facility IWOCL, Apr. 28, 2020 Preparing to Program Aurora at Exascale Hal Finkel, et al.
  • 3. Computing for large, tightly-coupled problems. Lots of computational capability paired with lots of high-performance memory. High computational density paired with a high-throughput low-latency network. What is (traditional) supercomputing?
  • 10. 10 Aurora: A High-level View  Intel-Cray machine arriving at Argonne in 2021  Sustained Performance > 1Exafoos  Intel Xeon orocessors and Intel Xe GPUs  2 Xeons (Saoohire Raoids)  6 GPUs (Ponte Vecchio [PVC])  Greater than 10 PB of total memory  Cray Slingshot fabric and Shasta olatform  Filesystem  Distributed Asynchronous Object Store (DAOS)  ≥ 230 PB of storage caoacity  Bandwidth of > 25 TB/s  Lustre  150 PB of storage caoacity  Bandwidth of ~1TB/s
  • 11. 11 Aurora Compute Node 2 Intel Xeon (Saoohire Raoids) orocessors 6 Xe Architecture based GPUs (Ponte Vecchio)  All to all connection  Low latency and high bandwidth 8 Slingshot Fabric endooints Unifed Memory Architecture across CPUs and GPUs Unified Memory and GPU ↔ GPU connectivity… Important implications for the programming model!
  • 13. 13 Three Pillars SimulatinSimulatin DataData LearningLearning DirectvesDirectves Parallel RuntmesParallel Runtmes Silver LibrariesSilver Libraries HPC LanguagesHPC Languages Big Data StackBig Data Stack Statstcal LibrariesStatstcal Libraries Priductvity LanguagesPriductvity Languages DatabasesDatabases DL FramewirksDL Framewirks Linear Algebra LibrariesLinear Algebra Libraries Statstcal LibrariesStatstcal Libraries Priductvity LanguagesPriductvity Languages Math Libraries, C++ Standard Library, libcMath Libraries, C++ Standard Library, libc I/O, MessagingI/O, Messaging SchedulerScheduler Linux Kernel, POSIXLinux Kernel, POSIX Cimpilers, Perfirmance Tiils, DebuggersCimpilers, Perfirmance Tiils, Debuggers Cintainers, VisualizatinCintainers, Visualizatin
  • 14. 14 MPI on Aurora • Intel MPI & Cray MPI • MPI 3.0 standard comoliant • The MPI library will be thread safe • Allow aoolications to use MPI from individual threads • Efcient MPIHTꔰREADHMUTIPLE (locking ootimitations) • Asynchronous orogress in all tyoes of nonblocking communication • Nonblocking send-receive and collectives • One-sided ooerations • ꔰardware and tooology ootimited collective imolementations • Suooorts MPI tools interface • Control variables MPICꔰ Cꔰ4 libfabric Slingshot orovider ꔰardware OFI MPICꔰ Cꔰ4 libfabric Slingshot orovider ꔰardware OFI
  • 15. 15 Intel Fortran for Aurora Fortran 2008 OoenMP 5 A signifcant amount of the code run on oresent day machines is written in Fortran. Most new code develooment seems to have shifted to other languages (mainly C++).
  • 16. 16 oneAPI Industry soecifcation from Intel ( httos://www.oneaoi.com/soec/)  Language and libraries to target orogramming across diverse architectures (DPC++, APIs, low level interface) Intel oneAPI oroducts and toolkits ( httos://software.intel.com/ONEAPI)  Imolementations of the oneAPI soecifcation and analysis and debug tools to helo orogramming
  • 17. 17 Intel MKL – Math Kernel Library ꔰighly tuned algorithms  FFT  Linear algebra (BLAS, LAPACK)  Soarse solvers  Statistical functions  Vector math  Random number generators Ootimited for every Intel olatform oneAPI MKL (oneMKL)  httos://software.intel.com/en-us/oneaoi/mkl oneAPI beta includes DPC++ support
  • 18. 18 AI and Analytics Libraries to suooort AI and Analytics  OneAPI Deeo Neural Network Library (oneDNN)  ꔰigh Performance Primitives to accelerate deeo learning frameworks  Powers Tensorfow, PyTorch, MXNet, Intel Cafe, and more  Running on Gen9 today (via OoenCL)  oneAPI Data Analytics Library (oneDAL)  Classical Machine Learning Algorithms  Easy to use one line daal4oy Python interfaces  Powers Scikit-Learn  Aoache Soark MLlib
  • 19. 19 Heterogenous System Programming Models Aoolications will be using a variety of orogramming models for Exascale:  CUDA  OoenCL  ꔰIP  OoenACC  OoenMP  DPC++/SYCL  Kokkos  Raja Not all systems will suooort all models Libraries may helo you abstract some orogramming models.
  • 20. 20 OpenMP 5 OoenMP 5 constructs will orovide directives based orogramming model for Intel GPUs Available for C, C++, and Fortran A oortable model exoected to be suooorted on a variety of olatforms (Aurora, Frontier, Perlmutter, …) Ootimited for Aurora For Aurora, OoenACC codes could be converted into OoenMP  ALCF staf will assist with conversion, training, and best oractices  Automated translation oossible through the clacc conversion tool (for C/C++) https://wwwsipenmpsirg/
  • 21. 21 OpenMP 4.5/5: for Aurora OoenMP 4.5/5 soecifcation has signifcant uodates to allow for imoroved suooort of accelerator devices Distributng iteratons of the loop to threads Ofoading code to run on accelerator Controlling data transfer between devices #pragma omp target [clause[[,] clause],…] structured-block #pragma omp declare target declaratons-defniton-seq #pragma omp declare variant*(variant- func-id) clause new-line functon defniton or declaraton #pragma omp teams [clause[[,] clause],…] structured-block #pragma omp distribute [clause[[,] clause], …] for-loops #pragma omp loop* [clause[[,] clause],…] for-loops map ([map-type:] list ) map-type:=allic | tifrim | frim | ti | … #pragma omp target data [clause[[,] clause],…] structured-block #pragma omp target update [clause[[,] clause], …] * denites OMP 5Envirinment variables • Cintril default device thriugh OMP_DEFAULT_DEVICE • Cintril ifiad with OMP_TARGET_OFFLOAD Runtme suppirt riutnes: • viid omp_set_default_device(int dev_num) • int omp_get_default_device(viid) • int omp_get_num_devices(viid) • int omp_get_num_teams(viid)
  • 22. 22 DPC++ (Data Parallel C++) and SYCL SYCL  Khronos standard soecifcation  SYCL is a C++ based abstraction layer (standard C++11)  Builds on OoenCL concepts (but single-source)  SYCL is designed to be as close to standard C++ as possible Current Imolementations of SYCL:  ComouteCPP™ (www.codeolay.com)  Intel SYCL (github.com/intel/llvm)  triSYCL (github.com/triSYCL/triSYCL)  hioSYCL (github.com/illuhad/hioSYCL)  Runs on today’s CPUs and nVidia, AMD, Intel GPUs SYCL 1.2.1 or later C++11 or later
  • 23. 23 DPC++ (Data Parallel C++) and SYCL SYCL  Khronos standard soecifcation  SYCL is a C++ based abstraction layer (standard C++11)  Builds on OoenCL concepts (but single-source)  SYCL is designed to be as close to standard C++ as possible Current Imolementations of SYCL:  ComouteCPP™ (www.codeolay.com)  Intel SYCL (github.com/intel/llvm)  triSYCL (github.com/triSYCL/triSYCL)  hioSYCL (github.com/illuhad/hioSYCL)  Runs on today’s CPUs and nVidia, AMD, Intel GPUs DPC++  Part of Intel oneAPI soecifcation  Intel extension of SYCL to suooort new innovative features  Incoroorates SYCL 1.2.1 soecifcation and Unifed Shared Memory  Add language or runtime extensions as needed to meet user needs Intel DPC++ SYCL 1.2.1 or later C++11 or later Extensions Descripton Unifed Shared Memiry (USM) defnes piinter-based memiry accesses and management interfacess In-irder queues defnes simple in-irder semantcs fir queues, ti simplify cimmin ciding patternss Reductin privides reductin abstractin ti the ND- range firm if parallel_firs Optinal lambda name remives requirement ti manually name lambdas that defne kernelss Subgriups defnes a griuping if wirk-items within a wirk-griups Data fiw pipes enables efcient First-In, First-Out (FIFO) cimmunicatin (FPGA-inly) httos://soec.oneaoi.com/oneAPI/Elements/docoo/docooHroot.html#extensions-table
  • 24. 24 OpenMP 5 Host DeviceTransfer data and execution control extern void init(float*, float*, int); extern void output(float*, int); void vec_mult(float*p, float*v1, float*v2, int N) {   int i;   init(v1, v2, N);   #pragma omp target teams distribute parallel for simd map(to: v1[0:N], v2[0:N]) map(from: p[0:N])   for (i=0; i<N; i++)   {       p[i] = v1[i]*v2[i];   }   output(p, N); } Creates teams if threads in the target device Distributes iteratins ti the threads, where each thread uses SIMD parallelism Cintrilling data transfer
  • 25. 25 void foo( int *A ) { default_selector selector; // Selectors determine which device to dispatch to. { queue myQueue(selector); // Create queue to submit work to, based on selector // Wrap data in a sycl::buffer buffer<cl_int, 1> bufferA(A, 1024);   myQueue.submit([&](handler& cgh) { //Create an accessor for the sycl buffer. auto writeResult = bufferA.get_access<access::mode::discard_write>(cgh);   // Kernel cgh.parallel_for<class hello_world>(range<1>{1024}, [=](id<1> idx) { writeResult[idx] = idx[0]; }); // End of the kernel function }); // End of the queue commands } // End of scope, wait for the queued work to stop. ... } Get a device SYCL bufer using hist piinter Kernel Queue iut if scipe Data accessir SYCL Examples Host DeviceTransfer data and execution control
• 27. Performance Portability
A performance-portable application:
 1) Is portable, and
 2) Runs on diverse architectures with reasonable performance.
• 28. The Development Workflow?
Science Problem → Choose Algorithms → Implement and Test Algorithms → Optimize Algorithms → Run high-performance code!
(Knowledge of System Architecture and Tools informs each step.)
• 30. The Real Workflow...
Science Problem → Choose Algorithms for the Target Architectures → Implement and Test Algorithms → Optimize Algorithms → Run high-performance code!
(Knowledge of System Architecture and Tools informs each step.)
Trade-offs between:
 ● Basis functions
 ● Resolution
 ● Lagrangian vs. Eulerian representations
 ● Renormalization and regularization schemes
 ● Solver techniques
 ● Evolved vs. computed degrees of freedom
 ● And more…
These choices cannot be made by a compiler!
• 31. Performance Portability Is Possible!
Does this mean that performance portability is impossible? No, but it does mean that performance-portable applications tend to be highly parameterizable.
• 32. On the Usage of Abstract Models
http://llvm-hpc2-workshop.github.io/slides/Tian.pdf
In 2015, many codes used OpenMP directly to express parallelism; only a minority of applications used abstraction libraries (TBB and Thrust on this chart).
• 33. On the Usage of Abstract Models
But this is changing…
 ● We're seeing even greater adoption of OpenMP, but…
 ● Many applications are not using OpenMP directly. Abstraction libraries are gaining in popularity:
 ● Well-established libraries such as TBB and Thrust
 ● RAJA (https://github.com/LLNL/RAJA)
 ● Kokkos (https://github.com/kokkos)
These libraries make heavy use of C++ lambdas. They can use OpenMP and/or other compiler directives under the hood, but on GPUs the backend is more likely DPC++/HIP/CUDA. (A minimal Kokkos sketch follows.)
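To illustrate the lambda-based style these libraries rely on, here is a minimal Kokkos sketch (not from the slides; the kernel label and sizes are illustrative). The same source can be compiled against an OpenMP, CUDA, HIP, or SYCL backend selected at build time.

 #include <Kokkos_Core.hpp>

 int main(int argc, char* argv[]) {
   Kokkos::initialize(argc, argv);
   {
     const int N = 1 << 20;
     // Views allocate in the default execution space's memory
     // (host memory for OpenMP, device memory for CUDA/HIP).
     Kokkos::View<double*> x("x", N), y("y", N);
     const double a = 2.0;

     // The lambda is the per-iteration body; the backend decides how
     // iterations map to threads and vector lanes on the target.
     Kokkos::parallel_for("axpy", N, KOKKOS_LAMBDA(const int i) {
       y(i) += a * x(i);
     });
     Kokkos::fence(); // wait for the kernel to complete
   }
   Kokkos::finalize();
   return 0;
 }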
• 34. On the Usage of Abstract Models
And starting with C++17, the standard library has parallel algorithms too (expanded into a full example below):

 // For example, parallel and vectorized:
 std::sort(std::execution::par_unseq, vec.begin(), vec.end());
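Expanding that one-liner into a self-contained sketch (not from the slides): this assumes a standard library that actually implements the C++17 execution policies (for example, libstdc++ backed by TBB, or MSVC's implementation).

 #include <algorithm>
 #include <execution>
 #include <functional>
 #include <iostream>
 #include <numeric>
 #include <vector>

 int main() {
   std::vector<double> vec(1 << 20);
   std::iota(vec.begin(), vec.end(), 0.0); // fill with 0, 1, 2, ...

   // Parallel, vectorized transform: square each element in place.
   std::transform(std::execution::par_unseq, vec.begin(), vec.end(),
                  vec.begin(), [](double v) { return v * v; });

   // Parallel, vectorized sort, descending.
   std::sort(std::execution::par_unseq, vec.begin(), vec.end(),
             std::greater<double>());

   std::cout << "largest: " << vec.front() << "\n";
   return 0;
 }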
• 35. Compiler Optimizations for Parallel Code...
Why can't programmers just write the code optimally?
 ● Because what is optimal is different on different architectures.
 ● Because programmers use abstraction layers and may not be able to write the optimal code directly:

 in library1:
 void foo() {
   std::for_each(std::execution::par_unseq, vec1.begin(), vec1.end(), ...);
 }

 in library2:
 void bar() {
   std::for_each(std::execution::par_unseq, vec2.begin(), vec2.end(), ...);
 }

 foo();
 bar();
• 36. Compiler Optimizations for Parallel Code...

 void foo(double * restrict a, double * restrict b, etc.) {
   #pragma omp parallel for
   for (i = 0; i < n; ++i) {
     a[i] = e[i]*(b[i]*c[i] + d[i]) + f[i];
     m[i] = q[i]*(n[i]*o[i] + p[i]) + r[i];
   }
 }

Split the loop:

 void foo(double * restrict a, double * restrict b, etc.) {
   #pragma omp parallel for
   for (i = 0; i < n; ++i) {
     a[i] = e[i]*(b[i]*c[i] + d[i]) + f[i];
   }
   #pragma omp parallel for
   for (i = 0; i < n; ++i) {
     m[i] = q[i]*(n[i]*o[i] + p[i]) + r[i];
   }
 }

Or should we fuse instead?
• 37. Compiler Optimizations for Parallel Code...

 void foo(double * restrict a, double * restrict b, etc.) {
   #pragma omp parallel for
   for (i = 0; i < n; ++i) {
     a[i] = e[i]*(b[i]*c[i] + d[i]) + f[i];
   }
   #pragma omp parallel for
   for (i = 0; i < n; ++i) {
     m[i] = q[i]*(n[i]*o[i] + p[i]) + r[i];
   }
 }

 void foo(double * restrict a, double * restrict b, etc.) {
   #pragma omp parallel
   {
     #pragma omp for
     for (i = 0; i < n; ++i) {
       a[i] = e[i]*(b[i]*c[i] + d[i]) + f[i];
     }
     #pragma omp for
     for (i = 0; i < n; ++i) {
       m[i] = q[i]*(n[i]*o[i] + p[i]) + r[i];
     }
   }
 }

(We might want to fuse the parallel regions, so the fork/join cost is paid only once.)
• 38. Compiler Understanding Parallelism: It Can Help
Rodinia - hotspot3D
 ./3D 512 8 100 ../data/hotspot3D/power_512x8 ../data/hotspot3D/temp_512x8
Intel Core i9, 10 cores, 20 threads, 51 runs, with and without:
 ● aa => alias-attribute propagation
 ● ap => argument privatization
The compiler understands the parallelism well enough to obtain better pointer-aliasing results than for the base version.
(Work by Johannes Doerfert; see our IWOMP 2018 paper.)
• 39. Memory Layout and Placement
It is really hard for compilers to change memory layouts and, in general, to determine what memory is needed where. The Kokkos C++ library has memory placement and layout policies:

 View<const double***, Layout, Space, MemoryTraits<RandomAccess>> name(...);

Constant random-access data might be put into texture memory on a GPU, for example. Using the right memory layout and placement helps a lot! (A concrete sketch follows.)
https://trilinos.org/oldsite/events/trilinos_user_group_2013/presentations/2013-11-TUG-Kokkos-Tutorial.pdf
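To make the policy parameters concrete, here is a hedged sketch (not from the slides; the dimensions and names are illustrative, and the RandomAccess trait only changes the load path on devices that support it).

 #include <Kokkos_Core.hpp>

 int main(int argc, char* argv[]) {
   Kokkos::initialize(argc, argv);
   {
     const int N = 256, M = 512;

     // Column-major layout: consecutive GPU threads varying i touch
     // consecutive addresses (coalesced loads/stores).
     Kokkos::View<double**, Kokkos::LayoutLeft> A("A", N, M);

     Kokkos::parallel_for("fill", N, KOKKOS_LAMBDA(const int i) {
       for (int j = 0; j < M; ++j) A(i, j) = i + j;
     });

     // A read-only, random-access alias of the same data; on a CUDA
     // device Kokkos may route these loads through the texture path.
     Kokkos::View<const double**, Kokkos::LayoutLeft,
                  Kokkos::MemoryTraits<Kokkos::RandomAccess>> Aconst = A;

     double sum = 0.0;
     Kokkos::parallel_reduce("sum", N, KOKKOS_LAMBDA(const int i, double& s) {
       for (int j = 0; j < M; ++j) s += Aconst(i, j);
     }, sum);
     Kokkos::fence();
   }
   Kokkos::finalize();
   return 0;
 }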
• 40. So Where Does This Leave Us?
As you might imagine, nothing is perfect yet...
 ● Language: OpenMP: simple directives have yielded to complicated directives. DPC++: modern C++; simple cases will become simpler over time. Kokkos/RAJA: modern C++.
 ● Default execution model: OpenMP: fork-join. DPC++: work queue (probably better for expressing scalable parallelism). Kokkos/RAJA: fork-join.
 ● Compiler optimization potential: OpenMP: high. DPC++: low (the dynamic work queue obscures structure). Kokkos/RAJA: medium (greatly depends on the underlying backend).
• 41. So Where Does This Leave Us?
As you might imagine, nothing is perfect yet...
 ● Integrates with highly-parameterized code: OpenMP: low/medium. DPC++: high. Kokkos/RAJA: high.
 ● Helps with data layout: OpenMP: no. DPC++: no (not yet). Kokkos/RAJA: yes.
 ● Good accelerator-to-accelerator transfer/dispatch: OpenMP: no (not yet). DPC++: no (not yet). Kokkos/RAJA: no (not yet).
• 43. Conclusions
 ● Future supercomputers will continue to advance scientific progress in a variety of domains.
 ● Applications will rely on high-performance libraries as well as parallel-programming models.
 ● DPC++/SYCL will be a critical programming model on future HPC platforms.
 ● We will continue to study the extent to which compiler optimizations assist the development of portably-performant applications, versus the ability to explicitly parameterize and dynamically compose the implementations of algorithms.
 ● Parallel programming models will continue to evolve: support for data layouts and less-host-centric models will be explored.
• 44. 44 Acknowledgements
Argonne Leadership Computing Facility and Computational Science Division Staff
This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nation's exascale computing imperative. This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.