SlideShare uma empresa Scribd logo
1 de 22
Baixar para ler offline
<https://www.mentor.com/embedded-software/>
Android is a trademark of Google Inc. Use of this trademark is subject to Google Permissions. Linux is the registered trademark of Linus Torvalds
in the U.S. and other countries.
Qt is a registered trade mark of Qt Company Oy. All other trademarks mentioned in this document are trademarks of their respective owners.
Speeding up Programs
with OpenACC in GCC
2019-02-03
Thomas Schwinge, <thomas@codesourcery.com>
Sourcery Services, Embedded Platform Solutions
Mentor, a Siemens Business
<https://fosdem.org/2019/schedule/event/openacc>
HPC, Big Data and Data Science devroom
FOSDEM 2019
... using the compute power of
GPUs and other accelerators
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
Introduction, Agenda
●
Proven in production use for decades, GCC (the GNU Compiler
Collection) offers C, C++, Fortran, and other compilers for a
multitude of target systems.
●
Over the last few years, we – formerly known as "CodeSourcery",
now a group in "Mentor, a Siemens Business" – added support for
the directive-based OpenACC programming model.
●
Requiring only few changes to your existing source code,
OpenACC allows for easy parallelization and code offloading to
GPUs and other accelerators.
●
We will present a short introduction of GCC and OpenACC,
implementation status, examples and performance results.
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
GCC (GNU Compiler Collection)
●
<https://gcc.gnu.org/>
●
<https://en.wikipedia.org/wiki/GNU_Compiler_Collection>
“The GNU Compiler Collection (GCC) is a compiler system
produced by the GNU Project supporting various programming
languages. GCC is a key component of the GNU toolchain and the
standard compiler for most Unix-like operating systems. [...]”
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
GCC (GNU Compiler Collection)
●
Production-quality support:
– C
– C++
– Fortran
●
GCC is relevant in HPC
See, for example, “results taken from the XALT database
at the National Institute of Computational Sciences (NICS)
covering a period from October 2014 through June 2015.
This data is from a Cray XC30 supercomputer called
Darter” as shown in Figure 2 of "Community Use of XALT
in Its First Year in Production," R. Budiardja, M. Fahey, R.
McLay, P. M. Don, B. Hadri, and D. James, In Proceedings
of the Second International Workshop on HPC User
Support Tools, HUST '15, Nov. 2015.
doi.acm.org/10.1145/2834996.2835000.
In his WACCPD 2018
keynote, Jack Wells (ORNL)
had some more recent data.
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
GPU architecture
Example: Nvidia GPU Kepler K20 (GK110)
Nvidia GPU Kepler
K20 (GK110), which
is a few years old,
but general concepts
are still the same
NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
GPU architecture
Example: Nvidia GPU Kepler K20 (GK110)
NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
GPU architecture
abstractly
<http://www.martinpeniak.com/images/cuda_fig2.png>
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
OpenACC
●
Base on your existing source code, few modifications required
●
Set of simple directives
– Similar to OpenMP
●
Easy for the user to mark up:
– Memory regions for data copy
– Parallel/vector loops
– Reduction operations
– Etc.
●
Provide hints to the compiler to better use available parallelism
●
Abstract enough to apply to a wide range of parallel architectures
<https://www.openacc.org/>
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
Example: matrix multiplication
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
float t = 0;
for (k = 0; k < n; k++)
t += a[i][k] * b[k][j];
c[i][j] = t;
}}
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
Example: matrix multiplication
OpenACC parallel construct
#pragma acc parallel  // spawn parallel execution
copyin(a[0:n][0:n], b[0:n][0:n]) copyout(c[0:n][0:n]) // data copy
{
#pragma acc loop gang // multiprocessors
for (i = 0; i < n; i++) {
#pragma acc loop worker // groups of PTX “warps”
for (j = 0; j < n; j++) {
float t = 0;
#pragma acc loop vector  // “threads” in 32-size PTX “warps”
reduction(+: t) // “t” needs a “+” reduction operation
for (k = 0; k < n; k++)
t += a[i][k] * b[k][j];
c[i][j] = t;
}}}
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
Example: matrix multiplication
OpenACC kernels construct
#pragma acc kernels  // spawn parallel execution
copyin(a[0:n][0:n], b[0:n][0:n]) copyout(c[0:n][0:n]) // data copy
{
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
float t = 0;
for (k = 0; k < n; k++)
t += a[i][k] * b[k][j];
c[i][j] = t;
}}}
(But with GCC, the OpenACC kernels construct does not yet deliver the expected
performance.)
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
OpenACC support in GCC upstream
●
Most of OpenACC 2.0a, 2.5; 2.6 in development branch
●
OpenACC 2.7 not yet
●
Code offloading to:
– Nvidia GPUs (nvptx): upstream
– AMD GPUs (GCN): integrating upstream (GCC 10, 2020?)
– Any others could be done, too, including multi-threaded CPU
●
Host system
– Any with suitable drivers to talk to the accelerator
– We’re testing on x86_64 and ppc64le GNU/Linux
●
In you’re interested to help or fund development, talk to us :-)
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
Real-world application example: lsdalton
●
Program suite for “calculations of molecular properties”
●
Mixed C/some C++/a lot of Fortran source code
●
OpenACC directives in legacy Fortran code
– “The history of Dalton starts in fall of 1983 [...]”
●
Project: use GCC/OpenACC with Nvidia GPU for acceleration, then
tune performance
– Compare to PGI compiler
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
lsdalton: SLOCCount
SLOC Directory SLOC-by-Language (Sorted)
519483 src f90=349006,fortran=64312,ansic=63756,sh=41296,cpp=1113
343934 external ansic=149330,f90=89252,cpp=83139,python=11996,perl=9705,sh=475,pascal=37
Totals grouped by language (dominant language first):
F90: 438258 (50.76%) ansic: 213086 (24.68%) cpp: 84252 (9.76%) Fortran: 64312 (7.45%)
sh: 41771 (4.84%) python: 11996 (1.39%) Perl: 9705 (1.12%) pascal: 37 (0.00%)
Total Physical Source Lines of Code (SLOC) = 863,417 [for comparison: GCC w/o testsuites: ~3,500,000]
Development Effort Estimate, Person-Years (Person-Months) = 242.14 (2,905.65) (Basic COCOMO model, Person-Months = 2.4 *
(KSLOC**1.05))
Schedule Estimate, Years (Months) = 4.31 (51.76) (Basic COCOMO model, Months = 2.5 * (person-
months**0.38))
Estimated Average Number of Developers (Effort/Schedule) = 56.14
Total Estimated Cost to Develop = $ 32,709,451 (average salary = $56,286/year, overhead = 2.40).
SLOCCount, Copyright (C) 2001-2004 David A. Wheeler
SLOCCount is Open Source Software/Free Software, licensed under the GNU GPL.
SLOCCount comes with ABSOLUTELY NO WARRANTY, and you are welcome to redistribute it under certain conditions as specified by
the GNU GPL license; see the documentation for details.
Please credit this data as "generated using David A. Wheeler's 'SLOCCount'."
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
lsdalton: build
●
This includes seven external Git submodules
●
Non-trivial build system
– Works for the PGI compiler
– Also for “stock” GCC (no acceleration)
– But not for GCC with OpenACC/nvptx offloading
●
Figure out how to use the desired GCC for all components, with -fopenacc enabled etc.
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
lsdalton: a few code changes required
●
Getting it to run with GCC with OpenACC/nvptx offloading
– Apples-to-apples comparison
●
Replace OpenACC kernels with OpenACC parallel constructs
●
Do not link against optimized Nvidia cuBLAS/PGI BLAS library (… for PGI compiler only)
– Replaced by Netlib Fortran BLAS functions, annotated with OpenACC directives
– Replace a few “questionable” uses of OpenACC directives with ones
supported by GCC
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
lsdalton: performance optimization
●
Several cycles of:
– Profile
– Analyze
– Tune
●
Teach GCC how to do “the right things”
– Did not alter lsdalton sources anymore
●
Infrequently report issues to Nvidia
– PTX to hardware JIT compilation
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
lsdalton: performance optimization
●
Eventually achieved great performance testing results for the
specific lsdalton configuration tested in this project
– https://blogs.mentor.com/embedded/blog/2018/06/06/evaluating-the-
performance-of-openacc-in-gcc/
– Optimizing GCC’s code generation also benefited other benchmarks
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
Real-world application example: n-body
●
Set of n individual bodies, each with initial position and velocity
●
Distance-dependent force between each pair
●
Problem is to calculate the trajectory of each body
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
Stock Ubuntu 18.04: GCC 8 with OpenACC/nvptx offloading support
$ sudo aptitude install gcc-8-offload-nvptx
The following NEW packages will be installed:
gcc-8-offload-nvptx libgomp-plugin-nvptx1{a} nvptx-tools{a}
0 packages upgraded, 3 newly installed, 0 to remove and 239 not upgraded.
Need to get 8057 kB of archives. After unpacking 54.4 MB will be used.
Do you want to continue? [Y/n/?] y
Get: 1 [...] nvptx-tools amd64 0.20180301-1 [27.8 kB]
Get: 2 [...] libgomp-plugin-nvptx1 amd64 8.2.0-1ubuntu2~18.04 [13.4 kB]
Get: 3 [...] gcc-8-offload-nvptx amd64 8.2.0-1ubuntu2~18.04 [8016 kB]
Fetched 8057 kB in 16s (501 kB/s)
[...]
Setting up nvptx-tools (0.20180301-1) ...
[...]
Setting up libgomp-plugin-nvptx1:amd64 (8.2.0-1ubuntu2~18.04) ...
Setting up gcc-8-offload-nvptx (8.2.0-1ubuntu2~18.04) ...
[...]
(Packaging work not done by us, but by Matthias "doko" Klose
who is maintainer of the Debian and Ubuntu GCC packages.)
Beware: these distribution packages are many months
behind our current (public!) development branch: in
terms of bug fixes, features, performance optimizations.
© Mentor, a Siemens Business
<https://www.mentor.com/embedded-software/>
Real-world application example: n-body
●
Live demo
– … on this 2013 laptop, with a powerful CPU, but mediocre GPU
© Mentor Graphics Corp. Company
Confidential
www.mentor.com/embedded
www.mentor.com/embedded
Thank you!

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

OpenACC Monthly Highlights March 2019
OpenACC Monthly Highlights March 2019OpenACC Monthly Highlights March 2019
OpenACC Monthly Highlights March 2019
 
OpenACC Monthly Highlights: February 2022
OpenACC Monthly Highlights: February 2022OpenACC Monthly Highlights: February 2022
OpenACC Monthly Highlights: February 2022
 
OpenACC Monthly Highlights: June 2020
OpenACC Monthly Highlights: June 2020OpenACC Monthly Highlights: June 2020
OpenACC Monthly Highlights: June 2020
 
OpenACC Monthly Highlights: May 2019
OpenACC Monthly Highlights: May 2019OpenACC Monthly Highlights: May 2019
OpenACC Monthly Highlights: May 2019
 
The Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACCThe Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACC
 
OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020
 
OpenACC Monthly Highlights: February 2021
OpenACC Monthly Highlights: February 2021OpenACC Monthly Highlights: February 2021
OpenACC Monthly Highlights: February 2021
 
OpenACC Monthly Highlights: May 2020
OpenACC Monthly Highlights: May 2020OpenACC Monthly Highlights: May 2020
OpenACC Monthly Highlights: May 2020
 
OpenACC Monthly Highlights: January 2021
OpenACC Monthly Highlights: January 2021OpenACC Monthly Highlights: January 2021
OpenACC Monthly Highlights: January 2021
 
OpenACC Monthly Highlights: August 2020
OpenACC Monthly Highlights: August 2020OpenACC Monthly Highlights: August 2020
OpenACC Monthly Highlights: August 2020
 
OpenACC Monthly Highlights: March 2021
OpenACC Monthly Highlights: March 2021OpenACC Monthly Highlights: March 2021
OpenACC Monthly Highlights: March 2021
 
OpenACC Monthly Highlights: July 2020
OpenACC Monthly Highlights: July 2020OpenACC Monthly Highlights: July 2020
OpenACC Monthly Highlights: July 2020
 
OpenACC Monthly Highlights September 2019
OpenACC Monthly Highlights September 2019OpenACC Monthly Highlights September 2019
OpenACC Monthly Highlights September 2019
 
OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020
 
OpenACC Highlights - February
OpenACC Highlights - FebruaryOpenACC Highlights - February
OpenACC Highlights - February
 
OpenACC Monthly Highlights April 2018
OpenACC Monthly Highlights April 2018OpenACC Monthly Highlights April 2018
OpenACC Monthly Highlights April 2018
 
OpenACC Monthly Highlights - February 2018
OpenACC Monthly Highlights - February 2018OpenACC Monthly Highlights - February 2018
OpenACC Monthly Highlights - February 2018
 
OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019
 
GPU Computing with Python and Anaconda: The Next Frontier
GPU Computing with Python and Anaconda: The Next FrontierGPU Computing with Python and Anaconda: The Next Frontier
GPU Computing with Python and Anaconda: The Next Frontier
 
GTC 2017: Powering the AI Revolution
GTC 2017: Powering the AI RevolutionGTC 2017: Powering the AI Revolution
GTC 2017: Powering the AI Revolution
 

Semelhante a Speeding up Programs with OpenACC in GCC

Semelhante a Speeding up Programs with OpenACC in GCC (20)

SACON NY 19: "Creating an effective developer experience for cloud-native apps"
SACON NY 19: "Creating an effective developer experience for cloud-native apps"SACON NY 19: "Creating an effective developer experience for cloud-native apps"
SACON NY 19: "Creating an effective developer experience for cloud-native apps"
 
PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018
 
Application Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systemsApplication Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systems
 
Deep Learning Edge
Deep Learning Edge Deep Learning Edge
Deep Learning Edge
 
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
 
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
The 2nd half. Scaling to the next^2
The 2nd half. Scaling to the next^2The 2nd half. Scaling to the next^2
The 2nd half. Scaling to the next^2
 
Tensorflow in Docker
Tensorflow in DockerTensorflow in Docker
Tensorflow in Docker
 
Velocity NY 2018 "The Cloud Native Developer Workflow"
Velocity NY 2018 "The Cloud Native Developer Workflow"Velocity NY 2018 "The Cloud Native Developer Workflow"
Velocity NY 2018 "The Cloud Native Developer Workflow"
 
OpenACC and Open Hackathons Monthly Highlights: September 2022.pptx
OpenACC and Open Hackathons Monthly Highlights: September 2022.pptxOpenACC and Open Hackathons Monthly Highlights: September 2022.pptx
OpenACC and Open Hackathons Monthly Highlights: September 2022.pptx
 
CloudNativeLondon 2018: "In Search of the Perfect Cloud Native Developer Expe...
CloudNativeLondon 2018: "In Search of the Perfect Cloud Native Developer Expe...CloudNativeLondon 2018: "In Search of the Perfect Cloud Native Developer Expe...
CloudNativeLondon 2018: "In Search of the Perfect Cloud Native Developer Expe...
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
 
microXchg 2019: "Creating an Effective Developer Experience for Cloud-Native ...
microXchg 2019: "Creating an Effective Developer Experience for Cloud-Native ...microXchg 2019: "Creating an Effective Developer Experience for Cloud-Native ...
microXchg 2019: "Creating an Effective Developer Experience for Cloud-Native ...
 
Cuda
CudaCuda
Cuda
 
Spark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on KubernetesSpark day 2017 - Spark on Kubernetes
Spark day 2017 - Spark on Kubernetes
 
Kubernetes to improve business scalability and processes (Cloud & DevOps Worl...
Kubernetes to improve business scalability and processes (Cloud & DevOps Worl...Kubernetes to improve business scalability and processes (Cloud & DevOps Worl...
Kubernetes to improve business scalability and processes (Cloud & DevOps Worl...
 
How to lock a Python in a cage? Managing Python environment inside an R project
How to lock a Python in a cage?  Managing Python environment inside an R projectHow to lock a Python in a cage?  Managing Python environment inside an R project
How to lock a Python in a cage? Managing Python environment inside an R project
 
Build 2016 - B880 - Top 6 Reasons to Move Your C++ Code to Visual Studio 2015
Build 2016 - B880 - Top 6 Reasons to Move Your C++ Code to Visual Studio 2015Build 2016 - B880 - Top 6 Reasons to Move Your C++ Code to Visual Studio 2015
Build 2016 - B880 - Top 6 Reasons to Move Your C++ Code to Visual Studio 2015
 

Mais de inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
inside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
inside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
inside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
inside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
inside-BigData.com
 

Mais de inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Último

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Speeding up Programs with OpenACC in GCC

  • 1. <https://www.mentor.com/embedded-software/> Android is a trademark of Google Inc. Use of this trademark is subject to Google Permissions. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries. Qt is a registered trade mark of Qt Company Oy. All other trademarks mentioned in this document are trademarks of their respective owners. Speeding up Programs with OpenACC in GCC 2019-02-03 Thomas Schwinge, <thomas@codesourcery.com> Sourcery Services, Embedded Platform Solutions Mentor, a Siemens Business <https://fosdem.org/2019/schedule/event/openacc> HPC, Big Data and Data Science devroom FOSDEM 2019 ... using the compute power of GPUs and other accelerators
  • 2. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> Introduction, Agenda ● Proven in production use for decades, GCC (the GNU Compiler Collection) offers C, C++, Fortran, and other compilers for a multitude of target systems. ● Over the last few years, we – formerly known as "CodeSourcery", now a group in "Mentor, a Siemens Business" – added support for the directive-based OpenACC programming model. ● Requiring only few changes to your existing source code, OpenACC allows for easy parallelization and code offloading to GPUs and other accelerators. ● We will present a short introduction of GCC and OpenACC, implementation status, examples and performance results.
  • 3. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> GCC (GNU Compiler Collection) ● <https://gcc.gnu.org/> ● <https://en.wikipedia.org/wiki/GNU_Compiler_Collection> “The GNU Compiler Collection (GCC) is a compiler system produced by the GNU Project supporting various programming languages. GCC is a key component of the GNU toolchain and the standard compiler for most Unix-like operating systems. [...]”
  • 4. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> GCC (GNU Compiler Collection) ● Production-quality support: – C – C++ – Fortran ● GCC is relevant in HPC See, for example, “results taken from the XALT database at the National Institute of Computational Sciences (NICS) covering a period from October 2014 through June 2015. This data is from a Cray XC30 supercomputer called Darter” as shown in Figure 2 of "Community Use of XALT in Its First Year in Production," R. Budiardja, M. Fahey, R. McLay, P. M. Don, B. Hadri, and D. James, In Proceedings of the Second International Workshop on HPC User Support Tools, HUST '15, Nov. 2015. doi.acm.org/10.1145/2834996.2835000. In his WACCPD 2018 keynote, Jack Wells (ORNL) had some more recent data.
  • 5. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> GPU architecture Example: Nvidia GPU Kepler K20 (GK110) Nvidia GPU Kepler K20 (GK110), which is a few years old, but general concepts are still the same NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
  • 6. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> GPU architecture Example: Nvidia GPU Kepler K20 (GK110) NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
  • 7. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> GPU architecture abstractly <http://www.martinpeniak.com/images/cuda_fig2.png>
  • 8. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> OpenACC ● Base on your existing source code, few modifications required ● Set of simple directives – Similar to OpenMP ● Easy for the user to mark up: – Memory regions for data copy – Parallel/vector loops – Reduction operations – Etc. ● Provide hints to the compiler to better use available parallelism ● Abstract enough to apply to a wide range of parallel architectures <https://www.openacc.org/>
  • 9. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> Example: matrix multiplication for (i = 0; i < n; i++) { for (j = 0; j < n; j++) { float t = 0; for (k = 0; k < n; k++) t += a[i][k] * b[k][j]; c[i][j] = t; }}
  • 10. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> Example: matrix multiplication OpenACC parallel construct #pragma acc parallel // spawn parallel execution copyin(a[0:n][0:n], b[0:n][0:n]) copyout(c[0:n][0:n]) // data copy { #pragma acc loop gang // multiprocessors for (i = 0; i < n; i++) { #pragma acc loop worker // groups of PTX “warps” for (j = 0; j < n; j++) { float t = 0; #pragma acc loop vector // “threads” in 32-size PTX “warps” reduction(+: t) // “t” needs a “+” reduction operation for (k = 0; k < n; k++) t += a[i][k] * b[k][j]; c[i][j] = t; }}}
  • 11. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> Example: matrix multiplication OpenACC kernels construct #pragma acc kernels // spawn parallel execution copyin(a[0:n][0:n], b[0:n][0:n]) copyout(c[0:n][0:n]) // data copy { for (i = 0; i < n; i++) { for (j = 0; j < n; j++) { float t = 0; for (k = 0; k < n; k++) t += a[i][k] * b[k][j]; c[i][j] = t; }}} (But with GCC, the OpenACC kernels construct does not yet deliver the expected performance.)
  • 12. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> OpenACC support in GCC upstream ● Most of OpenACC 2.0a, 2.5; 2.6 in development branch ● OpenACC 2.7 not yet ● Code offloading to: – Nvidia GPUs (nvptx): upstream – AMD GPUs (GCN): integrating upstream (GCC 10, 2020?) – Any others could be done, too, including multi-threaded CPU ● Host system – Any with suitable drivers to talk to the accelerator – We’re testing on x86_64 and ppc64le GNU/Linux ● In you’re interested to help or fund development, talk to us :-)
  • 13. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> Real-world application example: lsdalton ● Program suite for “calculations of molecular properties” ● Mixed C/some C++/a lot of Fortran source code ● OpenACC directives in legacy Fortran code – “The history of Dalton starts in fall of 1983 [...]” ● Project: use GCC/OpenACC with Nvidia GPU for acceleration, then tune performance – Compare to PGI compiler
  • 14. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> lsdalton: SLOCCount SLOC Directory SLOC-by-Language (Sorted) 519483 src f90=349006,fortran=64312,ansic=63756,sh=41296,cpp=1113 343934 external ansic=149330,f90=89252,cpp=83139,python=11996,perl=9705,sh=475,pascal=37 Totals grouped by language (dominant language first): F90: 438258 (50.76%) ansic: 213086 (24.68%) cpp: 84252 (9.76%) Fortran: 64312 (7.45%) sh: 41771 (4.84%) python: 11996 (1.39%) Perl: 9705 (1.12%) pascal: 37 (0.00%) Total Physical Source Lines of Code (SLOC) = 863,417 [for comparison: GCC w/o testsuites: ~3,500,000] Development Effort Estimate, Person-Years (Person-Months) = 242.14 (2,905.65) (Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05)) Schedule Estimate, Years (Months) = 4.31 (51.76) (Basic COCOMO model, Months = 2.5 * (person- months**0.38)) Estimated Average Number of Developers (Effort/Schedule) = 56.14 Total Estimated Cost to Develop = $ 32,709,451 (average salary = $56,286/year, overhead = 2.40). SLOCCount, Copyright (C) 2001-2004 David A. Wheeler SLOCCount is Open Source Software/Free Software, licensed under the GNU GPL. SLOCCount comes with ABSOLUTELY NO WARRANTY, and you are welcome to redistribute it under certain conditions as specified by the GNU GPL license; see the documentation for details. Please credit this data as "generated using David A. Wheeler's 'SLOCCount'."
  • 15. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> lsdalton: build ● This includes seven external Git submodules ● Non-trivial build system – Works for the PGI compiler – Also for “stock” GCC (no acceleration) – But not for GCC with OpenACC/nvptx offloading ● Figure out how to use the desired GCC for all components, with -fopenacc enabled etc.
  • 16. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> lsdalton: a few code changes required ● Getting it to run with GCC with OpenACC/nvptx offloading – Apples-to-apples comparison ● Replace OpenACC kernels with OpenACC parallel constructs ● Do not link against optimized Nvidia cuBLAS/PGI BLAS library (… for PGI compiler only) – Replaced by Netlib Fortran BLAS functions, annotated with OpenACC directives – Replace a few “questionable” uses of OpenACC directives with ones supported by GCC
  • 17. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> lsdalton: performance optimization ● Several cycles of: – Profile – Analyze – Tune ● Teach GCC how to do “the right things” – Did not alter lsdalton sources anymore ● Infrequently report issues to Nvidia – PTX to hardware JIT compilation
  • 18. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> lsdalton: performance optimization ● Eventually achieved great performance testing results for the specific lsdalton configuration tested in this project – https://blogs.mentor.com/embedded/blog/2018/06/06/evaluating-the- performance-of-openacc-in-gcc/ – Optimizing GCC’s code generation also benefited other benchmarks
  • 19. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> Real-world application example: n-body ● Set of n individual bodies, each with initial position and velocity ● Distance-dependent force between each pair ● Problem is to calculate the trajectory of each body
  • 20. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> Stock Ubuntu 18.04: GCC 8 with OpenACC/nvptx offloading support $ sudo aptitude install gcc-8-offload-nvptx The following NEW packages will be installed: gcc-8-offload-nvptx libgomp-plugin-nvptx1{a} nvptx-tools{a} 0 packages upgraded, 3 newly installed, 0 to remove and 239 not upgraded. Need to get 8057 kB of archives. After unpacking 54.4 MB will be used. Do you want to continue? [Y/n/?] y Get: 1 [...] nvptx-tools amd64 0.20180301-1 [27.8 kB] Get: 2 [...] libgomp-plugin-nvptx1 amd64 8.2.0-1ubuntu2~18.04 [13.4 kB] Get: 3 [...] gcc-8-offload-nvptx amd64 8.2.0-1ubuntu2~18.04 [8016 kB] Fetched 8057 kB in 16s (501 kB/s) [...] Setting up nvptx-tools (0.20180301-1) ... [...] Setting up libgomp-plugin-nvptx1:amd64 (8.2.0-1ubuntu2~18.04) ... Setting up gcc-8-offload-nvptx (8.2.0-1ubuntu2~18.04) ... [...] (Packaging work not done by us, but by Matthias "doko" Klose who is maintainer of the Debian and Ubuntu GCC packages.) Beware: these distribution packages are many months behind our current (public!) development branch: in terms of bug fixes, features, performance optimizations.
  • 21. © Mentor, a Siemens Business <https://www.mentor.com/embedded-software/> Real-world application example: n-body ● Live demo – … on this 2013 laptop, with a powerful CPU, but mediocre GPU
  • 22. © Mentor Graphics Corp. Company Confidential www.mentor.com/embedded www.mentor.com/embedded Thank you!