Introduction to Parallelization and Performance Optimization
Presentation by Cristian Gomollon, Applications Technician at CSUC, delivered at the "4th Training Session on the Use of the Calculation Service", held on 17 March 2021 in virtual format.
Initial situation: the typical computer consists of a CPU (~4 cores), a limited amount of memory (~8 GB) and disk (~1 TB), with low efficiency...
Solution proposal: more and better CPUs, more memory, more and faster disk/network!
Why parallelize/optimize?
I have my own code and I want to parallelize/optimize its execution. How?
1º Analyse your code using a profiler
Initialization
Main loop
Finalization
Identify the section where your code spends most of its time, using a profiler.
1º Analyse your code using a profiler
Possible bounds (bottlenecks):
Compute
Memory
I/O
Profilers (and also tracers) can identify other types of bottlenecks/overheads too, such as bad memory alignment, cache misses or bad compiler "pathways".
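A real profiler (gprof, perf, or a vendor tool) produces this breakdown automatically; as a minimal hand-timing sketch in C, assuming a POSIX system (the three sections mirror the structure above):

#include <stdio.h>
#include <time.h>
/* Wall-clock seconds since an arbitrary starting point */
static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}
int main(void) {
    double t0 = now();
    /* ... initialization ... */
    double t1 = now();
    /* ... main loop (usually the hot spot) ... */
    double t2 = now();
    /* ... finalization ... */
    double t3 = now();
    printf("init %.3f s, main loop %.3f s, final %.3f s\n",
           t1 - t0, t2 - t1, t3 - t2);
    return 0;
}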
1º Analyse your code using a profiler
Note: not all of the code is suitable for parallelization or optimization. Also, these "formulas" are idealisations: in the real world, the overheads (parallel libraries/communication) have a relevant impact on performance...
Timing results (from the profile):
Variable setting and I/O output: ~1% of the time
Nested loop: ~98%
Std output: ~1%
Some interesting metrics, with t the total time, S the speedup, ts the serial code time, tp the parallelizable code time and N the number of cores:
RunTime: t(N) = ts + tp/N
Amdahl's law (Speedup): S = (ts + tp) / (ts + tp/N)
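For instance, taking the profile above (ts ≈ 2% of the total time, tp ≈ 98%) and an assumed count of 16 cores for illustration, Amdahl's law gives S = 1/(0.02 + 0.98/16) ≈ 12.3, and no matter how many cores are added the speedup can never exceed 1/0.02 = 50.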
1º Analyse your code using a profiler
2º Check if it is possible to parallelize/optimize that section
Typical, potentially parallel/efficient tasks: repetitive work such as loops/math ops.
Not so easy (but sometimes possible): I/O, e.g.
f = open("demofile.txt", "r")
In general, the repetitive parts of the code (loops/math ops) are
the best suited for a parallelization/optimization strategy.
2º Check if it is possible to parallelize/optimize that section
In summary
• Identify the section where your code spends the most time, using a profiler.
• Determine whether your code is compute, memory and/or I/O bound.
• Decide whether you need a shared memory or distributed memory/IO paradigm (OpenMP or MPI), or whether to call a well-tested optimized library (see the sketch below).
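For the "well-tested optimized library" option, a minimal sketch, assuming a CBLAS implementation (e.g., OpenBLAS or MKL) is installed and linked:

/* Matrix multiply delegated to an optimized BLAS instead of hand-written loops */
#include <cblas.h>
void matmul(int n, const float *A, const float *B, float *C) {
    /* C = 1.0*A*B + 0.0*C, all matrices n x n, row-major */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);
}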
Tips & Tricks: Programming hints (memory optimization)
//compute the sum of two arrays in parallel (MPI version)
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define N 1000000
int main(void) {
    MPI_Init(NULL, NULL);
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    /* Be careful with the memory: each rank allocates only its own
       chunk (assume world_size divides N) */
    int Ni = N / world_size;
    float *a = malloc(Ni * sizeof(float));
    float *b = malloc(Ni * sizeof(float));
    float *c = malloc(Ni * sizeof(float));
    /* Only rank 0 needs room for the full gathered result */
    float *c_full = (world_rank == 0) ? malloc(N * sizeof(float)) : NULL;
    int i, offset = world_rank * Ni;
    /* Initialize the local chunks of arrays a and b */
    for (i = 0; i < Ni; i++) {
        a[i] = (offset + i) * 2.0f;
        b[i] = (offset + i) * 3.0f;
    }
    /* Compute values of array c = a+b in parallel */
    for (i = 0; i < Ni; i++) {
        c[i] = a[i] + b[i];
    }
    /* Gather the partial results on rank 0 (the buffers are float,
       so the datatype is MPI_FLOAT) */
    MPI_Gather(c, Ni, MPI_FLOAT, c_full, Ni, MPI_FLOAT, 0, MPI_COMM_WORLD);
    free(a); free(b); free(c); free(c_full);
    MPI_Finalize();
    return 0;
}
//compute the sum of two arrays in parallel (OpenMP version)
#include <stdio.h>
#include <omp.h>
#define N 1000000
/* static: three 4 MB arrays would likely overflow the stack inside main */
static float a[N], b[N], c[N];
int main(void) {
    int i;
    /* Initialize arrays a and b */
    for (i = 0; i < N; i++) {
        a[i] = i * 2.0f;
        b[i] = i * 3.0f;
    }
    /* Compute values of array c = a+b in parallel */
    #pragma omp parallel shared(a, b, c) private(i)
    {
        #pragma omp for
        for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
        }
    }
    return 0;
}
Tips & Tricks: Programming hints (memory optimizations)
• Fortran and C have different memory layouts (row-major in C, column-major in Fortran): be sure that you traverse memory in the right order.
It is a good idea to transpose the 2nd matrix and multiply row by row in C (and the opposite in Fortran), as in the sketch below.
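A minimal sketch of the C (row-major) case, with hypothetical n x n matrices stored as flat arrays; Bt is the transpose of the second matrix:

/* C is row-major: A[i*n+k] walks a row contiguously. With the second
   matrix transposed, the inner loop also walks Bt row by row. */
void matmul_bt(int n, const float *A, const float *Bt, float *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * Bt[j*n + k]; /* both row-wise accesses */
            C[i*n + j] = sum;
        }
}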
Tips & Tricks: Programming hints (memory optimizations)
In MPI parallelization, be careful how you share the work between tasks, to optimize the memory usage (a sketch follows).
[Diagram: the array is split into separate chunks owned by Rank 0, Rank 1 and Rank 2, rather than replicated on all ranks.]
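A minimal sketch of this idea (hypothetical buffer sizes; assumes MPI_Init has already been called and that the number of ranks divides N):

#include <mpi.h>
#include <stdlib.h>
/* Distribute N elements so each rank holds only its N/size chunk,
   instead of every rank allocating the full array. */
void distribute(int N) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int Ni = N / size;
    float *full = NULL;
    if (rank == 0) full = malloc(N * sizeof(float)); /* only on the root */
    float *chunk = malloc(Ni * sizeof(float));       /* on every rank */
    MPI_Scatter(full, Ni, MPI_FLOAT, chunk, Ni, MPI_FLOAT, 0, MPI_COMM_WORLD);
    /* ... each rank now works on its own chunk ... */
    free(chunk);
    free(full);
}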
Tips & Tricks: Programming hints (loop parallelization)
• Try to parallelize the "outer" loop: if the "outer" loop has few elements (3 spatial dimensions, a reduced number of orbitals...), invert the loop nesting order first (see the sketch below).
Serial RunTime: 14 s
Parallel RunTime: 2 s with the large loop parallelized, vs. >1700 s with the short loop parallelized
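A minimal OpenMP sketch of the inversion, with a hypothetical large count n and a 3-element inner dimension:

#include <omp.h>
#define DIM 3
/* Parallelizing a 3-iteration loop would leave most threads idle;
   putting the large loop outermost exposes enough work. */
void scale(int n, double x[][DIM]) {
    int i, d;
    #pragma omp parallel for private(d)
    for (i = 0; i < n; i++)      /* large loop outermost and parallel */
        for (d = 0; d < DIM; d++)
            x[i][d] *= 2.0;
}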
Tips & Tricks: Programming hints(Synchronization)
Minimize, as much as possible, the synchronization overhead in MPI codes; one common technique, overlapping communication with computation, is sketched below.
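A minimal sketch, assuming a halo exchange with one hypothetical neighbor rank: non-blocking calls let useful work proceed before the single synchronization point.

#include <mpi.h>
/* Post the receive and send without blocking, compute on data that
   does not depend on the incoming halo, then wait once at the end. */
void exchange(float *halo_out, float *halo_in, int n, int neighbor) {
    MPI_Request reqs[2];
    MPI_Irecv(halo_in,  n, MPI_FLOAT, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(halo_out, n, MPI_FLOAT, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);
    /* ... computation that does not need halo_in ... */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}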
Tips & Tricks: Programming hints
• Fortran and C have different memory layouts: be sure that you are multiplying matrices in the right order. It is also a good idea to transpose the 2nd matrix before multiplying.
• Avoid loops that access through "pointers": aliasing can confuse the compiler and dramatically reduce performance.
• Try to parallelize the "outer" loop: if the "outer" loop has few elements (3 spatial dimensions, a reduced number of orbitals...), invert the loop nesting order...
• Unroll loops to minimize jumps: it is more efficient to have one big loop doing 3 inline operations (for example a 3D spatial operation) than 2 nested loops doing the same op.
• Avoid correlated loops: their interdependences make this kind of loop really difficult to parallelize (see the sketch below).
• In case of an I/O or memory bound: the best strategy is a multinode MPI parallelization.
• A tested parallel/optimized library/algorithm can reduce coding time and other issues.
Don't re-invent the wheel; "knowing the keywords" is not the same as "being a developer".
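A minimal sketch of the correlated-loop point, with hypothetical arrays x, y, z of length n:

#include <omp.h>
void example(int n, float *x, const float *y, float *z) {
    int i;
    /* Correlated loop: x[i] depends on x[i-1] (a prefix sum), so the
       iterations cannot safely run concurrently as written. */
    for (i = 1; i < n; i++)
        x[i] = x[i - 1] + y[i];
    /* Independent iterations: trivially parallel. */
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}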
How to generate SLURM script files: 1º Identify app parallelism
Thread parallelism:
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=NCORES
Process parallelism:
#SBATCH --ntasks=NCORES
#SBATCH --cpus-per-task=1
How to generate SLURM script files: 2º Determine the memory requirements
The partition choice is strongly dependent on the job memory requirements!!

#SBATCH --mem=63900
#SBATCH --cpus-per-task=8
#SBATCH --partition=std-fat

#SBATCH --mem=63900
#SBATCH --cpus-per-task=16
#SBATCH --partition=std

#SBATCH --mem=63900
#SBATCH --cpus-per-task=4
#SBATCH --partition=mem

#SBATCH --mem-per-cpu=3900
#SBATCH --ntasks=16
#SBATCH --partition=std

Partition     Memory/core*
std/gpu       ~4 GB
std-fat/KNL   ~8 GB
mem           ~24 GB

* Real memory values: std 3.9 GB/core; std-fat/KNL 7.9 GB/core; mem 23.9 GB/core.
How to generate SLURM script files: 3º RunTime requirements
#SBATCH --time=Thpc

Performance comparison. At first approximation, the HPC runtime Thpc can be estimated from the workstation runtime by scaling with the core ratio (Nws/Nhpc):

            WORKSTATION             HPC NODE
Cores       4 (Nws)                 48 (Nhpc)
RAM         8-16 GB                 192-384 GB
Disk        1 TB, 600 MB/s          200 TB, 4 GB/s
Network     Ethernet 1-10 Gbps      Infiniband 100-200 Gbps
How to generate SLURM script files: 4º Disk/IO requirements
"Two" types of application
Threaded/serial Multitask/MPI
Only one node: Multinode:
cd $SHAREDSCRATCH
or
cd $LOCALSCRATCH
cd $SHAREDSCRATCH
Or let the AI decide for you
cd $SCRATCH
How to generate SLURM script files: Summary
1. Identify your application parallelism.
2. Estimate the resources needed by your solving algorithm.
3. Estimate the required runtime as accurately as possible.
4. Determine your job I/O and input(files) requirements.
5. Determine which are the necessary output files and save only these files
in your own disk space.
Gaussian 16 (Threaded Example)
#!/bin/bash
#SBATCH -J gau16_test
#SBATCH -o gau_test_%j.log
#SBATCH -e gau_test_%j.err
#SBATCH -n 1
#SBATCH -c 16
#SBATCH -p std
#SBATCH --mem=30000
#SBATCH --time=10-00
module load gaussian/g16b1
INPUT_DIR=$HOME/gaussian_test/inputs
OUTPUT_DIR=$HOME/gaussian_test/outputs
cd $SCRATCH
cp -r $INPUT_DIR/* .
g16 < input.gau > output.out
mkdir -p $OUTPUT_DIR
cp output.out $OUTPUT_DIR
Notes: threaded application (-n 1, -c 16); less than 4 GB/core, so std partition; 10 days of RunTime; 'module load' sets up the environment to run the app.
Vasp (Multitask Example)
#!/bin/bash
#SBATCH -J vasp_test
#SBATCH -o vasp_test_%j.log
#SBATCH -e vasp_test_%j.err
#SBATCH -n 24
#SBATCH -c 1
#SBATCH --mem-per-cpu=7500
#SBATCH -p std-fat
#SBATCH --time=20:00
module load vasp/5.4.4
INPUT_DIR=$HOME/vasp_test/inputs
OUTPUT_DIR=$HOME/vasp_test/outputs
cd $SCRATCH
cp -r $INPUT_DIR/* .
srun `which vasp_std`
mkdir -p $OUTPUT_DIR
cp -r * $OUTPUT_DIR
Notes: multitask/MPI application (-n 24, -c 1); more than 4 GB/core but less than 8 GB/core, so std-fat partition; 20 minutes of RunTime; 'module load' sets up the environment; multitask apps require the 'srun' command (but there are exceptions, like ORCA).
Best Practices
• "More cores" not always is equal to "less runtime"
• Move only the necessary files(not all files each time).
• Use $SCRATCH as working directory.
• Try to keep only important files at $HOME
• Try to choose the partition and resoruces whose most fit to your job
requirements.
Tips & Tricks: Architectures and programming paradigms: MPI, OpenMP, CUDA
• OpenMP: "easy", but good only for compute-bound codes.
• MPI: not so "easy"; good for compute- and memory-bound codes.
• CUDA: absolutely (too much) flexible.
1º Analyse your code using a profiler
Identify a bound (bottleneck) "at a glance":
• If you get significantly more performance by increasing the number of cores on the same number of sockets... it is compute bound.
• If you get significantly more performance by increasing the number of sockets for the same number of cores... it is memory/bandwidth bound.
• If you get significantly more performance by increasing the number of nodes for the same number of cores and sockets (or by using a faster HDD)... it is I/O bound.
In fact, all real applications have different kinds of bounds in different parts of the code...