Webinar 20130404

Molecular Simulation with
GROMACS on CUDA GPUs
Erik Lindahl
GROMACS is used on a
wide range of resources

We’re comfortably
on the single-μs
scale today

Larger machines often
mean larger systems,
not necessarily longer
simulations
Why use GPUs?
Throughput
• Sampling
• Free energy
• Cost efficiency
• Power efficiency
• Desktop simulation
• Upgrade old machines
• Low-end clusters

Performance
• Longer simulations
• Parallel GPU simulation using Infiniband
• High-end efficiency by using fewer nodes
• Reach timescales not possible with CPUs
Many GPU programs today
Caveat emptor:
It is much easier to get an unoptimized reference
problem/algorithm to scale,
i.e., you see much better
relative scaling before
introducing any optimization on the CPU side
When comparing programs:
What matters is absolute performance
(ns/day), not the relative speedup!
Gromacs-4.5 with OpenMM
Previous version - what was the limitation?
Gromacs running
entirely on CPU as
a fancy interface
Actual simulation running
entirely on GPU
using OpenMM kernels
Only a few select algorithms worked
Multi-CPU sometimes beat GPU performance...
Why don’t we use the CPU too?

CPU:
• 0.5-1 TFLOP
• Random memory access OK (not great)
• Great for complex latency-sensitive stuff (domain decomposition, etc.)

GPU:
• ~2 TFLOP
• Random memory access won’t work
• Great for throughput
Gromacs-4.6 next-generation GPU implementation:
Programming model
[Diagram: domain decomposition with dynamic load balancing across MPI ranks. CPU: each MPI rank runs N OpenMP threads (PME). GPU: 1 GPU context per MPI rank. Load balancing between CPU and GPU.]
Heterogeneous CPU-GPU acceleration in GROMACS-4.6

Wallclock time for an MD step:
~0.5 ms if we want to simulate 1μs/day
We cannot afford to lose all previous acceleration tricks!
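
The per-step time budget above can be checked with a short calculation. This is a minimal sketch: the function name is made up for illustration, and the slide does not state which time step it assumes, so both 2 fs and 5 fs are shown. The ~0.5 ms figure appears to match a 5 fs step.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def wallclock_per_step_ms(target_ns_per_day: float, timestep_fs: float) -> float:
    """Wall-clock budget per MD step (in ms) needed to reach a target throughput."""
    # 1 ns = 1e6 fs, so this is the number of MD steps that must fit in one day.
    steps_per_day = target_ns_per_day * 1e6 / timestep_fs
    return SECONDS_PER_DAY / steps_per_day * 1e3


# 1 microsecond/day = 1000 ns/day
for dt in (2.0, 5.0):
    print(f"dt = {dt} fs -> {wallclock_per_step_ms(1000.0, dt):.3f} ms/step")
```

With a 2 fs step the budget shrinks to under 0.2 ms per step, which is why the larger time steps enabled by the CPU "tricks" that follow matter so much.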
CPU trick 1: all-bond constraints
• Δt limited by fast motions - 1fs
• Remove bond vibrations
  • SHAKE (iterative, slow) - 2fs
  • Problematic in parallel (won’t work)
  • Compromise: constrain h-bonds only - 1.4fs
• GROMACS (LINCS):
  • LINear Constraint Solver
  • Approximate matrix inversion expansion
  • Fast & stable - much better than SHAKE
  • Non-iterative
  • Enables 2-3 fs timesteps
  • Parallel: P-LINCS (from Gromacs 4.0)

LINCS:
A) Move w/o constraint (t=1 to t=2’)
B) Project out motion along bonds (t=2’’)
C) Correct for rotational extension of bond (t=2)
CPU trick 2: Virtual sites
• Next fastest motions are H-angle vibrations and
  rotations of CH3/NH2 groups
• Try to remove them:
  • Ideal H position from heavy atoms
  • CH3/NH2 groups are made rigid
  • Calculate forces, then project back onto heavy atoms
  • Integrate only heavy atom positions, reconstruct H’s
• Enables 5fs timesteps!
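
The "construct, compute forces, project back" cycle can be illustrated with the simplest linear two-atom virtual site, whose position is a weighted average of two heavy atoms with weights 1-a and a. This is only a sketch: the function names are hypothetical, and real hydrogens typically need the three- and four-atom constructions (3fd, 3fad, 3out, 4fd), which follow the same pattern with more involved geometry.

```python
from typing import Tuple

Vec = Tuple[float, float, float]


def construct_vsite2(xi: Vec, xj: Vec, a: float) -> Vec:
    """Place the virtual site on the line between atoms i and j."""
    return tuple((1.0 - a) * p + a * q for p, q in zip(xi, xj))


def spread_force_vsite2(fv: Vec, a: float) -> Tuple[Vec, Vec]:
    """Project the force on the virtual site back onto the two real atoms.

    The same weights used for the position are applied to the force,
    so the total force is conserved.
    """
    fi = tuple((1.0 - a) * c for c in fv)
    fj = tuple(a * c for c in fv)
    return fi, fj


xv = construct_vsite2((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), a=0.25)
fi, fj = spread_force_vsite2((4.0, 0.0, -2.0), a=0.25)
print(xv, fi, fj)
```

Only the heavy atoms are integrated; the virtual site is rebuilt from their new positions at every step, so it carries no degrees of freedom of its own.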
[Figure: virtual site construction types 2, 3, 3fd, 3fad, 3out and 4fd, with construction parameters a, b, θ, |b|, |c|, |d|. Labels: Interactions, Degrees of Freedom.]
CPU trick 3: Non-rectangular
cells & decomposition

Load balancing works
for arbitrary triclinic cells

Lysozyme, 25k atoms
Rhombic dodecahedron
(36k atoms in cubic cell)

[Figure, excerpt from a paper page. FIG. 2: The domain decomposition cells (1-7) that communicate coordinates to cell 0. Cell 2 is hidden below cell 7. The zones that need to be communicated to cell 0 are dashed, rc is the cut-off radius. FIG. 3: The zones to communicate to the proces... see the text for details.]

All these “tricks” now work fine
with GPUs in GROMACS-4.6!

From neighborlists to cluster
pair lists in GROMACS-4.6
• x,y,z gridding: x,y grid, z sort, z bin
• Organize as tiles with all-vs-all interactions
• Cluster pairlist
[Figure: 4x4 tile in which every entry is marked X (all-vs-all interactions)]
Tiling circles is difficult
Need a lot of cubes
to cover a sphere
Interactions outside
cutoff should be 0.0
[Figure panels: Group cutoff vs. Verlet cutoff]

• GROMACS-4.6 calculates a “large enough” buffer
  zone so no interactions are missed
• Optimize nstlist for performance - no need to
  worry about missing any interactions with Verlet!
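
The idea behind the buffer zone can be sketched in a few lines: build the pair list with a list radius rl larger than the interaction cutoff rc, and the list stays valid as long as no particle has moved more than (rl - rc)/2 since it was built. Note the assumptions: this sketch uses a brute-force search, no periodic boundaries, and a worst-case displacement bound, whereas the slide only says that GROMACS-4.6 calculates a "large enough" buffer. All names are illustrative.

```python
import itertools
import math
import random


def pairs_within(positions, radius):
    """Brute-force set of index pairs (i, j), i < j, no further apart than `radius`."""
    return {
        (i, j)
        for i, j in itertools.combinations(range(len(positions)), 2)
        if math.dist(positions[i], positions[j]) <= radius
    }


def random_displacement(rng, max_len):
    """Random vector whose length is below `max_len`."""
    while True:
        v = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:
            break
    scale = rng.random() * max_len / norm
    return [c * scale for c in v]


rc, rlist = 0.9, 1.0          # interaction cutoff and list radius (nm)
rng = random.Random(0)
positions = [[rng.uniform(0.0, 3.0) for _ in range(3)] for _ in range(200)]

# Build the list once with the buffered radius.
pairlist = pairs_within(positions, rlist)

# Let every particle drift by less than half the buffer width.
max_move = (rlist - rc) / 2
moved = [
    [x + d for x, d in zip(p, random_displacement(rng, max_move))]
    for p in positions
]

# Every pair that is now inside the cutoff was already in the list.
missed = pairs_within(moved, rc) - pairlist
print(len(pairlist), "listed pairs,", len(missed), "missed")
```

A larger nstlist means the list is reused for more steps, so particles drift further and the buffer must grow; that is the performance trade-off the slide refers to.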
Tixel algorithm work-efficiency
8x8x8 tixels compared to a non performance-optimized Verlet scheme

Work efficiency       Verlet   Tixel pruned   Tixel non-pruned
rc=1.5, rl=1.6        0.82     0.58           0.36
rc=1.2, rl=1.3        0.75     0.52           0.29
rc=0.9, rl=1.0        0.73     0.42           0.21

Highly memory-efficient algorithm:
Can handle 20-40 million atoms with 2-3GB memory
Even cheap consumer cards will get you a long way
PME weak scaling
Xeon X5650 3T + C2075 / process
[Plot: iteration time per 1000 atoms (ms/step, 0-0.35) vs. system size/GPU (1000s of atoms, 1.5 to 3072). Series: 1xC2075 CUDA F kernel, 1xC2075 CPU total, 2xC2075 CPU total, 4xC2075 CPU total. Annotations: 480 μs/step (1500 atoms), 700 μs/step (6000 atoms).]

Complete time step including
kernel, h2d, d2h, CPU constraints,
CPU PME, CPU integration, OpenMP & MPI
Example performance: Systems with
~24,000 atoms, 2 fs time steps, NPT
[Bar chart: performance in ns/day, axis 0-300]

Amber, DHFR:
• CPU, 96 CPU cores
• GPU, 1xGTX680
• GPU, 4xGTX680

Gromacs, RNAse:
• CPU, 6 cores
• CPU, 2*8 cores
• 6 CPU cores + 1xK20c GPU
• 6 CPU cores + 1xGTX680 GPU
• dodec+vsites(5fs), 6 CPU cores
• dodec+vsites(5fs), 2*8 CPU cores
• dodec+vsites(5fs), 6 cores + 1xK20c
• dodec+vsites(5fs), 6 cores + 1xGTX680
The Villin headpiece
~8,000 atoms, 5 fs steps
explicit solvent
triclinic box
PME electrostatics
[Bar chart: ns/day, axis 0-1200. Bars: i7 3930K (GMX 4.5), i7 3930K (GMX 4.6), i7 3930K+GTX680, E5-2690+GTX Titan]

2,546 FPS (beat that, Battlefield 4)
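
The "FPS" figure can be tied back to the ns/day axis with a quick conversion. This sketch assumes that one "frame" means one MD step, which the slide implies but does not state; the 5 fs step is taken from the slide and the function name is made up.

```python
def ns_per_day(steps_per_second: float, timestep_fs: float) -> float:
    """Simulated time per day of wall-clock time, in nanoseconds."""
    seconds_per_day = 24 * 60 * 60
    return steps_per_second * seconds_per_day * timestep_fs * 1e-6  # fs -> ns


print(round(ns_per_day(2546, 5.0)))  # -> 1100
```

At 5 fs per step, 2,546 steps per second therefore corresponds to roughly 1.1 μs of simulated time per day.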
GLIC: Ion channel
membrane protein
150,000 atoms
Running on a simple desktop!

[Bar chart: ns/day, axis 0-40. Bars: i7 3930K (GMX 4.5), i7 3930K (GMX 4.6), i7 3930K+GTX680, E5-2690+GTX Titan]
Strong scaling of Reaction-Field and PME
1.5M atoms waterbox, RF cutoff=0.9nm, PME auto-tuned cutoff

[Log-log plot: performance (ns/day, 0.1-100) vs. #Processes-GPUs (1-100). Series: RF, RF linear scaling, PME, PME linear scaling]

Challenge: GROMACS has very short iteration times, which puts hard requirements on latency/bandwidth
Small systems often work best using only a single GPU!
GROMACS 4.6 extreme scaling
Scaling to 130 atoms/core: ADH protein 134k atoms, PME, rc >= 0.9
[Log-log plot: ns/day (1-1000) vs. #sockets (CPU or CPU+GPU, 1-64). Series: XK6/X2090, XK7/K20X, XK6 CPU only, XE6 CPU only]
Using GROMACS
with GPUs in practice
Compiling GROMACS with CUDA

• Make sure CUDA driver is installed
• Make sure CUDA SDK is in /usr/local/cuda
• Use the default GROMACS distribution
• Just run ‘cmake’ and we will detect CUDA
  automatically and use it
• gcc-4.7 works great as a compiler
• On Macs, you want to use icc (commercial)

Longer Mac story: Clang does not support OpenMP,
which gcc does. However, the current gcc versions for
Macs do not support AVX on the CPU. icc supports both!
Using GPUs in practice
In your mdp file:
cutoff-scheme   = Verlet
nstlist         = 10        ; likely 10-50
coulombtype     = pme       ; or reaction-field
vdw-type        = cut-off
nstcalcenergy   = -1        ; only when writing edr

• Verlet cutoff-scheme is more accurate
• Necessary for GPUs in GROMACS
• Use -testverlet mdrun option to force it w. old tpr files
• Slower on a single CPU, but scales well on CPUs too!
Shift modifier is applied to both coulomb and VdW by
default on GPUs - change with coulomb/vdw-modifier
Load balancing
rcoulomb        = 1.0
fourierspacing  = 0.12

• If we increase/decrease the coulomb direct-space
  cutoff and the reciprocal space PME grid spacing by
  the same factor, we maintain accuracy
• ... but we move work between CPU & GPU!
• By default, GROMACS-4.6 does this automatically at
  the start of each run - you will see diagnostic output
GROMACS excels when you combine a fairly fast
CPU and GPU. Currently, this means Intel CPUs.
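
The cutoff/grid trade-off described on this slide can be sketched as follows. The scaling estimates are rough assumptions (uniform density, FFT overhead ignored) and the function names are hypothetical; in practice mdrun does the tuning itself at the start of the run, as the slide says.

```python
def scale_pme_settings(rcoulomb: float, fourierspacing: float, factor: float):
    """Scale the direct-space cutoff and the PME grid spacing by the same factor."""
    return rcoulomb * factor, fourierspacing * factor


def relative_work(factor: float):
    """Rough work estimates relative to the unscaled settings.

    Direct-space pair count grows with the cutoff volume (factor**3);
    the number of PME grid points shrinks with the cell volume (factor**-3).
    """
    return factor ** 3, factor ** -3


for factor in (1.0, 1.1, 1.2):
    rc, spacing = scale_pme_settings(1.0, 0.12, factor)
    direct, grid = relative_work(factor)
    print(f"rcoulomb={rc:.2f} nm  fourierspacing={spacing:.3f} nm  "
          f"direct-space work x{direct:.2f}  grid points x{grid:.2f}")
```

With the non-bonded kernels on the GPU and PME on the CPU, a factor above 1 shifts work from the CPU to the GPU, and a factor below 1 shifts it back.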
Demo
Acknowledgments
• GROMACS: Berk Hess, David van der Spoel, Per Larsson, Mark Abraham
• Gromacs-GPU: Szilard Pall, Berk Hess, Rossen Apostolov
• Multi-Threaded PME: Roland Schulz, Berk Hess
• Nvidia: Mark Berger, Scott LeGrand, Duncan Poole, and others!
Test Drive K20 GPUs!
Experience The Acceleration

Run GROMACS on Tesla K20 GPU today
Sign up for FREE GPU Test Drive on remotely hosted clusters
www.nvidia.com/GPUTestDrive

Questions?

Contact us
Devang Sachdev - NVIDIA
dsachdev@nvidia.com
@DevangSachdev

GROMACS questions
Check www.gromacs.org
gmx-users@gromacs.org mailing list

Stream other webinars from GTC Express:
http://www.gputechconf.com/page/gtc-express-webinar.html
Register for the Next GTC
Express Webinar
Molecular Shape Searching on GPUs
Paul Hawkins, Applications Science Group Leader, OpenEye
Wednesday, May 22, 2013, 9:00 AM PDT
Register at www.gputechconf.com/gtcexpress
