POLITECHNIKA WROCŁAWSKA
WYDZIAŁ INFORMATYKI I ZARZĄDZANIA
GPGPU driven simulations of zero-temperature 1D
Ising model with Glauber dynamics
Daniel Kosalla
FINAL THESIS
under supervision of
Dr inż. Dariusz Konieczny
Wrocław 2013
Acknowledgments:
Dr inż. Dariusz Konieczny
Contents
1. Motivation
2. Target
3. Scope of work
4. Theoretical background and proposed model
4.1. Ising model
4.2. Historic methods
4.3. Updating
4.4. Simulations
4.5. Distributions
4.6. Updating
4.7. Algorithm
5. General Purpose Graphic Processing Units
5.1. History of General Purpose GPUs
5.2. CUDA Architecture
6. CPU Simulations
6.1. Sequential algorithm
6.2. Random number generation on CPU
6.3. CPU performance
7. GPU Simulations - thread per simulation
7.1. Thread per simulation
7.2. Running the simulation
7.3. Solution space
7.4. Random Number Generators
7.5. Thread per simulation - static memory
7.6. Comparison of static and dynamic memory use
8. GPU Simulations - thread per spin
8.1. Thread per spin approach
8.2. Concurrent execution
8.3. Thread communication
8.4. Race conditions with shared memory
8.5. Thread per spin approach - reduction
8.6. Thread per spin approach - flags
8.7. Thread-per-spin performance
8.8. Thread-per-spin vs thread-per-simulation performance
9. Bond density for some W0 values
10. Conclusions
11. Future work
Appendix
A. Sequential algorithm - CPU
B. Thread per simulation - no optimizations
C. Thread per simulation - static memory
D. Thread per spin - no optimizations
E. Thread per spin - parallel reduction
F. Thread per spin - update flag
1. Motivation
In the presence of recent developments in SCMs (Single Chain Magnets) [1–4], the issue of criticality in 1D Ising-like magnet chains has turned out to be a promising field of study [5–8]. Some practical applications have already been suggested [2]. Unfortunately, the details of the general mechanism driving these changes in the real world are yet to be discovered. Traditionally, Monte Carlo simulations of the Ising model were conducted on CPUs (Central Processing Units). However, with the advent of powerful GPGPUs (General Purpose Graphics Processing Units), a new trend in scientific computing has emerged, enabling more detailed and even faster calculations.
2. Target
The following document describes the developed GPGPU applications capable of producing insights into the underlying physical problem, examines different approaches to conducting Monte Carlo simulations on GPGPUs, and compares the developed parallel GPGPU algorithms with a sequential CPU-based approach.
3. Scope of work
The scope of this document includes the development of five parallel GPGPU algorithms,
namely:
• Thread-per-simulation algorithm
• Thread-per-simulation algorithm with static memory
• Thread-per-spin algorithm
• Thread-per-spin algorithm with flags
• Thread-per-spin algorithm with reduction
4. Theoretical background and proposed model
4.1. Ising model
Although initially proposed by Wilhelm Lenz, it was Ernst Ising [10] who developed a mathematical model for ferromagnetic phenomena. The Ising model is usually represented by means of a lattice of spins - discrete variables taking values in {−1, 1} that represent the magnetic dipole moments of molecules in the material. The spins interact with their neighbours, which may cause a phase transition of the whole lattice.
4.2. Historic methods
A Monte Carlo (MC) simulation of the Ising model consists of a sequence of lattice updates. Traditionally, all spins (synchronous updating) or a single spin (sequential updating) are updated in each iteration, producing the lattice state for future iterations. The update methods are based on so-called dynamics that describe the spin interactions.
4.3. Updating
The idea of a partially synchronous updating scheme has been suggested [5–7]. This c-synchronous mode has a fixed fraction of spins being updated in one time step. However, one can imagine that the number of updated spins/molecules (often referred to as cL, where L denotes the size of the chain and c ∈ (0, 1]) changes as the simulation progresses. If so, then it is either linked to some characteristic of the system or may be expressed with some probability distribution (described in subsection 4.5). This approach of changing the c parameter can be applied while choosing spins randomly as well as in clusters (subsection 4.6), but only the latter will be considered in this document.
4.4. Simulations
In the proposed model, cL sequential updating is used, with c drawn from the provided distribution. The considered environment consists of a one-dimensional array of L spins s_i = ±1. The index of each spin is denoted by i = 1, 2, . . . , L. Periodic boundary conditions are assumed, i.e. s_{L+1} = s_1.
It has been shown in [8] that the system under synchronous Glauber dynamics reaches one of two absorbing states - ferromagnetic or antiferromagnetic. Therefore, let us introduce the density of bonds (ρ) as an order parameter:

ρ = Σ_{i=1}^{L} (1 − s_i s_{i+1}) / (2L)    (4.1)
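For illustration, Eq. (4.1) translates directly into a small C routine (a sketch with assumed names, not part of the thesis code):

```c
/* Density of bonds (Eq. 4.1) for a chain of L spins s_i = +1/-1 with
 * periodic boundary conditions (s_{L+1} = s_1). Each "active" bond,
 * i.e. a pair of opposite neighbours, contributes 2 to the sum.
 * Returns 0 for the ferromagnetic state, 1 for the antiferromagnetic one. */
double bond_density(const int *s, int L) {
    int sum = 0;
    for (int i = 0; i < L; i++)
        sum += 1 - s[i] * s[(i + 1) % L];
    return sum / (2.0 * L);
}
```

The two absorbing states thus sit at the two extremes of the order parameter.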
As stated in [8], phase transitions in synchronous updating modes and in the c-sequential mode [7] ought to be rather continuous (in cases other than c = 1 for the latter). A smooth phase transition can be observed in Figure 4.1.
Figure 4.1. The average density of active bonds in the stationary state <ρst> as a function of W0 for c = 0.9 and several lattice sizes L. Source: [7] B. Skorupa, K. Sznajd-Weron, and R. Topolnicki, Phase diagram for a zero-temperature Glauber dynamics under partially synchronous updates, Phys. Rev. E 86, 051113 (2012).
The system is considered at low temperatures (T), and therefore T = 0 can be assumed. The Metropolis algorithm can be considered a special case of zero-temperature Glauber dynamics for spin-1/2 systems. Each spin is flipped (s_i = −s_i) with rate W(ΔE) per unit time. While T = 0:

W(ΔE) = 1 if ΔE < 0;  W0 if ΔE = 0;  0 if ΔE > 0    (4.3)

In the case of T = 0, the ordering parameter W0 ∈ [0, 1] (e.g. Glauber rate W0 = 1/2 or Metropolis rate W0 = 1) is assumed to be constant. One can imagine that even the W0 parameter could in fact be changed during the simulation process, but that is out of the scope of the proposed model.
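As a minimal C sketch (the function name is an assumption, not from the thesis), the flip rate of Eq. (4.3) reads:

```c
/* Zero-temperature flip rate of Eq. (4.3): energy-lowering flips always
 * happen, energy-raising flips never do, and ties (dE == 0) are accepted
 * with the constant rate W0. */
double flip_rate(double dE, double W0) {
    if (dE < 0.0) return 1.0;
    if (dE == 0.0) return W0;
    return 0.0;
}
```

In a 1D chain, ΔE = 0 occurs exactly when a spin's two neighbours disagree, which is why W0 alone controls the stochastic part of the dynamics.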
The system starts in the fully antiferromagnetic state (ρ = ρaf = 1). After each time step, the changes are applied to the system and the next time step is evaluated. After a predetermined number of time steps, the state of the system is investigated. The whole simulation is shut down once the chain has reached the ferromagnetic state (ρ = ρf = 0) or a sufficiently large number of time steps has been inconclusive.
4.5. Distributions
During the simulation, c will not be fixed in time but will rather vary over [0, 1] according to a triangular continuous probability distribution [9], presented in Figure 4.2.

Figure 4.2. c can be any value in the interval [0, 1] but is most likely to be around c = 1/2. Other values are possible, but their probabilities are inversely proportional to their distance from c = 1/2.

While studying different initial conditions for the simulations, the distributions are adjusted so that their peak values lie in the range (0, 1). This is due to the fact that a peak value of 0.5 (as presented in the plot) means that in each time step half of the spins get updated.
4.6. Updating
The following algorithms use the triangular probability distribution to assign an appropriate c value before each time step. After (on average) L updated spins, one Monte Carlo step (MCS) has elapsed.
4.7. Algorithm
Transforming the above-mentioned rules into a set of instructions yields the following description and pseudocode (below):

Update cL consecutive spins starting from a randomly chosen one. Each change is saved to a new array rather than the old one. After each Stop, the updated spins are saved and a new time step can be started.

1. Assign a c value drawn from the given distribution
2. Choose a random value of i ∈ [0, L]
3. max = i + cL
4. s_i is the i-th spin
• if s_{i−1} = s_{i+1}:
– s′_i = s_{i+1} = s_{i−1}
• otherwise:
– Flip s_i with probability W0
5. if i < max:
• i = i + 1
• Go to step 4
6. Stop
5. General Purpose Graphic Processing Units
5.1. History of General Purpose GPUs
Traditionally, the GPU in a desktop computer is a highly specialized electronic circuit designed to robustly handle 2D and 3D graphics. In 1992 Silicon Graphics released the OpenGL library, meant as a standardised, platform-independent interface for writing 3D graphics. By the mid 1990s, an increasing demand for 3D applications had appeared in the consumer market.
It was NVIDIA who developed the GeForce 256 and branded it as "the world's first GPU"³. The GeForce 256, although one of many graphical accelerators, was the one that marked a very rapid advance in the field, incorporating features such as transform and lighting computations directly on the graphics processor. The release of GPUs with programmable pipelines attracted researchers to explore the possibility of using graphics processors outside their original use scheme. Although the early GPUs of the 2000s were programmable only in a way that enabled pixel manipulation, researchers noticed that these manipulations could actually represent any kind of operation, and pixels could virtually represent any kind of data.
In late 2006 NVIDIA revealed the GeForce 8800 GTX, the first GPU built on the CUDA architecture. The CUDA architecture enables the programmer to use every arithmetic logic unit⁴ on the GPU (as opposed to the early days of GPGPU, when access to the ALUs was granted only via the restricted and complicated interfaces of OpenGL and DirectX). The new family of GPUs, starting with the 8800 GTX, was built with IEEE-compliant ALUs capable of single-precision floating-point arithmetic. Moreover, the new ALUs were not only equipped with an extended set of instructions useful in general purpose computing but also enabled arbitrary read and write operations to device memory.
A few months after the launch of the 8800 GTX, NVIDIA published a compiler that took standard C extended with some additional keywords and transformed it into fully featured GPU code capable of general purpose processing. It is important to stress that currently used CUDA C is by far easier to use than OpenGL/DirectX. Programmers do not have to disguise their data as graphics and can use industry-standard C or even other languages like C#, Java or Python (via appropriate bindings).
CUDA is now used in various fields of science, ranging from medical imaging and fluid dynamics to environmental science, offering enormous, several-orders-of-magnitude speedups⁵. GPUs are faster than CPUs not only in terms of data computed per unit time (e.g. FLOPS⁶) but also in terms of power and cost efficiency.

³ http://www.nvidia.com/page/geforce256.html
⁴ ALU - Arithmetic Logic Unit
⁵ http://www.nvidia.com/object/cuda-apps-flash-new-changed.html
5.2. CUDA Architecture
The underlying architecture of CUDA is driven by design decisions connected with the GPU's primary purpose, that is, graphics processing. Graphics processing is usually a highly parallel process; therefore, the GPU also works in a parallel fashion. An important distinction can be made between the logical and physical layers of the GPU architecture.
The programmer decomposes a computational problem into atomic processes (threads) that can be executed simultaneously. This partition usually results in the creation of hundreds, thousands or even millions of threads. For the programmer's convenience, threads can be organized into blocks, which in turn are part of grids. Both blocks and grids are 3-dimensional structures. These spatial dimensions are introduced for easier problem decomposition. As mentioned before, the GPU is meant for graphics processing, which is usually related to processing 2D or 3D sets of data.
This grouping is associated not only with the logical decomposition of problems but also with the physical structure of the GPU. The basic unit of execution on a GPU is the warp. A warp consists of 32 threads, all belonging to the same block. If the block is bigger than the warp size, its threads are divided between several warps. The warps are executed on execution units called Streaming Multiprocessors (SMs). Each SM executes several warps (not necessarily from the same block). Physically, each SM consists of 8 streaming processors (SPs, CUDA cores) and 32 "basic" ALUs. The 8 SPs spend 4 clock cycles executing the same processor instruction, enabling the 32 threads in a warp to execute in parallel. Each of the threads in a warp can (and usually does) have different data supplied to it, forming what is known as a SIMD⁷ architecture.

⁶ FLOPS - Floating Point Operations Per Second
⁷ SIMD - Single Instruction, Multiple Data
Figure 5.1. Grid of thread blocks. Source: http://docs.nvidia.com/cuda/cuda-c-programming-guide/
CUDA also provides a rich memory hierarchy available to every thread. Each of the memory spaces has its own characteristics. The fastest and smallest memory is the per-thread local, register-based memory; it is managed automatically and is out of the CUDA programmer's direct reach. Each thread in a block can make use of shared memory. This memory can be accessed by different threads in the block and is usually the main medium of inter-thread communication. The slowest memory spaces (but available to every thread) are called global, constant and texture memory, respectively; each of them has a different size and purpose, but they are all persistent across kernel launches by the same application.
Figure 5.2. CUDA memory hierarchy. Source: http://docs.nvidia.com/cuda/cuda-c-programming-guide/
6. CPU Simulations
6.1. Sequential algorithm
The baseline for the presented algorithms is the sequential, CPU-based code. The simulation itself is executed by the algorithm presented in Listing 1.
Listing 1. Sequential algorithm for CPU
while (monte_carlo_steps < MAX_MCS) {
    if (is_lattice_updated == FALSE && BOND_DENSITY(LATTICE) == 0.0) {
        // If lattice is in ferromagnetic state, simulation can stop
        break;
    }
    float C = TRIANGLE_DISTRIBUTION(MIU, SIGMA);
    first_i = (int)(LATTICE_SIZE * randomUniform());
    last_i = (int)(first_i + (C * LATTICE_SIZE));
    is_lattice_updated = FALSE; // reset the update flag for this time step
    for (int i = 0; i < LATTICE_SIZE; i++) {
        NEXT_STEP_LATTICE[i] = LATTICE[i];
        if ((first_i <= i && i <= last_i)
            || (last_i >= LATTICE_SIZE && (last_i % LATTICE_SIZE >= i || i >= first_i))
        ) {
            int left = MOD((i-1), LATTICE_SIZE);
            int right = MOD((i+1), LATTICE_SIZE);
            // Neighbours are the same and different than the current spin
            if (LATTICE[left] == LATTICE[right]) {
                NEXT_STEP_LATTICE[i] = LATTICE[left];
            }
            // Otherwise randomly flip the spin
            else if (W0 > randomUniform()) {
                NEXT_STEP_LATTICE[i] = FLIP_SPIN(LATTICE[i]);
            }
            lattice_update_counter++;
        }
        if (LATTICE[i] != NEXT_STEP_LATTICE[i]) {
            is_lattice_updated = TRUE;
        }
    }
    monte_carlo_steps = (int)(lattice_update_counter / LATTICE_SIZE);
    SWAP(LATTICE, NEXT_STEP_LATTICE, SWAP);
}
The code runs the simulation with the initial conditions MAX_MCS, LATTICE_SIZE and LATTICE. LATTICE is set to be an array initialized in the antiferromagnetic state (which can be represented by a sequence of alternating ones and zeroes). To explore the solution space (the combinations of W0, MIU and SIGMA), each simulation is run one after another.
C/C++'s % operator is in fact the remainder of the division and not the modulo operator in the mathematical sense. The most prominent difference is that -1 % LATTICE_SIZE == -1, whereas MOD((-1), LATTICE_SIZE) == LATTICE_SIZE-1. Therefore, while accessing the current spin's neighbours, the MOD(x,N) macro is used (Listing 2).
Listing 2. Modulo function-like macro
#define MOD(x, N) (((x < 0) ? ((x % N) + N) : x) % N)
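A short host-side illustration of why MOD is needed for neighbour lookups (the helper names are hypothetical, not from the thesis):

```c
/* C's % is a remainder, so (0 - 1) % N == -1, an invalid array index.
 * MOD (as in Listing 2) maps every integer into [0, N), which makes the
 * periodic left/right neighbour lookups safe at the chain edges. */
#define MOD(x, N) (((x) < 0 ? (((x) % (N)) + (N)) : (x)) % (N))

int left_neighbour(int i, int N)  { return MOD(i - 1, N); }
int right_neighbour(int i, int N) { return MOD(i + 1, N); }
```

For a chain of length 10, the left neighbour of spin 0 is spin 9 and the right neighbour of spin 9 wraps back to spin 0.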
6.2. Random number generation on CPU
The CPU code uses a GSL⁸-based Mersenne Twister⁹. Usage of the GSL-supplied MT is shown in Listing 3.
Listing 3. GSL's Mersenne Twister setup
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
// ...
const gsl_rng_type * T;
gsl_rng * r;
// ...
double randomUniform() {
    return gsl_rng_uniform(r);
}
// ...
int main(int argc, char *argv[]) {
    gsl_rng_env_setup();
    T = gsl_rng_mt19937;
    r = gsl_rng_alloc(T);
    long seed = time(NULL) * getpid();
    gsl_rng_set(r, seed);
    // simulation
    // randomUniform() calls
}
6.3. CPU performance
The CPU tests were conducted on a quad-core AMD Phenom(tm) II X4 945 processor with 4 GB of RAM. Simulations occupied only one core at a time. The results presented in Figure 6.1 will be used as a baseline for further comparisons (with respective MAX_MCS values).

⁸ GSL - GNU Scientific Library, http://www.gnu.org/software/gsl/
⁹ http://www.gnu.org/software/gsl/manual/html_node/Random-number-generator-algorithms.html

Figure 6.1. Execution times of CPU simulations with MAX_MCS equal to 1 000 and 10 000. Markers denote the arithmetic mean of the 5 averages conducted. The fitted curves are 4th-degree polynomials.
7. GPU Simulations - thread per simulation
7.1. Thread per simulation
CUDA provides a C/C++-like language for executing code on the GPU (CUDA C). The code is compiled by the CUDA compiler which, via specific language extensions (e.g. __device__, __host__), can distinguish the parts to be executed by the CPU (host), the GPU (device) or both (__global__).
Listing 4. Thread per simulation algorithm
while (monte_carlo_steps < MAX_MCS) {
    if (is_lattice_updated == FALSE && BOND_DENSITY(LATTICE) == 0.0) {
        // stop when lattice is in ferromagnetic state
        break;
    }
    float C = TRIANGLE_DISTRIBUTION((X / (float)MAX_X), (Y / (float)MAX_Y));
    float W0 = Z / (float)MAX_Z;
    first_i = (int)(LATTICE_SIZE * RANDOM(&state[BLOCK_ID])) + THREAD_LATTICE_INDEX;
    last_i = (int)(first_i + (C * LATTICE_SIZE)) + THREAD_LATTICE_INDEX;
    is_lattice_updated = FALSE; // reset the update flag for this time step
    for (int i = THREAD_LATTICE_INDEX; i < LATTICE_SIZE + THREAD_LATTICE_INDEX; i++) {
        NEXT_STEP_LATTICE[i] = LATTICE[i];
        if ((first_i <= i && i <= last_i)
            || (last_i >= LATTICE_SIZE + THREAD_LATTICE_INDEX
                && (last_i % (LATTICE_SIZE + THREAD_LATTICE_INDEX) >= i || i >= first_i))
        ) {
            int left = MOD((i-1), LATTICE_SIZE) + THREAD_LATTICE_INDEX;
            int right = MOD((i+1), LATTICE_SIZE) + THREAD_LATTICE_INDEX;
            // If neighbours are the same
            if (LATTICE[left] == LATTICE[right]) {
                NEXT_STEP_LATTICE[i] = LATTICE[left];
            }
            // ... otherwise randomly flip the spin
            else if (W0 > RANDOM(&state[BLOCK_ID])) {
                NEXT_STEP_LATTICE[i] = FLIP_SPIN(LATTICE[i]);
            }
            lattice_update_counter++;
        }
        if (LATTICE[i] != NEXT_STEP_LATTICE[i]) {
            is_lattice_updated = TRUE;
        }
    }
    monte_carlo_steps = (int)(lattice_update_counter / LATTICE_SIZE);
    SWAP(LATTICE, NEXT_STEP_LATTICE, SWAP);
}
7.2. Running the simulation
In order for the CUDA compiler and then the GPU to execute the code correctly, the programmer has to follow some conventions for the program structure. For instance, functions to be executed on the GPU have to be prefixed with the __global__ or __device__ keyword. Moreover, a call to a GPU function has to be made with the <<<gridDim, blockDim>>> syntax. The framework for executing code on the GPU is shown in Listing 5.
Listing 5. Exemplary foundation of GPU-executed code
// Imports
// Helper definitions etc.
__global__ void generate_kernel(
    curandStateMtgp32 *state,
    short * LATTICE,
    short * NEXT_STEP_LATTICE,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    // Code to be executed by GPU
    while (monte_carlo_steps < MAX_MCS) {
        if (is_lattice_updated == FALSE && BOND_DENSITY(LATTICE) == 0.0) {
            // stop when lattice is in ferromagnetic state
            break;
        }
        // Rest of the simulation code
    }
}
// ...
int main(int argc, char *argv[]) {
    // Initializations ...
    generate_kernel<<<gridDim, blockDim>>>(
        devMTGPStates,
        DEV_LATTICES,
        DEV_NEXT_STEP_LATTICES,
        DEV_MCS_NEEDED,
        DEV_BOND_DENSITY
    );
    // Obtaining results
    // Cleanup
}
7.3. Solution space
The important difference from the CPU version is the use of (X, Y, Z), which denote the position of the thread in the logical structure provided by the CUDA architecture. Threads can be organized inside 3D structures called blocks and indexed using a "Cartesian" combination of {x, y, z}; inside a kernel they can be referenced with threadIdx.{x,y,z}. The grid is also a 3D structure, and similarly the blocks within it can be referenced inside a kernel with blockIdx.{x,y,z}. This structuring is provided for the programmer's convenience and is related to GPUs being devices meant for 2D and 3D graphics processing, where such "Cartesian" decomposition is quite natural. Although blocks and grids are logical structures, they are associated with physical properties of GPUs. This fact can (and should, whenever possible) be used for problem decomposition in order to optimize runtime performance.
Here, (X, Y, Z) correspond to (MIU, SIGMA, W0), which are distributed over (blockIdx.x, blockIdx.y, threadIdx.x). This was done in order to keep a relatively small number of threads in the block (see subsection 7.4). By this convention each thread can calculate its own set of values of (MIU, SIGMA, W0). Listing 6 shows how a thread can map its coordinates into the initial parameters of the simulation. For instance, a thread with blockIdx == (100,100,0) will be executing simulations for MIU=1.0 and SIGMA=0.5 if MIU_SIZE=100 and SIGMA_SIZE=200.
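The coordinate-to-parameter mapping can be checked with a tiny host-side helper (a sketch; the linear scaling over [start, end] follows the X / MAX_X convention of Listing 6):

```c
/* Map a block/thread coordinate idx in [0, max_idx] linearly onto a
 * simulation parameter in [start, end], as in MIU = X / MAX_X for
 * MIU_START = 0.0 and MIU_END = 1.0. */
float param_from_index(int idx, int max_idx, float start, float end) {
    return start + (end - start) * (idx / (float)max_idx);
}
```

For blockIdx == (100,100,0) with MIU_SIZE=100 and SIGMA_SIZE=200 this reproduces the MIU=1.0 and SIGMA=0.5 values from the example above.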
Listing 6. Simulation parameters computation for each thread
#define MIU_START 0.0
#define MIU_END 1.0
#define MIU_SIZE 10
#define SIGMA_START 0.0
#define SIGMA_END 1.0
#define SIGMA_SIZE 10
// ...
#define X blockIdx.x
#define Y blockIdx.y
#define Z threadIdx.x
#define MAX_X MIU_SIZE
#define MAX_Y SIGMA_SIZE
// ...
__global__ void generate_kernel(
    curandStateMtgp32 *state,
    short * LATTICE,
    short * NEXT_STEP_LATTICE,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    // ...
    float C = TRIANGLE_DISTRIBUTION(X / (float)MAX_X, Y / (float)MAX_Y);
    // ...
}
int main(int argc, char *argv[]) {
    dim3 blockDim(W0_SIZE, 1, 1);
    dim3 gridDim(MIU_SIZE, SIGMA_SIZE, 1);
    // ...
    generate_kernel<<<gridDim, blockDim>>>(
        // ...
    )
    // ...
}
7.4. Random Number Generators
An important part of every Monte Carlo simulation is randomness. In order for the simulation to converge to the actual result, the quality of the Random Number Generator (RNG) must be high. The de facto standard for scientific MC simulations is the Mersenne Twister¹⁰ [13]. A version of MT19937 optimized for GPGPU usage¹¹ is included in CUDA as the cuRAND library¹². There are, however, some limitations of the built-in MT19937:
• 1 MTGP state per block
• Up to 256 threads per state
• Up to 200 states using the included, pre-generated sequences
MT is called with curand_uniform(state) and returns a floating point number in the range (0, 1]. The values are uniformly distributed over this range. To transform this sequence of uniformly distributed numbers into a triangular distribution, a special function-like macro can be used (Listing 7).
Listing 7. Transformation of uniform into triangular distribution
#define TRIANGLE_DISTRIBUTION(miu, sigma) ({          \
    float start = max(miu - sigma, 0.0);              \
    float end = min(miu + sigma, 1.0);                \
    float rand = (                                    \
        curand_uniform(&state[BLOCK_ID])              \
        + curand_uniform(&state[BLOCK_ID])            \
    ) / 2.0;                                          \
    ((end - start) * rand) + start;                   \
})
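The same transformation in plain host C (a sketch: rand01() is a stand-in for curand_uniform, and the clamping of the interval to [0, 1] mirrors Listing 7):

```c
#include <stdlib.h>

/* The mean of two independent uniforms is triangularly distributed on
 * [0, 1] with its peak at 1/2; rescaling maps it onto [start, end],
 * where the interval is derived from (miu, sigma) as in Listing 7. */
static double rand01(void) {
    return rand() / ((double)RAND_MAX + 1.0);
}

double triangle_sample(double miu, double sigma) {
    double start = (miu - sigma < 0.0) ? 0.0 : miu - sigma;
    double end = (miu + sigma > 1.0) ? 1.0 : miu + sigma;
    double r = (rand01() + rand01()) / 2.0;   /* triangular on [0, 1) */
    return (end - start) * r + start;
}
```

Every sample is guaranteed to fall inside [start, end], so c never leaves the admissible interval regardless of (miu, sigma).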
7.5. Thread per simulation - static memory
In the algorithm presented in Listing 4, memory usage is not optimized at all. The lattices are not only allocated in the global memory space, but each time the program is run, the host's memory has to be allocated and copied onto the device. Listing 8 shows the inefficient memory allocations that occur in the thread-per-simulation algorithm from subsection 7.1.
¹⁰ MT19937, MT
¹¹ MTGP, http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MTGP/index.html
¹² http://docs.nvidia.com/cuda/curand/device-api-overview.html

Listing 8. Dynamic allocation of memory
__global__ void generate_kernel(
    curandStateMtgp32 *state,
    short * LATTICE,
    short * NEXT_STEP_LATTICE,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
)
// ...
short * DEV_LATTICES;
short * DEV_NEXT_STEP_LATTICES;
CUDA_CALL(cudaMalloc(
    &DEV_LATTICES,
    THREADS_NEEDED * sizeof(short) * LATTICE_SIZE
));
CUDA_CALL(cudaMalloc(
    &DEV_NEXT_STEP_LATTICES,
    THREADS_NEEDED * sizeof(short) * LATTICE_SIZE
));
// ...
generate_kernel<<<grid_size, block_size>>>(
    devMTGPStates,
    DEV_LATTICES,
    DEV_NEXT_STEP_LATTICES,
    DEV_MCS_NEEDED,
    DEV_BOND_DENSITY
);
If the memory is allocated inside the kernel code, the need for time-consuming copying between host and device disappears. It is possible to statically allocate memory in the device code (Listing 9).
Listing 9. Static memory allocation inside kernel
__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    short LATTICE_1[LATTICE_SIZE];
    short LATTICE_2[LATTICE_SIZE];
    short * LATTICE = LATTICE_1;
    short * NEXT_STEP_LATTICE = LATTICE_2;
    // ...
}
7.6. Comparison of static and dynamic memory use
Although quite simple, this optimization does in fact improve the performance of the simulations. The results of the static vs dynamic memory allocation comparison are illustrated in Figure 7.1.
All of the empirical tests of the GPU code were done on a GeForce GTX 570 GPU with an Intel i7 CPU.

Figure 7.1. Speedup of static and dynamic (memory pool) memory allocations with MAX_MCS equal to 1 000 and 10 000. Markers denote the arithmetic mean of the 5 averages conducted. The fitted curves are 4th-degree polynomials.

Figure 7.2 shows the results for a range of 1 up to 6 000 concurrent simulations. The static memory approach was faster than the dynamic memory approach in every trial conducted. Moreover, as seen in Figure 7.2, static memory tends to maintain its speedup rather than lose its "velocity", as is the case for the dynamic memory approach (compare the fitted curves above 40 000 concurrent simulations).

Figure 7.2. Speedup of static and dynamic (memory pool) memory allocations with MAX_MCS equal to 1 000 and 10 000. Markers denote the arithmetic mean of the 5 averages conducted. The fitted curves are 4th-degree polynomials.
8. GPU Simulations - thread per spin
8.1. Thread per spin approach
The CUDA C Best Practices Guide¹³ encourages the use of multiple threads for optimal utilization of GPU cores. In this spirit, one can apply an approach where each spin is represented by a single thread and each simulation takes up an entire block. This idea is presented in Listing 10.
Listing 10. Thread per spin algorithm
while (monte_carlo_steps < MAX_MCS) {
    __syncthreads();
    if (threadIdx.x == 0) {
        SWAP(LATTICE, NEXT_STEP_LATTICE, SWAP);
        if (BOND_DENSITY(LATTICE) == 0.0) {
            // If lattice is ferromagnetic, simulation can stop
            monte_carlo_steps = MAX_MCS;
            break;
        }
        float C = TRIANGLE_DISTRIBUTION((X / (float)MAX_X), (Y / (float)MAX_Y));
        first_i = (int)(LATTICE_SIZE * curand_uniform(&state[BLOCK_ID]));
        last_i = (int)(first_i + (C * LATTICE_SIZE));
        monte_carlo_steps = (int)(lattice_update_counter / LATTICE_SIZE);
    }
    __syncthreads();
    NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[threadIdx.x];
    if ((first_i <= threadIdx.x && threadIdx.x <= last_i)
        || (last_i >= LATTICE_SIZE && (last_i % LATTICE_SIZE >= threadIdx.x || threadIdx.x >= first_i))
    ) {
        short left = MOD((threadIdx.x - 1), LATTICE_SIZE);
        short right = MOD((threadIdx.x + 1), LATTICE_SIZE);
        // Neighbours are the same
        if (LATTICE[left] == LATTICE[right]) {
            NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[left];
        }
        // Otherwise randomly flip the spin
        else if (W0 > curand_uniform(&state[BLOCK_ID])) {
            NEXT_STEP_LATTICE[threadIdx.x] = FLIP_SPIN(LATTICE[threadIdx.x]);
        }
        atomicAdd(&lattice_update_counter, 1);
    }
}

¹³ http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
8.2. Concurrent execution
The approach presented in Listing 10 shows some new features of CUDA, namely __syncthreads(), which can be used to synchronize the execution of threads. It ensures that all threads in a block will be executing the same instruction after passing the __syncthreads() call. In total, exactly LATTICE_SIZE*MIU_SIZE*SIGMA_SIZE*W0_SIZE threads are launched. Each block is exactly LATTICE_SIZE long (Listing 11).
Listing 11. Grid and block sizes
dim3 blockDim(LATTICE_SIZE, 1, 1);
dim3 gridDim(MIU_SIZE, SIGMA_SIZE, W0_SIZE);
Each thread in a block runs part of a single, block-wide simulation instance, and all threads execute the same code. This introduces a problem - every thread would execute the initialization code, like setting up W0, C, MIU etc. Some of these values (like C) are random; therefore, running this code multiple times would produce different results. A situation where even one spin of the simulation is evaluated according to a different W0 value is unacceptable. A correct initial setup can be obtained by having only one thread evaluate the initialization (Listing 12).
Listing 12. Initialization by a single thread
__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    // ...
    if (threadIdx.x == 0) {
        LATTICE = LATTICE_1;
        NEXT_STEP_LATTICE = LATTICE_2;
        SWAP = NULL;
        lattice_update_counter = 0;
        monte_carlo_steps = 0;
        W0 = Z / (float)MAX_Z;
    }
    __syncthreads();
    // ...
    while (monte_carlo_steps < MAX_MCS) {
        // ...
    }
}
Concurrent execution by multiple threads makes the initialization of LATTICE itself easier and faster. All of the threads update their own values. The block's threads access memory in bulk and without conflicts, which is a potential source of speedup (Listing 13).
Listing 13. LATTICE initialization
__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    // ...
    if (threadIdx.x == 0) {
        // Initialization
    }
    __syncthreads();
    // Initialize as antiferromagnetic
    NEXT_STEP_LATTICE[threadIdx.x] = threadIdx.x & 1;
    while (monte_carlo_steps < MAX_MCS) {
        // ...
    }
}
8.3. Thread communication
To ensure thread cooperation inside a simulation, block-level communication is needed.
It can be obtained by means of shared memory: a type of memory residing on-chip that is
about 100x faster than uncached global memory
(http://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/).
Shared memory is accessible to every thread in a block.
Listing 14 illustrates the definition of shared resources inside a kernel. The CUDA
compiler allocates the on-chip memory for shared variables only once (even though the
kernel is executed by every thread), so all threads in a block access the same place
in on-chip memory when accessing shared data.
Listing 14. Shared memory definitions

__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int *DEV_MCS_NEEDED,
    float *DEV_BOND_DENSITY
) {
    __shared__ unsigned short LATTICE_1[LATTICE_SIZE];
    __shared__ unsigned short LATTICE_2[LATTICE_SIZE];
    __shared__ unsigned short first_i, last_i;
    __shared__ unsigned long long int lattice_update_counter;
    __shared__ unsigned long monte_carlo_steps;
    __shared__ float W0;

    __shared__ unsigned short *LATTICE;
    __shared__ unsigned short *NEXT_STEP_LATTICE;
    __shared__ unsigned short *SWAP;

    // Initialization of LATTICE pointers, lattice_update_counter etc.

    while (monte_carlo_steps < MAX_MCS) {
        // ...
    }
}
8.4. Race conditions with shared memory
The issue of race conditions arises when multiple threads try to write to shared
memory. Writing on the GPU is usually not an atomic operation; for example, incrementing
a number actually consists of 3 different operations
(http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions):
1. Reading the value
2. Incrementing the value
3. Writing the new value
During the time required to perform these steps, other threads can interrupt the
execution. Fortunately, CUDA provides the programmer with a set of atomic*()
functions, which ensure that any number of threads requesting a read or write to
the same memory location will be served properly.
The code presented in Listing 15 shows how to perform the lattice_update_counter
incrementation so as to ensure correctness of the results.
Listing 15. Atomic add

while (monte_carlo_steps < MAX_MCS) {
    // ...
    NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[threadIdx.x];
    if ((first_i <= threadIdx.x && threadIdx.x <= last_i)
        || (last_i >= LATTICE_SIZE && (last_i % LATTICE_SIZE >= threadIdx.x
            || threadIdx.x >= first_i))
    ) {
        // ...
        atomicAdd(&lattice_update_counter, 1);
    }
}
8.5. Thread per spin approach - reduction
Reduction is the process of decreasing the number of elements. This "definition",
although vague, means that, having multiple elements of some sort, we apply some
process to reduce the number of input elements.
The code presented in Listing 16 computes the bond density of LATTICE in each iteration.
Moreover, this is done sequentially by only one thread, which can be inefficient.
Listing 16. Unoptimized iteration initialization

while (monte_carlo_steps < MAX_MCS) {
    __syncthreads();
    if (threadIdx.x == 0) {
        SWAP(LATTICE, NEXT_STEP_LATTICE, SWAP);
        if (BOND_DENSITY(LATTICE) == 0.0) {
            // If lattice is in ferromagnetic state, simulation can stop
            monte_carlo_steps = MAX_MCS;
            break;
        }
        float C = TRIANGLE_DISTRIBUTION((X / (float)MAX_X), (Y / (float)MAX_Y));
        first_i = (int)(LATTICE_SIZE * curand_uniform(&state[BLOCK_ID]));
        last_i = (int)(first_i + (C * LATTICE_SIZE));
        monte_carlo_steps = (int)(lattice_update_counter / LATTICE_SIZE);
    }
    __syncthreads();
    // ...
}
Let us recall that the bond density ρ = 0 iff (if and only if) all spins are either in
the SPIN_UP or the SPIN_DOWN position. From this simple observation we can conclude
that, if LATTICE is in the ferromagnetic state, the sum of all LATTICE elements
(represented in code by means of 0 and 1) is equal to 0 (all elements being zeroes) or
LATTICE_SIZE (all elements being ones). Moreover, we can make use of multiple GPU
threads and reuse the existing NEXT_STEP_LATTICE array, as it is not needed between
iterations. In the algorithm presented in Listing 17 the sum of the LATTICE elements is
calculated in log2(L) steps. In the case of L = 64 it takes 6 iterations, after which
the summation result is stored in NEXT_STEP_LATTICE[0].
Listing 17. Parallel reduction

for (int i = LATTICE_SIZE / 2; i != 0; i /= 2) {
    if (threadIdx.x < i) {
        NEXT_STEP_LATTICE[threadIdx.x] += NEXT_STEP_LATTICE[threadIdx.x + i];
    }
    __syncthreads();
}
The approach from Listing 17 can even be extended to the process of calculating
the actual BOND_DENSITY(LATTICE). This method (again, using NEXT_STEP_LATTICE as an
auxiliary array) is presented in Listing 18.
Listing 18. Parallel reduction to calculate BOND_DENSITY(LATTICE)

__syncthreads();
NEXT_STEP_LATTICE[threadIdx.x] = 2 * abs(LATTICE[threadIdx.x]
    - LATTICE[(threadIdx.x + 1) % LATTICE_SIZE]);
__syncthreads();
for (int i = LATTICE_SIZE / 2; i > 0; i /= 2) {
    if (threadIdx.x < i) {
        // Use NEXT_STEP_LATTICE as cache array
        NEXT_STEP_LATTICE[threadIdx.x] += NEXT_STEP_LATTICE[threadIdx.x + i];
    }
    __syncthreads();
}
8.6. Thread per spin approach - flags
A possible way of optimizing the performance is to avoid calculating the bond density
during each execution of the while loop altogether. If, during an update iteration,
none of the spins were updated, then LATTICE was simply copied unchanged into
NEXT_STEP_LATTICE. One can suspect that this behavior is caused by the lattice being
in a stationary state. If the stationary state in question is one of the ferromagnetic
states, the simulation can be stopped.
Listing 19 introduces a new variable, lattice_update_counter_iter, which holds the
information about how many spins were actually changed during a simulation iteration.
If a change did occur, then BOND_DENSITY(LATTICE) will not be executed at all: the
condition lattice_update_counter_iter == 0 is evaluated first and, since && is
short-circuited in C, if it is not satisfied the part after && will not be reached.
If, however, no change occurred (lattice_update_counter_iter == 0) and the lattice is
in a ferromagnetic state (BOND_DENSITY(LATTICE) == 0.0), the simulation should stop.
Unfortunately, break; applies only to the thread with threadIdx.x == 0. In order to
have the other threads stop their work, we can reuse the check already performed by
each thread before starting the actual work, that is, set monte_carlo_steps = MAX_MCS.
In this way we prevent the other threads from further execution (and from potentially
interfering with the results).
Listing 19. Thread per spin with flags

__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int *DEV_MCS_NEEDED,
    float *DEV_BOND_DENSITY
) {
    // Shared memory definitions
    __shared__ unsigned int lattice_update_counter_iter;
    // Simulation initialization
    if (threadIdx.x == 0) {
        // ...
        lattice_update_counter = 0;
    }
    __syncthreads();
    // Initialize as antiferromagnetic
    NEXT_STEP_LATTICE[threadIdx.x] = threadIdx.x & 1;
    while (monte_carlo_steps < MAX_MCS) {
        __syncthreads();
        if (threadIdx.x == 0) {
            // Iteration initialization
            if (lattice_update_counter_iter == 0
                && BOND_DENSITY(LATTICE) == 0.0) {
                // If ferromagnetic, simulation can stop
                monte_carlo_steps = MAX_MCS;
                break;
            }
            lattice_update_counter_iter = 0;
        }
        __syncthreads();
        NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[threadIdx.x];
        if ((first_i <= threadIdx.x && threadIdx.x <= last_i)
            || (last_i >= LATTICE_SIZE && (last_i % LATTICE_SIZE >= threadIdx.x
                || threadIdx.x >= first_i))
        ) {
            // Iteration update
        }
        if (NEXT_STEP_LATTICE[threadIdx.x] != LATTICE[threadIdx.x]) {
            atomicAdd(&lattice_update_counter_iter, 1);
        }
    }
}
8.7. Thread-per-spin performance
As seen in Figure 8.1, each improvement over the basic thread-per-spin method introduces
some kind of speedup. Noteworthy is the performance gap between the use of reduction
and of flags. Apparently, using a flag to avoid the per-iteration calculation of
BOND_DENSITY(LATTICE) is significantly faster than the version equipped with the
highly optimized BOND_DENSITY(LATTICE) algorithm.
Figure 8.1. Speedup of static and dynamic (memory pool) memory allocations with MAX_MCS
equal to 1 000 and 10 000. Markers denote the arithmetic mean of the 5 averages
conducted. The fitted curves are 4th-degree polynomials.
8.8. Thread-per-spin vs thread-per-simulation performance
Tests presented in Figure 8.2 and Figure 8.3 show the comparison between the suggested
approaches. The unoptimized thread-per-spin approach turns out to be faster than
thread-per-simulation in every test under 20 000 concurrent simulations. Threads on the
GPU do not have as powerful a processor at their disposal as those run on the CPU. This
leads to the conclusion that most of the tasks conducted on a GPU should be split onto
separate threads to parallelize the execution, even at the expense of increased
communication time. However, above the 20 000-simulation threshold, the overhead
introduced by the huge number of threads and RNG instances causes thread-per-spin to
perform worse than the thread-per-simulation approach.
Figure 8.2. Execution times of thread-per-spin and thread-per-simulation simulations with
MAX_MCS equal to 1 000. Markers denote the arithmetic mean of the 5 averages
conducted. The fitted curves are 4th-degree polynomials.
Figure 8.3. Execution times of thread-per-spin and thread-per-simulation simulations with
MAX_MCS equal to 10 000. Markers denote the arithmetic mean of the 5 averages
conducted. The fitted curves are 4th-degree polynomials.
As in the case of execution times, the speedups of thread-per-simulation are greater
than those of the thread-per-spin approach (Figure 8.4 and Figure 8.5). For massive
numbers of concurrent threads, thread-per-spin simulations perform relatively well,
gaining speedups of about 8-9x. Thread-per-simulation, on the other hand, shows an
impressive speedup of up to 28x. For low numbers of threads, below the 20 000 threshold,
the thread-per-spin approach shows the better speedup. However, for bigger simulations
(25 000 and more) thread-per-simulation shows more promising results.
Figure 8.4. Speedups of thread-per-spin and thread-per-simulation simulations with
MAX_MCS equal to 1 000. Markers denote the arithmetic mean of the 5 averages
conducted. The fitted curves are 4th-degree polynomials.
Figure 8.5. Speedups of thread-per-spin and thread-per-simulation simulations with
MAX_MCS equal to 10 000. Markers denote the arithmetic mean of the 5 averages
conducted. The fitted curves are 4th-degree polynomials.
9. Bond density for some W0 values
The calculations made during this project helped in developing some insight into how
the triangular distribution can affect the phase transition. Some exemplary bond
density (ρ) plots are presented in Figure 9.1 and Figure 9.2.
Figure 9.1. Bond density after 10^6 MCS, W0 = 0.9, MIU = [0, 0.25, 0.5, . . . , 1] and
SIGMA = [0, 0.25, 0.5, . . . , 1]
Figure 9.2. Bond density after 10^6 MCS, W0 = 0.6, MIU = [0, 0.25, 0.5, . . . , 1] and
SIGMA = [0, 0.25, 0.5, . . . , 1]
10. Conclusions
CUDA does in fact expose an easy-to-use environment for harnessing the power of
present-day GPGPUs. The realization of this project helped in speeding up complex
and time-consuming calculations that take days on high-end CPUs, and in getting
to know the CUDA compiler and its most useful libraries.
Another important (although not previously mentioned) element of this study was the
usage of scripting languages. Technologies such as Python (http://www.python.org/)
enable easy work distribution across GPGPU workstations, harvesting the results,
processing the data (http://www.numpy.org/) and plotting (http://matplotlib.org/) the
results for easy pattern recognition and presentation. Unfortunately, the GPU
architecture requires the programmer to really know the underlying hardware and
various programming techniques in order to obtain optimal performance.
11. Future work
In the future, the developed CUDA program could be used to drive a fully featured
study of the physical phenomena described in section 4. In order to do that, more
detailed data has to be gathered, with improved data resolution and a higher number
of averages.
References
[1] C. Coulon, et al. Glauber dynamics in a single-chain magnet: From theory to real
systems, Phys. Rev. B 69 (2004)
[2] L. Bogani, et al. Single chain magnets: where to from here?, J. Mater. Chem., 18,
(2008)
[3] H. Miyasaka, et al. Slow Dynamics of the Magnetization in One-Dimensional
Coordination Polymers: Single-Chain Magnets, Inorg. Chem., 48, (2009)
[4] R.O. Kuzian, et al. Ca2Y2Cu5O10: the first frustrated quasi-1D ferromagnet close
to criticality, Phys. Rev. Lett., 109, (2012)
[5] K. Sznajd-Weron and S. Krupa. Inflow versus outflow zero-temperature dynamics
in one dimension, Phys. Rev. E 74, 031109 (2006)
[6] F. Radicchi, D. Vilone, and H. Meyer-Ortmanns. Phase Transition between Syn-
chronous and Asynchronous Updating Algorithms, J. Stat. Phys. 129, 593 (2007)
[7] B. Skorupa, K. Sznajd-Weron, and R. Topolnicki. Phase diagram for a zero-
temperature Glauber dynamics under partially synchronous updates, Phys. Rev. E
86, 051113 (2012)
[8] I. G. Yi and B. J. Kim. Phase transition in a one-dimensional Ising ferromagnet
at zero temperature using Glauber dynamics with a synchronous updating mode,
Phys. Rev. E 83, 033101 (2011)
[9] M. Evans, N. Hastings, B. Peacock. Statistical Distributions, 3rd ed. New York:
Wiley, pp. 187-188, (2000)
[10] E. Ising. Beitrag zur Theorie des Ferromagnetismus, Z. Phys. 31: 253-258, (1925)
[11] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller. Equation
of State Calculations by Fast Computing Machines, Journal of Chemical Physics
21 (6): 1087-1092, (1953)
[12] W. Lenz. "Beiträge zum Verständnis der magnetischen Eigenschaften in festen
Körpern", Physikalische Zeitschrift 21: 613-615, (1920)
[13] M. Matsumoto and T. Nishimura. "Mersenne Twister: A 623-dimensionally equidis-
tributed uniform pseudorandom number generator", ACM Trans. on Modeling and
Computer Simulation Vol. 8, No. 1, January, pp. 3-30 (1998)
37

Mais conteúdo relacionado

Mais procurados

Tesis de posicionamiento
Tesis de posicionamientoTesis de posicionamiento
Tesis de posicionamientojosesocola27
 
GPU-Accelerated Parallel Computing
GPU-Accelerated Parallel ComputingGPU-Accelerated Parallel Computing
GPU-Accelerated Parallel ComputingJun Young Park
 
MSc_Thesis_Wake_Dynamics_Study_of_an_H-type_Vertical_Axis_Wind_Turbine
MSc_Thesis_Wake_Dynamics_Study_of_an_H-type_Vertical_Axis_Wind_TurbineMSc_Thesis_Wake_Dynamics_Study_of_an_H-type_Vertical_Axis_Wind_Turbine
MSc_Thesis_Wake_Dynamics_Study_of_an_H-type_Vertical_Axis_Wind_TurbineChenguang He
 
Quantum computation with superconductors
Quantum computation with superconductorsQuantum computation with superconductors
Quantum computation with superconductorsGabriel O'Brien
 
QGATE 0.3: QUANTUM CIRCUIT SIMULATOR
QGATE 0.3: QUANTUM CIRCUIT SIMULATORQGATE 0.3: QUANTUM CIRCUIT SIMULATOR
QGATE 0.3: QUANTUM CIRCUIT SIMULATORNVIDIA Japan
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaRob Gillen
 
Zurich machine learning_vicensgaitan
Zurich machine learning_vicensgaitanZurich machine learning_vicensgaitan
Zurich machine learning_vicensgaitanVicens Alcalde
 
PFM - Pablo Garcia Auñon
PFM - Pablo Garcia AuñonPFM - Pablo Garcia Auñon
PFM - Pablo Garcia AuñonPablo Garcia Au
 
PHM2106-Presentation-Hubbard
PHM2106-Presentation-HubbardPHM2106-Presentation-Hubbard
PHM2106-Presentation-HubbardCharles Hubbard
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 

Mais procurados (19)

Tesis de posicionamiento
Tesis de posicionamientoTesis de posicionamiento
Tesis de posicionamiento
 
Pid
PidPid
Pid
 
bachelors-thesis
bachelors-thesisbachelors-thesis
bachelors-thesis
 
GPU-Accelerated Parallel Computing
GPU-Accelerated Parallel ComputingGPU-Accelerated Parallel Computing
GPU-Accelerated Parallel Computing
 
MSc_Thesis_Wake_Dynamics_Study_of_an_H-type_Vertical_Axis_Wind_Turbine
MSc_Thesis_Wake_Dynamics_Study_of_an_H-type_Vertical_Axis_Wind_TurbineMSc_Thesis_Wake_Dynamics_Study_of_an_H-type_Vertical_Axis_Wind_Turbine
MSc_Thesis_Wake_Dynamics_Study_of_an_H-type_Vertical_Axis_Wind_Turbine
 
Quantum computation with superconductors
Quantum computation with superconductorsQuantum computation with superconductors
Quantum computation with superconductors
 
Alinia_MSc_S2016
Alinia_MSc_S2016Alinia_MSc_S2016
Alinia_MSc_S2016
 
QGATE 0.3: QUANTUM CIRCUIT SIMULATOR
QGATE 0.3: QUANTUM CIRCUIT SIMULATORQGATE 0.3: QUANTUM CIRCUIT SIMULATOR
QGATE 0.3: QUANTUM CIRCUIT SIMULATOR
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
 
thesis
thesisthesis
thesis
 
Ee380 labmanual
Ee380 labmanualEe380 labmanual
Ee380 labmanual
 
Zurich machine learning_vicensgaitan
Zurich machine learning_vicensgaitanZurich machine learning_vicensgaitan
Zurich machine learning_vicensgaitan
 
Semester_Sebastien
Semester_SebastienSemester_Sebastien
Semester_Sebastien
 
PFM - Pablo Garcia Auñon
PFM - Pablo Garcia AuñonPFM - Pablo Garcia Auñon
PFM - Pablo Garcia Auñon
 
Cuda tutorial
Cuda tutorialCuda tutorial
Cuda tutorial
 
PHM2106-Presentation-Hubbard
PHM2106-Presentation-HubbardPHM2106-Presentation-Hubbard
PHM2106-Presentation-Hubbard
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Project report. fin
Project report. finProject report. fin
Project report. fin
 
Complete (2)
Complete (2)Complete (2)
Complete (2)
 

Semelhante a Final Thesis

An_expected_improvement_criterion_for_the_global_optimization_of_a_noisy_comp...
An_expected_improvement_criterion_for_the_global_optimization_of_a_noisy_comp...An_expected_improvement_criterion_for_the_global_optimization_of_a_noisy_comp...
An_expected_improvement_criterion_for_the_global_optimization_of_a_noisy_comp...Kanika Anand
 
AERO390Report_Xiang
AERO390Report_XiangAERO390Report_Xiang
AERO390Report_XiangXIANG Gao
 
MSc Thesis - Jaguar Land Rover
MSc Thesis - Jaguar Land RoverMSc Thesis - Jaguar Land Rover
MSc Thesis - Jaguar Land RoverAkshat Srivastava
 
Power System Stabilizer (PSS) for generator
Power System Stabilizer (PSS) for generatorPower System Stabilizer (PSS) for generator
Power System Stabilizer (PSS) for generatorKARAN TRIPATHI
 
KINEMATICS, TRAJECTORY PLANNING AND DYNAMICS OF A PUMA 560 - Mazzali A., Patr...
KINEMATICS, TRAJECTORY PLANNING AND DYNAMICS OF A PUMA 560 - Mazzali A., Patr...KINEMATICS, TRAJECTORY PLANNING AND DYNAMICS OF A PUMA 560 - Mazzali A., Patr...
KINEMATICS, TRAJECTORY PLANNING AND DYNAMICS OF A PUMA 560 - Mazzali A., Patr...AlessandroMazzali
 
bachelors_thesis_stephensen1987
bachelors_thesis_stephensen1987bachelors_thesis_stephensen1987
bachelors_thesis_stephensen1987Hans Jacob Teglbj
 
Thesis_Eddie_Zisser_final_submission
Thesis_Eddie_Zisser_final_submissionThesis_Eddie_Zisser_final_submission
Thesis_Eddie_Zisser_final_submissionEddie Zisser
 
APPLIED MACHINE LEARNING
APPLIED MACHINE LEARNINGAPPLIED MACHINE LEARNING
APPLIED MACHINE LEARNINGRevanth Kumar
 
PaulHarrisonThesis_final
PaulHarrisonThesis_finalPaulHarrisonThesis_final
PaulHarrisonThesis_finalPaul Harrison
 
TG_PhDThesis_PossPOW_final
TG_PhDThesis_PossPOW_finalTG_PhDThesis_PossPOW_final
TG_PhDThesis_PossPOW_finalTuhfe Göçmen
 
• Sensorless speed and position estimation of a PMSM (Master´s Thesis)
•	Sensorless speed and position estimation of a PMSM (Master´s Thesis)•	Sensorless speed and position estimation of a PMSM (Master´s Thesis)
• Sensorless speed and position estimation of a PMSM (Master´s Thesis)Cesar Hernaez Ojeda
 
Energy notes
Energy notesEnergy notes
Energy notesProf EEE
 

Semelhante a Final Thesis (20)

MScThesis1
MScThesis1MScThesis1
MScThesis1
 
Hoifodt
HoifodtHoifodt
Hoifodt
 
An_expected_improvement_criterion_for_the_global_optimization_of_a_noisy_comp...
An_expected_improvement_criterion_for_the_global_optimization_of_a_noisy_comp...An_expected_improvement_criterion_for_the_global_optimization_of_a_noisy_comp...
An_expected_improvement_criterion_for_the_global_optimization_of_a_noisy_comp...
 
Jung.Rapport
Jung.RapportJung.Rapport
Jung.Rapport
 
Manual Gstat
Manual GstatManual Gstat
Manual Gstat
 
AERO390Report_Xiang
AERO390Report_XiangAERO390Report_Xiang
AERO390Report_Xiang
 
MSc Thesis - Jaguar Land Rover
MSc Thesis - Jaguar Land RoverMSc Thesis - Jaguar Land Rover
MSc Thesis - Jaguar Land Rover
 
Power System Stabilizer (PSS) for generator
Power System Stabilizer (PSS) for generatorPower System Stabilizer (PSS) for generator
Power System Stabilizer (PSS) for generator
 
KINEMATICS, TRAJECTORY PLANNING AND DYNAMICS OF A PUMA 560 - Mazzali A., Patr...
KINEMATICS, TRAJECTORY PLANNING AND DYNAMICS OF A PUMA 560 - Mazzali A., Patr...KINEMATICS, TRAJECTORY PLANNING AND DYNAMICS OF A PUMA 560 - Mazzali A., Patr...
KINEMATICS, TRAJECTORY PLANNING AND DYNAMICS OF A PUMA 560 - Mazzali A., Patr...
 
bachelors_thesis_stephensen1987
bachelors_thesis_stephensen1987bachelors_thesis_stephensen1987
bachelors_thesis_stephensen1987
 
Thesis_Eddie_Zisser_final_submission
Thesis_Eddie_Zisser_final_submissionThesis_Eddie_Zisser_final_submission
Thesis_Eddie_Zisser_final_submission
 
APPLIED MACHINE LEARNING
APPLIED MACHINE LEARNINGAPPLIED MACHINE LEARNING
APPLIED MACHINE LEARNING
 
thesis
thesisthesis
thesis
 
PaulHarrisonThesis_final
PaulHarrisonThesis_finalPaulHarrisonThesis_final
PaulHarrisonThesis_final
 
thesis
thesisthesis
thesis
 
cuTau Leaping
cuTau LeapingcuTau Leaping
cuTau Leaping
 
TG_PhDThesis_PossPOW_final
TG_PhDThesis_PossPOW_finalTG_PhDThesis_PossPOW_final
TG_PhDThesis_PossPOW_final
 
• Sensorless speed and position estimation of a PMSM (Master´s Thesis)
•	Sensorless speed and position estimation of a PMSM (Master´s Thesis)•	Sensorless speed and position estimation of a PMSM (Master´s Thesis)
• Sensorless speed and position estimation of a PMSM (Master´s Thesis)
 
P10 project
P10 projectP10 project
P10 project
 
Energy notes
Energy notesEnergy notes
Energy notes
 

Último

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 

Último (20)

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 

Final Thesis

  • 1. POLITECHNIKA WROC£AWSKA WYDZIA£ INFORMATYKI I ZARZ•DZANIA GPGPU driven simulations of zero-temperature 1D Ising model with Glauber dynamics Daniel Kosalla FINAL THESIS under supervision of Dr inø. Dariusz Konieczny Wroc≥aw 2013
  • 3. Contents 1. Motivation 5 2. Target 5 3. Scope of work 5 4. Theoretical background and proposed model 6 4.1. Ising model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 4.2. Historic methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 4.3. Updating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 4.4. Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 4.5. Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 4.6. Updating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 4.7. Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 5. General Purpose Graphic Processing Units 10 5.1. History of General Purpose GPUs . . . . . . . . . . . . . . . . . . . . . . 10 5.2. CUDA Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 6. CPU Simulations 14 6.1. Sequential algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 6.2. Random number generation on CPU . . . . . . . . . . . . . . . . . . . . 15 6.3. CPU performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 7. GPU Simulations - thread per simulation 17 7.1. Thread per simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 7.2. Running the simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 7.3. Solution space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 7.4. Random Number Generators . . . . . . . . . . . . . . . . . . . . . . . . . 20 7.5. Thread per simulation - static memory . . . . . . . . . . . . . . . . . . . 20 7.6. Comparison of static and dynamic memory use . . . . . . . . . . . . . . . 21 8. GPU Simulations - thread per spin 24 8.1. Thread per spin approach . . . . . . . . . . . . . . . . . . . . . . . . . . 24 8.2. Concurrent execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 iii
   8.3. Thread communication
   8.4. Race conditions with shared memory
   8.5. Thread per spin approach - reduction
   8.6. Thread per spin approach - flags
   8.7. Thread-per-spin performance
   8.8. Thread-per-spin vs thread-per-simulation performance
9. Bond density for some W0 values
10. Conclusions
11. Future work
Appendix
A. Sequential algorithm - CPU
B. Thread per simulation - no optimizations
C. Thread per simulation - static memory
D. Thread per spin - no optimizations
E. Thread per spin - parallel reduction
F. Thread per spin - update flag
1. Motivation

In the presence of recent developments in SCM (Single Chain Magnets) [1–4], the issue of criticality in 1D Ising-like magnet chains has turned out to be a promising field of study [5–8]. Some practical applications have already been suggested [2]. Unfortunately, the details of the general mechanism driving these changes in the real world are yet to be discovered. Traditionally, Monte Carlo simulations of the Ising model were conducted on CPUs¹. However, with the advent of powerful GPGPUs², a new trend in scientific computing has started, enabling more detailed and faster calculations.

2. Target

The following document describes the developed GPGPU applications capable of producing insights into the underlying physical problem, an examination of different approaches to conducting Monte Carlo simulations on a GPGPU, and a comparison between the developed parallel GPGPU algorithms and the sequential CPU-based approach.

3. Scope of work

The scope of this document includes the development of 5 parallel GPGPU algorithms, namely:

• Thread-per-simulation algorithm
• Thread-per-simulation algorithm with static memory
• Thread-per-spin algorithm
• Thread-per-spin algorithm with flags
• Thread-per-spin algorithm with reduction

¹ CPU - Central Processing Unit
² GPGPU - General Purpose Graphics Processing Unit
4. Theoretical background and proposed model

4.1. Ising model

Although initially proposed by Wilhelm Lenz, it was Ernst Ising [10] who developed a mathematical model for ferromagnetic phenomena. The Ising model is usually represented by means of a lattice of spins - discrete variables {−1, 1} representing the magnetic dipole moments of molecules in the material. The spins interact with their neighbours, which may cause a phase transition of the whole lattice.

4.2. Historic methods

A Monte Carlo (MC) simulation of the Ising model consists of a sequence of lattice updates. Traditionally, all spins (synchronous updating) or a single spin (sequential updating) are updated in each iteration, producing the lattice state for future iterations. The update methods are based on the so-called dynamics that describe the spin interactions.

4.3. Updating

The idea of a partially synchronous updating scheme has been suggested [5–7]. This c-synchronous mode has a fixed parameter describing the fraction of spins updated in one time step. However, one can imagine that the number of updated spins/molecules (often referred to as cL, where L denotes the size of the chain and c ∈ (0, 1]) changes as the simulation progresses. If so, then it is either linked to some characteristics of the system or may be expressed with some probability distribution (described in subsection 4.5). This approach of changing the c parameter can be applied while choosing spins randomly as well as in clusters (subsection 4.6), but only the latter will be considered in this document.

4.4. Simulations

In the proposed model, cL-sequential updating is used, with c drawn from the provided distribution. The considered environment consists of a one-dimensional array of L spins s_i = ±1. The index of each spin is denoted by i = 1, 2, . . . , L. Periodic boundary conditions are assumed, i.e. s_{L+1} = s_1. It has been shown in [8] that the system under synchronous Glauber dynamics reaches one of two absorbing states - ferromagnetic or antiferromagnetic.
Therefore, let us introduce the density of bonds (ρ) as an order parameter:

\rho = \frac{\sum_{i=1}^{L} \left(1 - s_i s_{i+1}\right)}{2L} \qquad (4.1)
As stated in [8], phase transitions in synchronous updating modes and in c-sequential updating [7] ought to be rather continuous (in cases different than c = 1 for the latter). A smooth phase transition can be observed in Figure 4.1.

Figure 4.1. The average density of active bonds in the stationary state ⟨ρ_st⟩ as a function of W0 for c = 0.9 and several lattice sizes L. [7] B. Skorupa, K. Sznajd-Weron, and R. Topolnicki. Phase diagram for a zero-temperature Glauber dynamics under partially synchronous updates, Phys. Rev. E 86, 051113 (2012)

The system is considered at low temperature (T), and therefore T = 0 can be assumed. The Metropolis algorithm can be considered a special case of zero-temperature Glauber dynamics for spin-1/2 systems. Each spin is flipped (s_i = −s_i) with rate W(δE) per unit time. For T = 0:

W(\delta E) = \begin{cases} 1 & \text{if } \delta E < 0, \\ W_0 & \text{if } \delta E = 0, \\ 0 & \text{if } \delta E > 0 \end{cases} \qquad (4.3)

In the case of T = 0, the ordering parameter W0 ∈ [0, 1] (e.g. the Glauber rate W0 = 1/2 or the Metropolis rate W0 = 1) is assumed to be constant. One can imagine that even the W0 parameter could in fact be changed during the simulation process, but that is out of the scope of the proposed model. The system starts in the fully ferromagnetic state (ρ = ρ_f = 0). After each time step, changes are applied to the system and the next time step is evaluated. After a predetermined number of time steps, the state of the system is investigated. If the chain has reached the antiferromagnetic state (ρ = ρ_af = 1), or a sufficiently large number of time steps has been inconclusive, then the whole simulation is shut down.

4.5. Distributions

During the simulation, c will not be fixed in time but will rather vary over [0, 1] according to the triangular continuous probability distribution [9] presented in Figure 4.2. While studying different initial conditions for the simulations, the distributions are adjusted in order to provide peak values in the range (0, 1). This is due to the fact that
Figure 4.2. c can take any value in the interval [0, 1] but is most likely to lie around c = 1/2. Other values are possible, but their probabilities are inversely proportional to their distance from c = 1/2.

the value of 0.5 (as presented in the plot) would mean that in each time step half of the spins get updated.

4.6. Updating

The following algorithms make use of the triangular probability distribution to assign an appropriate c value before each time step. After (on average) L updated spins, one Monte Carlo Step (MCS) can be distinguished.

4.7. Algorithm

Transformation of the above-mentioned rules into a set of instructions yields the following description (pseudocode below): update cL consecutive spins starting from a randomly chosen one. Each change is saved to a new array rather than the old one. After each Stop, the updated spins are saved and a new time step can be started.

1. Assign a c value according to the given distribution
2. Choose a random value of i ∈ [0, L]
3. max = i + cL
4. s_i is the i-th spin
   • if s_{i+1} = s_{i−1}:
     – s′_i = s_{i+1} = s_{i−1}
   • otherwise:
     – Flip s_i with probability W0
5. if i ≤ max:
   • i = i + 1
   • Go to step 4
6. Stop
5. General Purpose Graphic Processing Units

5.1. History of General Purpose GPUs

Traditionally, in a desktop computer the GPU is a highly specialized electronic circuit designed to robustly handle 2D and 3D graphics. In 1992, Silicon Graphics released the OpenGL library. OpenGL was meant as a standardised, platform-independent interface for writing 3D graphics. By the mid 1990s, an increasing demand for 3D applications appeared in the consumer market. It was NVIDIA who developed the GeForce 256 and branded it as "the world's first GPU"³. The GeForce 256, although one of many graphics accelerators, marked a very rapid advance in the field, incorporating features such as transform and lighting computations directly on the graphics processor. The release of GPUs with programmable pipelines attracted researchers to explore the possibility of using graphics processors outside their original use scheme. Although the GPUs of the early 2000s were programmable only in a way that enabled pixel manipulation, researchers noticed that these manipulations could actually represent any kind of operation, and pixels could virtually represent any kind of data. In late 2006, NVIDIA revealed the GeForce 8800 GTX, the first GPU built with the CUDA Architecture. The CUDA Architecture enables the programmer to use every arithmetic logic unit⁴ on the GPU (as opposed to the early days of GPGPU, when access to the ALUs was granted only via the restricted and complicated interfaces of OpenGL and DirectX). The new family of GPUs, started with the 8800 GTX, was built with IEEE-compliant ALUs capable of single-precision floating-point arithmetic. Moreover, the new ALUs were not only equipped with an extended set of instructions that could be used in general purpose computing, but also allowed arbitrary read and write operations to device memory.
A few months after the launch of the 8800 GTX, NVIDIA published a compiler that took standard C extended with some additional keywords and transformed it into fully featured GPU code capable of general purpose processing. It is important to stress that currently used CUDA C is by far easier to use than OpenGL/DirectX. Programmers do not have to disguise their data as graphics and can use industry-standard C or even other languages like C#, Java or Python (via appropriate bindings). CUDA is now used in various fields of science, ranging from medical imaging and fluid dynamics to environmental science and others, offering enormous, several-orders-of-magnitude speedups⁵. GPUs are not only faster than CPUs in terms of computed data

³ http://www.nvidia.com/page/geforce256.html
⁴ ALU - Arithmetic Logic Unit
⁵ http://www.nvidia.com/object/cuda-apps-flash-new-changed.html
per unit time (e.g. FLOPS⁶) but also in terms of power and cost efficiency.

5.2. CUDA Architecture

The underlying architecture of CUDA is driven by design decisions connected with the GPU's primary purpose, that is, graphics processing. Graphics processing is usually a highly parallel process. Therefore, the GPU also works in a parallel fashion. An important distinction can be made between the logical and physical layers of the GPU architecture. The programmer decomposes a computational problem into atomic processes (threads) that can be executed simultaneously. This partition usually results in the creation of hundreds, thousands or even millions of threads. For programmer convenience, threads can be organized inside blocks, which in turn are part of grids. Both blocks and grids are 3-dimensional structures. These spatial dimensions are introduced for easier problem decomposition. As mentioned before, the GPU is meant for graphics processing, which is usually related to processing 2D or 3D sets of data. This grouping is associated not only with the logical decomposition of problems, but also with the physical structure of the GPU. The basic unit of execution on the GPU is the warp. A warp consists of 32 threads. Each thread in a warp belongs to the same block. If the block is larger than the warp size, its threads are divided among several warps. The warps are executed on execution units called Streaming Multiprocessors (SMs). Each SM executes several warps (not necessarily from the same block). Physically, each SM consists of 8 streaming processors (SPs, CUDA cores) and 32 "basic" ALUs. The 8 SPs spend 4 clock cycles executing the same processor instruction, enabling the 32 threads in a warp to execute in parallel. Each of the threads in a warp can (and usually does) have different data supplied to it, forming what is known as a SIMD⁷ architecture.

⁶ FLOPS - Floating Point Operations Per Second
⁷ SIMD - Single Instruction, Multiple Data
Figure 5.1. Grid of thread blocks
http://docs.nvidia.com/cuda/cuda-c-programming-guide/

CUDA also provides a rich memory hierarchy available to every thread. Each of the memory spaces has its own characteristics. The fastest and smallest memory is the per-thread local memory. This local, register-based memory is out of reach of the CUDA programmer and is managed automatically. Each thread in a block can make use of shared memory. This memory can be accessed by different threads in the block and is usually the main medium of inter-thread communication. The slowest memory spaces (but available to every thread) are called global, constant and texture respectively; each of them has a different size and purpose, but they are all persistent across kernel launches by the same application.
Figure 5.2. CUDA memory hierarchy
http://docs.nvidia.com/cuda/cuda-c-programming-guide/
6. CPU Simulations

6.1. Sequential algorithm

The baseline for the presented algorithms is the sequential, CPU-based code. The simulation itself is executed by the algorithm presented in Listing 1.

Listing 1. Sequential algorithm for CPU

while (monte_carlo_steps < MAX_MCS) {
    if (is_lattice_updated == FALSE && BOND_DENSITY(LATTICE) == 0.0) {
        // If lattice is in ferromagnetic state, simulation can stop
        break;
    }
    float C = TRIANGLE_DISTRIBUTION(MIU, SIGMA);
    first_i = (int)(LATTICE_SIZE * randomUniform());
    last_i = (int)(first_i + (C * LATTICE_SIZE));
    is_lattice_updated = FALSE; // reset the update flag
    for (int i = 0; i < LATTICE_SIZE; i++) {
        NEXT_STEP_LATTICE[i] = LATTICE[i];
        if ((first_i <= i && i <= last_i)
            || (last_i >= LATTICE_SIZE && (last_i % LATTICE_SIZE >= i || i >= first_i))
        ) {
            int left = MOD((i-1), LATTICE_SIZE);
            int right = MOD((i+1), LATTICE_SIZE);
            // If both neighbours are the same, align with them
            if (LATTICE[left] == LATTICE[right]) {
                NEXT_STEP_LATTICE[i] = LATTICE[left];
            }
            // Otherwise randomly flip the spin
            else if (W0 > randomUniform()) {
                NEXT_STEP_LATTICE[i] = FLIP_SPIN(LATTICE[i]);
            }
            lattice_update_counter++;
        }
        if (LATTICE[i] != NEXT_STEP_LATTICE[i]) {
            is_lattice_updated = TRUE;
        }
    }
    monte_carlo_steps = (int)(lattice_update_counter / LATTICE_SIZE);
    SWAP(LATTICE, NEXT_STEP_LATTICE, SWAP);
}

The code runs the simulation with the initial conditions MAX_MCS, LATTICE_SIZE and LATTICE. LATTICE is an array initialized in the antiferromagnetic state (represented by an alternating sequence of ones and zeroes). To explore the solution space (the combinations of W0, MIU and SIGMA), we run the simulations one after another. C/C++'s % operator is in fact the remainder of division, not the modulo operator in the mathematical sense. The most prominent difference is that -1 % LATTICE_SIZE
== -1, whereas MOD((-1), LATTICE_SIZE) == LATTICE_SIZE-1. Therefore, while accessing the current spin's neighbours, the MOD(x, N) macro is used (Listing 2).

Listing 2. Modulo function-like macro

#define MOD(x, N) (((x < 0) ? ((x % N) + N) : x) % N)

6.2. Random number generation on CPU

The CPU code uses the GSL⁸-based Mersenne Twister⁹. Usage of the GSL-supplied MT is shown in Listing 3.

Listing 3. GSL's Mersenne Twister setup

#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
// ...
const gsl_rng_type *T;
gsl_rng *r;
// ...
double randomUniform() {
    return gsl_rng_uniform(r);
}
// ...
int main(int argc, char *argv[]) {
    gsl_rng_env_setup();
    T = gsl_rng_mt19937;
    r = gsl_rng_alloc(T);
    long seed = time(NULL) * getpid();
    gsl_rng_set(r, seed);
    // simulation
    // randomUniform() calls
}

6.3. CPU performance

The CPU tests were conducted on a quad-core AMD Phenom(tm) II X4 945 processor with 4 GB of RAM. Simulations occupied only one core at a time. The results presented in Figure 6.1 will be used as a baseline for further comparisons (with respective MAX_MCS values).

⁸ GSL - GNU Scientific Library, http://www.gnu.org/software/gsl/
⁹ http://www.gnu.org/software/gsl/manual/html_node/Random-number-generator-algorithms.html
Figure 6.1. Execution times of CPU simulations with MAX_MCS equal to 1 000 and 10 000. Markers denote the arithmetic mean of the 5 averages conducted. The fitted curves are 4th-degree polynomials.
7. GPU Simulations - thread per simulation

7.1. Thread per simulation

CUDA provides a C/C++-like language for executing code on the GPU (CUDA C). The code is compiled by the CUDA compiler which, via specific language extensions (e.g. __device__, __host__, __global__), can distinguish the parts to be executed by the CPU (host), the GPU (device), or kernels launched from the host onto the device.

Listing 4. Thread per simulation algorithm

while (monte_carlo_steps < MAX_MCS) {
    if (is_lattice_updated == FALSE && BOND_DENSITY(LATTICE) == 0.0) {
        // stop when lattice is in ferromagnetic state
        break;
    }
    float C = TRIANGLE_DISTRIBUTION((X / (float)MAX_X), (Y / (float)MAX_Y));
    float W0 = Z / (float)MAX_Z;
    first_i = (int)(LATTICE_SIZE * RANDOM(&state[BLOCK_ID])) + THREAD_LATTICE_INDEX;
    last_i = (int)(first_i + (C * LATTICE_SIZE));
    is_lattice_updated = FALSE; // reset the update flag
    for (int i = THREAD_LATTICE_INDEX; i < LATTICE_SIZE + THREAD_LATTICE_INDEX; i++) {
        NEXT_STEP_LATTICE[i] = LATTICE[i];
        if ((first_i <= i && i <= last_i)
            || (last_i >= LATTICE_SIZE + THREAD_LATTICE_INDEX
                && (last_i % (LATTICE_SIZE + THREAD_LATTICE_INDEX) >= i || i >= first_i))
        ) {
            int left = MOD((i-1), LATTICE_SIZE) + THREAD_LATTICE_INDEX;
            int right = MOD((i+1), LATTICE_SIZE) + THREAD_LATTICE_INDEX;
            // If neighbours are the same
            if (LATTICE[left] == LATTICE[right]) {
                NEXT_STEP_LATTICE[i] = LATTICE[left];
            }
            // ... otherwise randomly flip the spin
            else if (W0 > RANDOM(&state[BLOCK_ID])) {
                NEXT_STEP_LATTICE[i] = FLIP_SPIN(LATTICE[i]);
            }
            lattice_update_counter++;
        }
        if (LATTICE[i] != NEXT_STEP_LATTICE[i]) {
            is_lattice_updated = TRUE;
        }
    }
    monte_carlo_steps = (int)(lattice_update_counter / LATTICE_SIZE);
    SWAP(LATTICE, NEXT_STEP_LATTICE, SWAP);
}
7.2. Running the simulation

In order for the CUDA compiler, and then the GPU, to execute the code correctly, the programmer has to follow some conventions for the program structure. For instance, functions to be executed on the GPU have to be prefixed with the __global__ or __device__ keyword. Moreover, a call to a GPU function has to be made with the <<<gridDim, blockDim>>> syntax. The framework for executing code on the GPU is shown in Listing 5.

Listing 5. Exemplary foundation of GPU-executed code

// Imports
// Helper definitions etc.
__global__ void generate_kernel(
    curandStateMtgp32 *state,
    short *LATTICE,
    short *NEXT_STEP_LATTICE,
    int *DEV_MCS_NEEDED,
    float *DEV_BOND_DENSITY
) {
    // Code to be executed by GPU
    while (monte_carlo_steps < MAX_MCS) {
        if (is_lattice_updated == FALSE && BOND_DENSITY(LATTICE) == 0.0) {
            // stop when lattice is in ferromagnetic state
            break;
        }
        // Rest of the simulation code
    }
}
// ...
int main(int argc, char *argv[]) {
    // Initializations ...
    generate_kernel<<<gridDim, blockDim>>>(
        devMTGPStates,
        DEV_LATTICES,
        DEV_NEXT_STEP_LATTICES,
        DEV_MCS_NEEDED,
        DEV_BOND_DENSITY
    );
    // Obtaining results
    // Cleanup
}

7.3. Solution space

The important difference from the CPU version is the use of (X, Y, Z), which denote the position of the thread in the logical structure provided by the CUDA architecture. Threads are organized inside 3D structures called blocks and indexed using a "Cartesian" combination of {x, y, z}, referenced inside the kernel with threadIdx.{x,y,z}; the position of a block within the grid is likewise available as blockIdx.{x,y,z}. The grid is also a 3D structure, and its dimensions can be referenced inside the kernel
with gridDim.{x,y,z}. This structuring is provided for programmer convenience and is related to GPUs being devices meant for 2D and 3D graphics processing, where such "Cartesian" decomposition is quite natural. Although blocks and grids are logical structures, they are associated with physical properties of GPUs. This fact can (and should, whenever possible) be used for problem decomposition in order to optimize runtime performance. Here, (X, Y, Z) correspond to (MIU, SIGMA, W0), which are derived from (blockIdx.x, blockIdx.y, threadIdx.x). This was done in order to keep a relatively small number of threads in the block (see subsection 7.4). By this convention, each thread can calculate its own set of values of (MIU, SIGMA, W0). Listing 6 shows how a thread can map its coordinates into the initial parameters of the simulation. For instance, threads with blockIdx == (100,100,0) will be executing simulations for MIU=1.0 and SIGMA=0.5 if MIU_SIZE=100 and SIGMA_SIZE=200.

Listing 6. Simulation parameters computation for each thread

#define MIU_START 0.0
#define MIU_END 1.0
#define MIU_SIZE 10
#define SIGMA_START 0.0
#define SIGMA_END 1.0
#define SIGMA_SIZE 10
// ...
#define X blockIdx.x
#define Y blockIdx.y
#define Z threadIdx.x
#define MAX_X MIU_SIZE
#define MAX_Y SIGMA_SIZE
// ...
__global__ void generate_kernel(
    curandStateMtgp32 *state,
    short *LATTICE,
    short *NEXT_STEP_LATTICE,
    int *DEV_MCS_NEEDED,
    float *DEV_BOND_DENSITY
) {
    // ...
    float C = TRIANGLE_DISTRIBUTION(X / MAX_X, Y / MAX_Y);
    // ...
}
int main(int argc, char *argv[]) {
    dim3 blockDim(W0_SIZE, 1, 1);
    dim3 gridDim(MIU_SIZE, SIGMA_SIZE, 1);
    // ...
    generate_kernel<<<gridDim, blockDim>>>(
        // ...
    )
    // ...
}
7.4. Random Number Generators

An important part of every Monte Carlo simulation is randomness. In order for the simulation to converge to the actual result, the quality of the Random Number Generator (RNG) must be high. The de facto standard for scientific MC simulations is the Mersenne Twister¹⁰ [13]. There is a version of MT19937 optimized for GPGPU usage¹¹ that was included in CUDA as part of the cuRAND library¹². There are, however, some limitations of the built-in MT19937:

• 1 MTGP state per block
• Up to 256 threads per state
• Up to 200 states using the included, pre-generated sequences

MT is called with curand_uniform(state) and returns a floating point number in the range (0, 1]. The values are uniformly distributed in this range. To transform this sequence of uniformly distributed numbers into the triangular distribution, a special function (function-like macro) can be used (Listing 7).

Listing 7. Transformation of uniform into triangular distribution

#define TRIANGLE_DISTRIBUTION(miu, sigma) ({          \
    float start = max(miu - sigma, 0.0);              \
    float end = min(miu + sigma, 1.0);                \
    float rand = (                                    \
        curand_uniform(&state[BLOCK_ID])              \
        + curand_uniform(&state[BLOCK_ID])            \
    ) / 2.0;                                          \
    ((end - start) * rand) + start;                   \
})

7.5. Thread per simulation - static memory

In the algorithm presented in Listing 4, the memory usage is not optimized at all. Memory is not only allocated in the global memory space, but each time the program is run, the host's memory has to be allocated and copied to the device. Listing 8 shows the inefficient memory allocations that occur in the thread-per-simulation algorithm from subsection 7.1.

Listing 8. Dynamic allocation of memory

__global__ void generate_kernel(
    curandStateMtgp32 *state,
    short *LATTICE,
    short *NEXT_STEP_LATTICE,
    int *DEV_MCS_NEEDED,
    float *DEV_BOND_DENSITY
)
// ...
short *DEV_LATTICES;
short *DEV_NEXT_STEP_LATTICES;
CUDA_CALL(cudaMalloc(
    &DEV_LATTICES,
    THREADS_NEEDED * sizeof(short) * LATTICE_SIZE
));
CUDA_CALL(cudaMalloc(
    &DEV_NEXT_STEP_LATTICES,
    THREADS_NEEDED * sizeof(short) * LATTICE_SIZE
));
// ...
generate_kernel<<<grid_size, block_size>>>(
    devMTGPStates,
    DEV_LATTICES,
    DEV_NEXT_STEP_LATTICES,
    DEV_MCS_NEEDED,
    DEV_BOND_DENSITY
);

If the memory is allocated inside the kernel code, the need for time-consuming copying between host and device disappears. It is possible to statically allocate memory in the device code (Listing 9).

Listing 9. Static memory allocation inside kernel

__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int *DEV_MCS_NEEDED,
    float *DEV_BOND_DENSITY
) {
    short LATTICE_1[LATTICE_SIZE];
    short LATTICE_2[LATTICE_SIZE];
    short *LATTICE = LATTICE_1;
    short *NEXT_STEP_LATTICE = LATTICE_2;
    // ...
}
7.6. Comparison of static and dynamic memory use

Although quite simple, the above optimization does in fact improve the performance of the simulations. The results of static vs dynamic memory allocation are illustrated in Figure 7.1. All of the empirical tests of GPU code were done on a GeForce GTX 570 GPU with an Intel i7 CPU.

Figure 7.1. Speedup of static and dynamic (memory-pool) memory allocations with MAX_MCS equal to 1 000 and 10 000. Markers denote the arithmetic mean of the 5 averages conducted. The fitted curves are 4th-degree polynomials.

Figure 7.2 shows the results for a range of 1 up to 6 000 concurrent simulations. The static memory approach is faster than dynamic memory in every trial conducted. Moreover, as seen in Figure 7.2, static memory tends to maintain its speedup rather than lose "velocity", as is the case for the dynamic memory approach (compare the fitted curves above 40 000 concurrent simulations).
Figure 7.2. Speedup of static and dynamic (memory-pool) memory allocations with MAX_MCS equal to 1 000 and 10 000. Markers denote the arithmetic mean of the 5 averages conducted. The fitted curves are 4th-degree polynomials.
8. GPU Simulations - thread per spin

8.1. Thread per spin approach

The CUDA C Best Practices Guide¹³ encourages the use of multiple threads for optimal utilization of GPU cores. In this spirit, one can apply an approach where each spin is represented by a single thread and each simulation takes up an entire block. This idea is presented in Listing 10.

Listing 10. Thread per spin algorithm

while (monte_carlo_steps < MAX_MCS) {
    __syncthreads();
    if (threadIdx.x == 0) {
        SWAP(LATTICE, NEXT_STEP_LATTICE, SWAP);
        if (BOND_DENSITY(LATTICE) == 0.0) {
            // If lattice is ferromagnetic, simulation can stop
            monte_carlo_steps = MAX_MCS;
            break;
        }
        float C = TRIANGLE_DISTRIBUTION((X / (float)MAX_X), (Y / (float)MAX_Y));
        first_i = (int)(LATTICE_SIZE * curand_uniform(&state[BLOCK_ID]));
        last_i = (int)(first_i + (C * LATTICE_SIZE));
        monte_carlo_steps = (int)(lattice_update_counter / LATTICE_SIZE);
    }
    __syncthreads();
    NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[threadIdx.x];
    if ((first_i <= threadIdx.x && threadIdx.x <= last_i)
        || (last_i >= LATTICE_SIZE && (last_i % LATTICE_SIZE >= threadIdx.x || threadIdx.x >= first_i))
    ) {
        short left = MOD((threadIdx.x-1), LATTICE_SIZE);
        short right = MOD((threadIdx.x+1), LATTICE_SIZE);
        // Neighbours are the same
        if (LATTICE[left] == LATTICE[right]) {
            NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[left];
        }
        // Otherwise randomly flip the spin
        else if (W0 > curand_uniform(&state[BLOCK_ID])) {
            NEXT_STEP_LATTICE[threadIdx.x] = FLIP_SPIN(LATTICE[threadIdx.x]);
        }
        atomicAdd(&lattice_update_counter, 1);
    }
}

¹³ http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
8.2. Concurrent execution

The approach presented in Listing 10 uses some new features of CUDA. Namely, __syncthreads(), which can be used to synchronize the execution of threads. It ensures that all threads in a block will be executing the same instruction after passing the __syncthreads() call. Exactly LATTICE_SIZE*MIU_SIZE*SIGMA_SIZE*W0_SIZE threads are launched. Each block is exactly LATTICE_SIZE long (Listing 11).

Listing 11. Grid and block sizes

dim3 blockDim(LATTICE_SIZE, 1, 1);
dim3 gridDim(MIU_SIZE, SIGMA_SIZE, W0_SIZE);

Each thread in a block is running part of a single, block-wide simulation instance, and all threads execute the same code. This introduces a problem: every thread will execute initialization code, such as setting up W0, C, MIU, etc. Some of these values (like C) are random; therefore, running this code multiple times will produce different results. A situation where even one spin of the simulation is evaluated according to a different W0 value is unacceptable. A correct initial setup can be obtained by evaluating the initialization in only one thread (Listing 12).

Listing 12. Single-thread initialization

__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int *DEV_MCS_NEEDED,
    float *DEV_BOND_DENSITY
) {
    // ...
    if (threadIdx.x == 0) {
        LATTICE = LATTICE_1;
        NEXT_STEP_LATTICE = LATTICE_2;
        SWAP = NULL;
        lattice_update_counter = 0;
        monte_carlo_steps = 0;
        W0 = Z / (float)MAX_Z;
    }
    __syncthreads();
    // ...
    while (monte_carlo_steps < MAX_MCS) {
        // ...
    }
}

Concurrent execution by multiple threads makes initialization of LATTICE easier and faster. All of the threads update their own values. The block's threads access memory in bulk and without conflicts, which could be a potential source of speedup (Listing 13).

Listing 13. LATTICE initialization
__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int *DEV_MCS_NEEDED,
    float *DEV_BOND_DENSITY
) {
    // ...

    if (threadIdx.x == 0) {
        // Initialization
    }
    __syncthreads();
    // Initialize as antiferromagnetic
    NEXT_STEP_LATTICE[threadIdx.x] = threadIdx.x & 1;
    while (monte_carlo_steps < MAX_MCS) {
        // ...
    }
}

8.3. Thread communication

To ensure thread cooperation inside a simulation, block-level communication is needed. It can be obtained by means of shared memory. Shared memory is a type of memory residing on-chip. It is about 100x faster¹⁴ than uncached global memory. Shared memory is accessible to every thread in the block. Listing 14 illustrates the definition of shared resources inside a kernel. The CUDA compiler allocates the on-chip memory for shared variables only once (even though the kernel is executed by every thread). All of the threads in a block access the same place in on-chip memory when accessing shared data.

Listing 14. Shared memory definitions

__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int *DEV_MCS_NEEDED,
    float *DEV_BOND_DENSITY
) {
    __shared__ unsigned short LATTICE_1[LATTICE_SIZE];
    __shared__ unsigned short LATTICE_2[LATTICE_SIZE];
    __shared__ unsigned short first_i, last_i;
    __shared__ unsigned long long int lattice_update_counter;
    __shared__ unsigned long monte_carlo_steps;
    __shared__ float W0;

    __shared__ unsigned short *LATTICE;
    __shared__ unsigned short *NEXT_STEP_LATTICE;
    __shared__ unsigned short *SWAP;

¹⁴ http://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/
    // Initialization of LATTICE pointers, lattice_update_counter etc.

    while (monte_carlo_steps < MAX_MCS) {
        // ...
    }
}

8.4. Race conditions with shared memory

The issue of race conditions arises when multiple threads try to write to shared memory. Writing on the GPU is usually not an atomic operation. It actually consists of 3 different operations¹⁵; e.g. the incrementation of some number consists of:

1. Reading the value
2. Incrementing the value
3. Writing the new value

During the time required to perform these steps, other threads can interrupt the execution. Fortunately, CUDA does provide the programmer with a set of atomic*() functions. atomic*() ensures that any number of threads requesting a read or write of the same memory location will be served properly. The code presented in Listing 15 shows how to perform the lattice_update_counter incrementation to ensure correctness of results.

Listing 15. Atomic add

while (monte_carlo_steps < MAX_MCS) {
    // ...
    NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[threadIdx.x];
    if ((first_i <= threadIdx.x && threadIdx.x <= last_i)
        || (last_i >= LATTICE_SIZE && (last_i % LATTICE_SIZE >= threadIdx.x || threadIdx.x >= first_i))
    ) {
        // ...
        atomicAdd(&lattice_update_counter, 1);
    }
}

8.5. Thread per spin approach - reduction

Reduction is the process of decreasing the number of elements. This "definition", although vague, means that, having multiple elements of some sort, we apply some process to reduce the number of input elements.

¹⁵ http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions
The code presented in Listing 16 computes the bond density of LATTICE in each iteration. Moreover, it does so sequentially, in a single thread, which can be inefficient.

Listing 16. Unoptimized iteration initialization

    while (monte_carlo_steps < MAX_MCS) {
        __syncthreads();
        if (threadIdx.x == 0) {
            SWAP(LATTICE, NEXT_STEP_LATTICE, SWAP);
            if (BOND_DENSITY(LATTICE) == 0.0) {
                // If lattice is in ferromagnetic state, simulation can stop
                monte_carlo_steps = MAX_MCS;
                break;
            }
            float C = TRIANGLE_DISTRIBUTION((X / (float)MAX_X), (Y / (float)MAX_Y));
            first_i = (int)(LATTICE_SIZE * curand_uniform(&state[BLOCK_ID]));
            last_i = (int)(first_i + (C * LATTICE_SIZE));
            monte_carlo_steps = (int)(lattice_update_counter / LATTICE_SIZE);
        }
        __syncthreads();
        // ...
    }

Let us recall that the bond density ρ = 0 if and only if all spins are either in the SPIN_UP or the SPIN_DOWN position. From this simple observation we can conclude that the sum of all LATTICE elements (represented in code as 0s and 1s) equals 0 (all elements are zeroes) or LATTICE_SIZE (all elements are ones) whenever LATTICE is in a ferromagnetic state. Moreover, we can make use of multiple GPU threads and reuse the existing NEXT_STEP_LATTICE array, as its contents are not needed between iterations. In the algorithm presented in Listing 17 the sum of the LATTICE elements is calculated in log2(L) steps; for L = 64 this takes 6 iterations, after which the summation result is stored in NEXT_STEP_LATTICE[0].

Listing 17. Parallel reduction

    for (int i = LATTICE_SIZE / 2; i != 0; i /= 2) {
        if (threadIdx.x < i) {
            NEXT_STEP_LATTICE[threadIdx.x] += NEXT_STEP_LATTICE[threadIdx.x + i];
        }
        __syncthreads();
    }

The approach from Listing 17 can even be extended to calculate the actual BOND_DENSITY(LATTICE). This method (again using NEXT_STEP_LATTICE as an auxiliary array) is presented in Listing 18.

Listing 18.
Parallel reduction to calculate BOND_DENSITY(LATTICE)

    __syncthreads();
    NEXT_STEP_LATTICE[threadIdx.x] =
        2 * abs(LATTICE[threadIdx.x] - LATTICE[(threadIdx.x + 1) % LATTICE_SIZE]);
    __syncthreads();
    for (int i = LATTICE_SIZE / 2; i > 0; i /= 2) {
        if (threadIdx.x < i) {
            // Use NEXT_STEP_LATTICE as cache array
            NEXT_STEP_LATTICE[threadIdx.x] += NEXT_STEP_LATTICE[threadIdx.x + i];
        }
        __syncthreads();
    }

8.6. Thread per spin approach - flags

A further way of optimizing the performance is to avoid calculating the bond density on every execution of the while loop altogether. If, during an update iteration, none of the spins changed, then LATTICE was simply copied unchanged into NEXT_STEP_LATTICE. Such behavior suggests that the lattice has reached a stationary state; if that stationary state is one of the ferromagnetic states, the simulation can be stopped.

Listing 19 introduces a new variable, lattice_update_counter_iter, which holds the number of spins that were actually changed during the current iteration. If any change did occur, BOND_DENSITY(LATTICE) is not executed at all: the condition lattice_update_counter_iter == 0 is evaluated first and, because the && operator in C uses short-circuit evaluation, the part after && is never reached when the left operand is false. If, however, no change occurred (lattice_update_counter_iter == 0) and the lattice is in a ferromagnetic state (BOND_DENSITY(LATTICE) == 0.0), the simulation should stop. Unfortunately, break applies only to the thread with threadIdx.x == 0. To make the other threads stop their work as well, we can reuse the check each thread already performs before starting the actual work, i.e. set monte_carlo_steps = MAX_MCS. In this way the remaining threads are prevented from further execution (and from potentially interfering with the results).

Listing 19.
Thread per spin with flags

    __global__ void generate_kernel (
        curandStateMtgp32 *state,
        int *DEV_MCS_NEEDED,
        float *DEV_BOND_DENSITY
    ) {
        // Shared memory definitions
        __shared__ unsigned int lattice_update_counter_iter;
        // Simulation initialization
        if (threadIdx.x == 0) {
            // ...
            lattice_update_counter = 0;
        }
        __syncthreads();
        // Initialize as antiferromagnetic
        NEXT_STEP_LATTICE[threadIdx.x] = threadIdx.x & 1;
        while (monte_carlo_steps < MAX_MCS) {
            __syncthreads();
            if (threadIdx.x == 0) {
                // Iteration initialization
                if (lattice_update_counter_iter == 0
                    && BOND_DENSITY(LATTICE) == 0.0) {
                    // If ferromagnetic, simulation can stop
                    monte_carlo_steps = MAX_MCS;
                    break;
                }
                lattice_update_counter_iter = 0;
            }
            __syncthreads();
            NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[threadIdx.x];
            if ((first_i <= threadIdx.x && threadIdx.x <= last_i)
                || (last_i >= LATTICE_SIZE
                    && (last_i % LATTICE_SIZE >= threadIdx.x || threadIdx.x >= first_i))
            ) {
                // Iteration update
            }
            if (NEXT_STEP_LATTICE[threadIdx.x] != LATTICE[threadIdx.x]) {
                atomicAdd(&lattice_update_counter_iter, 1);
            }
        }
    }

8.7. Thread-per-spin performance

As seen in Figure 8.1, each improvement over the basic thread-per-spin method introduces some speedup. Especially noteworthy is the performance gap between the reduction and flag versions: using a flag to avoid per-iteration calculations of BOND_DENSITY(LATTICE) is significantly faster than the version equipped with the highly optimized BOND_DENSITY(LATTICE) algorithm.
Figure 8.1. Speedup of static and dynamic (memory-pool) memory allocations with MAX_MCS equal to 1 000 and 10 000. Markers denote the arithmetic mean of the 5 averages conducted. The fitted curves are 4th-degree polynomials.

8.8. Thread-per-spin vs thread-per-simulation performance

The tests presented in Figure 8.2 and Figure 8.3 compare the suggested approaches. The unoptimized thread-per-spin approach turns out to be faster than thread-per-simulation in every test under 20 000 concurrent simulations. Threads on the GPU do not have as powerful a processor at their disposal as threads run on the CPU. This leads to the conclusion that most tasks conducted on the GPU should be split onto separate threads to parallelize the execution, even at the expense of increased communication time. Above the 20 000-simulation threshold, however, the overhead introduced by the huge number of threads and RNG instances causes thread-per-spin to perform worse than the thread-per-simulation approach.
Figure 8.2. Execution times of thread-per-spin and thread-per-simulation simulations with MAX_MCS equal to 1 000. Markers denote the arithmetic mean of the 5 averages conducted. The fitted curves are 4th-degree polynomials.

Figure 8.3. Execution times of thread-per-spin and thread-per-simulation simulations with MAX_MCS equal to 10 000. Markers denote the arithmetic mean of the 5 averages conducted. The fitted curves are 4th-degree polynomials.
As with the execution times, the speedups of thread-per-simulation runs eventually exceed those of the thread-per-spin approach (Figure 8.4 and Figure 8.5). For massive numbers of concurrent simulations, thread-per-spin still performs relatively well, reaching speedups of about 8-9x, while thread-per-simulation shows an impressive speedup of up to 28x. For low numbers of simulations, below the 20 000 threshold, the thread-per-spin approach shows the better speedup; for larger workloads (25 000 simulations and more), thread-per-simulation gives the more promising results.

Figure 8.4. Speedups of thread-per-spin and thread-per-simulation simulations with MAX_MCS equal to 1 000. Markers denote the arithmetic mean of the 5 averages conducted. The fitted curves are 4th-degree polynomials.
Figure 8.5. Speedups of thread-per-spin and thread-per-simulation simulations with MAX_MCS equal to 10 000. Markers denote the arithmetic mean of the 5 averages conducted. The fitted curves are 4th-degree polynomials.

9. Bond density for some W0 values

The calculations made during this project helped develop some insight into how the triangular distribution can affect the phase transition. Some exemplary bond density (ρ) plots are presented in Figure 9.1 and Figure 9.2.
Figure 9.1. Bond density after 10^6 MCS, W0 = 0.9, MIU = [0, 0.25, 0.5, ..., 1] and SIGMA = [0, 0.25, 0.5, ..., 1]

Figure 9.2. Bond density after 10^6 MCS, W0 = 0.6, MIU = [0, 0.25, 0.5, ..., 1] and SIGMA = [0, 0.25, 0.5, ..., 1]
10. Conclusions

CUDA does in fact expose an easy-to-use environment for harnessing the power of present-day GPGPUs. The realization of the project helped speed up complex and time-consuming calculations that take days on high-end CPUs. It also provided experience with the CUDA compiler and its most useful libraries. Another important (although not discussed above) element of this study was the use of scripting languages. Technologies such as Python[17] make it easy to distribute work across GPGPU workstations, harvest the results, process the data[18], and plot[19] the results for pattern recognition and presentation. Unfortunately, the GPU architecture requires the programmer to know the underlying hardware and various programming techniques well in order to obtain optimal performance.

11. Future work

In the future, the developed CUDA program could be used to drive a fully fledged study of the physical phenomena described in section 4. In order to do that, more detailed data has to be gathered, including improved data resolution and a higher number of averages.

[17] http://www.python.org/
[18] http://www.numpy.org/
[19] http://matplotlib.org/
References

[1] C. Coulon, et al. Glauber dynamics in a single-chain magnet: From theory to real systems, Phys. Rev. B 69 (2004)
[2] L. Bogani, et al. Single chain magnets: where to from here?, J. Mater. Chem., 18, (2008)
[3] H. Miyasaka, et al. Slow Dynamics of the Magnetization in One-Dimensional Coordination Polymers: Single-Chain Magnets, Inorg. Chem., 48, (2009)
[4] R.O. Kuzian, et al. Ca2Y2Cu5O10: the first frustrated quasi-1D ferromagnet close to criticality, Phys. Rev. Letters, 109, (2012)
[5] K. Sznajd-Weron and S. Krupa. Inflow versus outflow zero-temperature dynamics in one dimension, Phys. Rev. E 74, 031109 (2006)
[6] F. Radicchi, D. Vilone, and H. Meyer-Ortmanns. Phase Transition between Synchronous and Asynchronous Updating Algorithms, J. Stat. Phys. 129, 593 (2007)
[7] B. Skorupa, K. Sznajd-Weron, and R. Topolnicki. Phase diagram for a zero-temperature Glauber dynamics under partially synchronous updates, Phys. Rev. E 86, 051113 (2012)
[8] I. G. Yi and B. J. Kim. Phase transition in a one-dimensional Ising ferromagnet at zero temperature using Glauber dynamics with a synchronous updating mode, Phys. Rev. E 83, 033101 (2011)
[9] M. Evans, N. Hastings, B. Peacock. Statistical Distributions, 3rd ed. New York: Wiley, pp. 187-188, (2000)
[10] E. Ising. Beitrag zur Theorie des Ferromagnetismus, Z. Phys. 31: 253-258, (1925)
[11] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller. Equation of State Calculations by Fast Computing Machines, Journal of Chemical Physics 21 (6): 1087-1092, (1953)
[12] W. Lenz. Beiträge zum Verständnis der magnetischen Eigenschaften in festen Körpern, Physikalische Zeitschrift 21: 613-615, (1920)
[13] M. Matsumoto and T. Nishimura. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator, ACM Trans. on Modeling and Computer Simulation Vol. 8, No. 1, January, pp. 3-30 (1998)