Accelerating economics: how GPUs
can save you time and money
Master’s Thesis
Submitted in partial fulfillment of the requirements for the degree of
Master of Science in Quantitative Finance
Author
Laurent Oberholzer
Supervisor
Prof. Dr. Karl Schmedders
Faculty of Economics, Business Administration and Information Technology
University of Zurich
June 17, 2015
Assignment
Master Thesis Title: Accelerating economics: how GPUs can save you time and
money
Program: Banking and Finance, UZH
Student: Laurent Oberholzer
Description:
Graphics processing units – or GPUs as they are more commonly known – are spe-
cialized circuits historically designed to efficiently handle computer graphics. They
are highly parallel computers which can process large amounts of data simultane-
ously. The graphics algorithms for which GPUs have been designed and optimized
share characteristics with other algorithms used in high-performance computing. For
certain well-suited scientific applications, the GPU’s infrastructure has been shown
to achieve substantial speedups. For example, the evaluation of the Black-Scholes
partial differential equation to price financial options has been found to be performed
nearly 200 times faster in parallel on a GPU than serially on a single-core CPU (Buck
2006, “GeForce 8800 & NVIDIA CUDA: A New Architecture for Computing on the
GPU”).
The main goal of this study is to illustrate how hybrid CPU/GPU systems can be
used within computational economics to decrease the execution time of an implemen-
tation of a particular model. We start with a mainstream implementation of Raberto
et al.'s (2001) Genoa Artificial Stock Market ("GASM"), an agent-based model which
simulates a financial market in discrete time in which heterogeneous agents trade a
single asset. In order to ensure that it is well suited for execution on the GPU, the
algorithm used to clear the market according to the authors' specified mechanism is
given particular attention. Existing parallel programming interfaces - in particular
the OpenACC standard and the Thrust parallel algorithms library - are then deployed
in the code. We aim to show:
– how the codebase of our GASM implementation is adapted to utilize these
technologies;
– how incrementally offloading work to the GPU affects the execution time of our
model;
– how this speedup varies as a function of the problem size (e.g. number of agents,
number of time steps, etc.), i.e. weak scaling; and
– how parameterizing the work distribution within the OpenACC programming
model to increase the number of execution units used impacts this speedup, i.e.
strong scaling.
This study also aims at giving the reader a working knowledge of GPU-based parallel
computing, and when and how it should be used. All source code will be made
freely available under a GNU GPLv3 license to provide a real world example of the
application of the presented concepts.
Executive Summary
Introduction
The increasing prevalence of graphics processing units (GPUs) in high performance
computing makes them essential devices in modern computational research. However,
despite the need for computational power in many economic problems and the nu-
merous claims of the GPU’s ability to significantly accelerate parallel applications,
there have been scarce reports of their use in computational economics. Researchers
now have access to a wealth of tools which have democratized the use of the GPU
for general-purpose computations. Indeed, libraries and directive-based APIs such as
Thrust and OpenACC provide a relatively effortless way to write or port code to the
GPU.
This thesis aims at documenting how Raberto et al.’s (2001) Genoa Artificial Stock
Market (GASM), an agent-based model in which heterogeneous agents randomly
trade a single asset in discrete time, was first implemented for sequential execution
on a single CPU core and then ported to the GPU for massively parallel execution.
We demonstrate the extent to which this enables us to simulate the model faster (a
speedup) or with greater numbers of agents than on the CPU (a sizeup), and under
what conditions. With the goal of providing the reader with a basic introduction
to general-purpose computing on GPUs, we briefly overview some basic theoretical
parallel computing concepts, present the main principles of the GPU’s architecture
and emphasize its salient differences with that of the CPU.
Methodology
At each time step, all agents in the GASM submit random buy or sell orders based
on their cash and asset holdings. A market clearing price is then found through
a mechanism matching demand and supply, the key ingredient of the model. We
consider three different algorithms to implement this mechanism, of which Raberto
et al. provide only an abstract definition. The market clearing mechanism amounts
to finding three values within an array of unsorted excess demand values; if the
array is sorted, the three values are adjacent to one another.
The first algorithm calculates all array values and then performs three linear
searches. The second algorithm calculates all array values, sorts them in increas-
ing order and then performs a single binary-type search to find one of the three
values. The third algorithm sorts the inputs to the monotonic excess demand func-
tion, and then performs a single binary-type search while calculating array values on
the fly where necessary using the sorted input array. These algorithms differ mainly
in their efficiency and their ability to be massively parallelized. In particular, the
first two algorithms are relatively inefficient but contain large amounts of parallelism,
making them inherently ill suited to sequential execution on the CPU but well suited
to massively parallel execution on the GPU. The third algorithm, however, is very
efficient but has meager amounts of parallelism and is therefore better suited to the
CPU than the GPU. These differences allow us to highlight the importance of taking
into account the characteristics of the target architecture in algorithm design.
The search and sort algorithms of all three algorithms are implemented sequen-
tially on the CPU using functions from the C++ Standard Template Library, and are
executed in parallel on the GPU using direct counterparts from the Thrust parallel
algorithms library. The excess demand function consists solely of a loop, whose itera-
tions are independent and which can therefore be easily parallelized on the GPU using
OpenACC pragma directives. With the exception of the generation of pseudorandom
numbers by the NVIDIA cuRAND library, the remaining functions within the model
also consist of loops parallelized using OpenACC.
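As a hedged illustration of this porting pattern (a sketch of ours, not the thesis code; num_agents is a hypothetical size), replacing an STL sort of an excess demand array with its Thrust counterpart requires little more than a change of container and namespace:
#include <algorithm>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

std::vector<double> ed(num_agents);        // hypothetical excess demand array
// ... fill ed ...
std::sort(ed.begin(), ed.end());           // sequential sort on the CPU

thrust::device_vector<double> d_ed = ed;   // copy the array to the GPU
thrust::sort(d_ed.begin(), d_ed.end());    // massively parallel sort on the GPU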
Results
Our results show that while the GPU can significantly accelerate the running time
of our programs, this only holds under certain conditions. In particular, if the model
contains fewer than 350 agents, as in the original specification of Raberto et al. (2001),
no algorithm runs faster on the GPU than on the CPU. This is due to the fact that
none of our algorithms exhibit sufficient parallelism to execute efficiently on the GPU
for such small problem sizes. In fact, given that it inherently has the lowest amount
of parallelism, the third algorithm must even be executed with at least 12 500 agents
on the GPU for any speedup to be attained.
Given sufficient numbers of agents, however, the GPU can achieve speedups of up
to 90× on the first two algorithms, and a more modest maximum speedup of 7× on
the third algorithm. Considering that the first two algorithms are ill suited
to the CPU and that we are comparing the performance of a single CPU core versus
an entire GPU with thousands of cores, this casts doubt on the validity of other
speedups of two to three orders of magnitude reported in the literature. Above 12 500
agents, when it is well suited and performs best on both the single CPU core and the
GPU, the third algorithm provides a fairer comparison of the respective capabilities of
the two devices. The 7× speedup is therefore probably a more accurate illustration of
what can be expected from accelerating similar real-life sequential CPU applications
with the GPU.
To isolate the impact of porting individual functions to the GPU, we consider
seven different versions of our three algorithms, each of which incrementally paral-
lelizes additional functions. We observe that most of the functions outside the market
clearing mechanism generally perform worse on the GPU due to their low amounts
of parallelism and low ratio of compute operations to memory operations (arithmetic
intensity). These functions only execute efficiently on the GPU for tens or hundreds
of thousands of agents, for which the model can only be simulated in feasible time
spans with the third algorithm due to its superior efficiency. For any number of
agents, however, the order generation loop cannot be performed faster on the GPU
due to its many if-else statements. This results in a highly diver-
gent control flow, which carries a significant performance penalty as the GPU’s basic
processing elements execute in SIMD fashion and must therefore sequentially execute
each branch path. Although this might be suboptimal from a pure memory transfer
point of view, it might often be best to offload only part of an application to the GPU
while leaving the rest on the CPU, given the difference in the types of computations
for which both architectures are well suited.
Lastly, we analyze how our parallel applications can scale to greater numbers of
processing elements and/or greater amounts of work. Unfortunately, since the Thrust
algorithms used to port its main functions do not allow controlling the number of
processing elements used, the third algorithm had to be excluded from this analysis.
Nonetheless, our OpenACC-compatible compiler offloads work to the GPU using
CUDA, which gives us some control over the use of the GPU’s processing elements and
is explicitly designed to allow parallel applications to easily and transparently scale up
or down. It achieves this by requiring that groups of threads execute independently
and in any order, allowing more or less of them to compute in parallel depending on
the available resources. Within certain hardware limits on the number of concurrent
threads that can reside on the GPU, our results show that the first two algorithms
scale extremely well. In fact, the achieved speedups correspond practically exactly to
the theoretical maxima predicted by Amdahl’s law.
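For reference, Amdahl's law states that if a fraction p of a program's work can be parallelized, the speedup attainable on N processing elements is bounded by

S(N) = 1 / ((1 - p) + p/N),

which tends to the limit 1/(1 - p) as N grows.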
Conclusion
Two limitations of this study are the comparison of a single CPU core’s performance
versus that of an entire GPU, and the use of unoptimized code on both the CPU
and the GPU. The former limitation under-utilizes most of our 64-core CPU, and
therefore prevents us from making more general statements on the merits of the CPU
versus those of the GPU. Future work could address this by parallelizing loops and
Thrust algorithms over all available CPU cores using the OpenMP system. The latter
limitation is less straightforward to address, and would require rigorously evaluating
the CPU and GPU versions of our code against architecture-specific optimization
techniques. Nonetheless, since no optimization was applied to either platform, this
at least does not invalidate our comparison of the two.
This study has shown that massively parallel execution on the GPU can successfully
speed up one example of an economic application and allow it to solve a larger problem
size, relative to a single CPU core’s sequential execution. While the GPU is not able
to accelerate every type of computation, it is very efficient for applications with high
parallelism, high arithmetic intensity, large datasets, and low control flow divergence.
Given the growing number of computationally intensive economic problems, and
the availability of tools making general-purpose computations on the GPU relatively
easy and accessible, we expect the GPU to become commonplace in computational
economics in the years to come.
Acknowledgments
I am extremely fortunate to have been accompanied by many outstanding people
throughout the extraordinary journey that is the MSc UZH ETH in Quantitative Fi-
nance, which this Master’s thesis concludes, and would like to take this opportunity
to acknowledge their support.
I would first like to thank Prof. Dr. Karl Schmedders for supervising my thesis,
giving me the opportunity to conduct this study and thereby hopefully contribute
to the nascent use of GPUs in computational economics. His lecture on “Computa-
tional Economics and Finance” was the highlight of my Master’s curriculum, if only
for Karl’s exuberant teaching style. More importantly, it was some of the most inter-
esting time I have ever spent in a classroom.
Secondly, I would like to thank Dr. Gregor Reich who mentored me throughout
this Master’s thesis, and with whom I had great pleasure working. Not only is Gre-
gor a very friendly person with whom I partook in many intellectually challenging
discussions, but his knowledge, experience and enthusiasm were invaluable in guiding
me through this work.
In addition, my time in Zurich would not have been the same without many people
I am proud to call my friends. Thank you for being by my side through good times
and bad. In particular, I would like to thank Markus, Pascal, Pavel, Rafaela, Ryan,
Stefan and Tom; as well as my friends from the ETH Entrepreneur Club and espe-
cially Daniel, Jonathan and Wolf. There is no doubt in my mind that they will all
excel in their respective endeavors.
To my dear friends and colleagues François and Pascal, with whom I have already
embarked on my next challenge: thank you for your patience and support. I am
excited about what the future will bring, and I think the best is yet to come.
To my girlfriend Hélène, whose support extends far beyond this Master's thesis, I
address my deepest thanks. She has been my source of encouragement, inspiration
and joy since the day I met her, and I am a better person for knowing her.
Finally, I would like to thank my family, and in particular my parents, without
whom none of this would have been possible. Their unwavering support and encour-
agement since 1991 have enabled everything I have achieved and become.
Contents

1. Introduction
2. Parallel Computing
   2.1. Principles
      2.1.1. Parallel Algorithms
         2.1.1.1. Work Partitioning
         2.1.1.2. Dependencies
         2.1.1.3. Communication
         2.1.1.4. Synchronization
         2.1.1.5. Load Balancing
      2.1.2. Parallel Computers
         2.1.2.1. Flynn's Taxonomy
         2.1.2.2. Memory Distribution
      2.1.3. Supercomputers
   2.2. Graphics Processing Unit
      2.2.1. Principles
         2.2.1.1. Exploiting Parallelism
         2.2.1.2. Exploiting Coherence
         2.2.1.3. Hiding Memory Latency
         2.2.1.4. CPU - GPU Connection
         2.2.1.5. Scalability
      2.2.2. General-Purpose Computing
         2.2.2.1. From Fixed-Function to Programmable Shaders
         2.2.2.2. CPU vs. GPU
         2.2.2.3. Tools
      2.2.3. Applications
         2.2.3.1. Agent-Based Models
3. Genoa Artificial Stock Market
   3.1. Order Generation
      3.1.1. Limit Price
      3.1.2. Quantity
   3.2. Market Clearing
      3.2.1. Clearing Price
         3.2.1.1. Standard Order Books
         3.2.1.2. Pathological Order Books
      3.2.2. Discarding Excess Demand
   3.3. Trade Settling
4. Serial Implementation
   4.1. Order Book Generation
      4.1.1. Agent Sigmas
      4.1.2. Pseudorandom Number Generation
      4.1.3. Limit Price and Quantities
   4.2. Market Clearing
      4.2.1. Excess Demand
      4.2.2. Algorithm A
      4.2.3. Algorithm B
      4.2.4. Algorithm C
      4.2.5. Algorithm Comparison
         4.2.5.1. Algorithm A
         4.2.5.2. Algorithm B
         4.2.5.3. Algorithm C
         4.2.5.4. Conclusion
      4.2.6. Discarding Excess Demand
   4.3. Trade Settling
5. Parallel Implementation
   5.1. Version 1
   5.2. Version 2
   5.3. Version 3
      5.3.1. Excess Demand Evaluations
   5.4. Version 4
      5.4.1. Algorithm Comparison
         5.4.1.1. Algorithm A
         5.4.1.2. Algorithm B
         5.4.1.3. Algorithm C
         5.4.1.4. Conclusion
   5.5. Version 5
   5.6. Version 6
6. Results
   6.1. Baseline
   6.2. Increasing the Number of Agents
      6.2.1. Algorithms A and B
         6.2.1.1. Excess Demand Evaluations
         6.2.1.2. Other Functions
         6.2.1.3. Imbalance
      6.2.2. Algorithm C
   6.3. Speedup vs. Number of Agents
   6.4. Scaling
      6.4.1. Strong Scaling
         6.4.1.1. Versions 0-2
         6.4.1.2. Versions 3-6
      6.4.2. Weak Scaling
7. Conclusion
A. GASM Definitions
   A.1. Agent Sigmas
   A.2. Demand and Supply
   A.3. Market Clearing Price per Order Book Type
      A.3.1. Standard
      A.3.2. Pathological
B. Market Clearing Price per Order Book Type, Implementation

List of Figures

1. Flynn's Taxonomy
2. Simplified Graphics Pipeline
3. CPU vs. GPU: Hardware
4. CPU vs. GPU: Performance
5. Standard Order Books
6. Pathological Order Books
7. Market Clearing Arrays
8. Market Clearing Arrays - Sorted
9. Running Time (s), N = 100
10. Profiler Output
11. Running Time (s), N = 1000
12. Running Time (s), N = 20 000 and N = 100 000
13. Running Time (s) vs. Number of Agents
14. Speedup vs. Number of Agents
15. Strong Scaling
16. Weak Scaling

List of Tables

1. Order Generation Summary
2. Market Clearing Algorithms: Summary
3. Market Clearing Algorithms: Serial Time Complexity
4. Algorithm Versions
5. Core Data Transfers - Version 1
6. Core Data Transfers - Version 2
7. Core Data Transfers - Version 3
8. Core Data Transfers - Version 4
9. Market Clearing Algorithms: Optimal Parallel Time Complexity
10. Core Data Transfers - Version 5
11. Core Data Transfers - Version 6
12. Market Clearing Price and Excess Demands by Order Book Type
1. Introduction
Computational economics, a field at the intersection of economics and computation,
allows the study of economic issues with far more complexity and realism than tra-
ditional models. Advances in numerical analysis and increasingly powerful computer
hardware have given rise to entire collections of previously intractable economic prob-
lems (Judd, 1998). For example, the analytical modeling of financial markets consist-
ing of heterogeneous agents is very difficult, and analysts must typically resort to modeling
an idealized representative agent (Humphreys, 2009) at the cost of losing economic
believability and robustness (LeBaron, 2006). Computational models, on the other
hand, can quite easily model such complex adaptive systems while reproducing puz-
zling empirical features (LeBaron, 2006).
Today’s desktop computers provide economists with nearly an order of magnitude
more computing power than a leading supercomputer less than two decades ago.
While performance increases were historically achieved by increasing processor tran-
sistor counts and clock speeds, this technique quickly yielded diminishing returns and
generated excessive waste heat. In response, processor manufacturers began integrat-
ing additional processing elements (or cores) on a single chip to speed up execution
with parallel computing (Creel and Goffe, 2008). This concept was borrowed from
supercomputers, which since the mid-60s have relied on parallelism in the form of multiple
processing elements and vector processing to accelerate computations and solve larger
problems. While parallel computing has become a fundamental paradigm in modern
high performance computation, and ubiquitous even in personal computers, its up-
take in economics has been relatively slow. This is particularly true for certain classes
of parallel computer architectures such as graphics processing units (GPUs) (Aldrich,
2014). While the optimization of single-processor performance remains paramount
(Hager and Wellein, 2010), the clear trend towards increased parallelism compels
economists to have at least a rudimentary grasp of parallel computing. The returns
can be high: some applications of GPUs in economics have been shown to speed up
the simulation of economic models by up to 200× (Aldrich et al., 2011), although
some have claimed that this is an unfair comparison due to the use of unoptimized
code on the CPU and biased choices of hardware.
The aim of this thesis is to document how a computational economic model,
Raberto et al.’s (2001) Genoa Artificial Stock Market, can make use of parallel com-
puting in a hybrid CPU/GPU workstation, and with what results. In particular, we
aim to answer the following questions:
1. How must the codebase of an existing serial implementation be adapted to
utilize parallel programming techniques and the GPU?
2. How does incrementally parallelizing parts of the program with the GPU speed
up the running time of the model?
3. How does the parallel speedup scale with the size of the problem (weak scaling)?
4. How does the parallel speedup scale with the number of processing units (strong
scaling)?
The rest of this thesis is structured as follows: section 2 introduces some basic princi-
ples of parallel computing, graphics processing units and their use in high performance
scientific computing; section 3 defines the Genoa Artificial Stock Market; section 4
documents the serial implementation of the model, and in particular the three dif-
ferent algorithms used to clear the market; section 5 describes how each program is
incrementally parallelized and executed on the GPU; section 6 presents the timing and
scaling results of our serial and parallel implementations; and finally section 7 draws
a conclusion on the work, highlights some of its limitations and provides suggestions
for future research.
2. Parallel Computing
This section provides a brief summary of Barney's[1] and Oberholzer's (2013) in-
troductions to parallel computing. It aims at defining some basic notions of parallel
computing used throughout this report, but does not aspire to provide a thorough
introduction to this complex topic.

[1] Blaise Barney. Introduction to Parallel Computing. [Online; accessed 09-May-2015]. URL: https://computing.llnl.gov/tutorials/parallel_comp/
2.1. Principles
Parallel computing involves the use of multiple compute resources concurrently to
solve a computational problem. This contrasts with serial computing, which only
involves a single compute resource. Parallel computing allows solving problems faster
or with larger input sizes, at the expense, however, of increased difficulty for the
programmer.
2.1.1. Parallel Algorithms
In addition to single-processor performance concerns as in the classical serial case,
the parallel algorithm designer must consider certain key issues specific to parallel
computing.
2.1.1.1. Work Partitioning
Parallel algorithms distribute work across multiple compute resources in two ways.
1. Domain decomposition (or data parallelism) partitions the data associated with
the problem, where each compute resource performs the same task on its subset
of the data.
2. Functional decomposition (or task parallelism) partitions the tasks associated with
the algorithm, where each compute resource performs a different task on
different data.
Note that domain decomposition is a special case of functional decomposition.
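As a hedged C++ sketch (the function and data are hypothetical, chosen only to illustrate the distinction), domain decomposition runs the same task on different slices of the data:
#include <thread>
#include <vector>

// Hypothetical task: scale a slice [lo, hi) of a vector by a.
void scale(std::vector<double>& v, std::size_t lo, std::size_t hi, double a) {
    for (std::size_t i = lo; i < hi; ++i) v[i] *= a;
}

int main() {
    std::vector<double> v(1000, 1.0);
    // Domain decomposition: the same task on the two halves of the data.
    std::thread t1(scale, std::ref(v), 0, 500, 2.0);
    std::thread t2(scale, std::ref(v), 500, 1000, 2.0);
    t1.join(); t2.join();
    // Functional decomposition would instead hand t1 and t2 different tasks.
}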
2.1.1.2. Dependencies
Dependencies are the main inhibitors of parallelism, since they impose that the in-
structions (or tasks) which make up the program be executed in a specific order.
Clearly, if an instruction must be performed after another – for example because it
has as input the other instruction’s output – then the two instructions cannot be ex-
ecuted in parallel. The longest chain of dependent instructions within a program (its
critical path) constitutes a lower bound on the program's parallel running time.
If there are few or no dependencies between instructions, the algorithm is said to
be embarrassingly parallel. Inherently sequential algorithms, on the other hand, have
dependencies at every instruction.
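For example (a hedged sketch, not taken from the thesis code), the first loop below carries a dependency from each iteration to the next and, as written, is inherently sequential, while the second is embarrassingly parallel:
// Inherently sequential: iteration i needs the result of iteration i-1.
for (int i = 1; i < n; ++i)
    y[i] = y[i-1] + x[i];       // a chain of dependent instructions

// Embarrassingly parallel: iterations are independent of one another.
for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];       // every element can be computed concurrently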
2.1.1.3. Communication
Communication between tasks may be required if, for example, a task must communi-
cate its output to another so that the latter may proceed. Since communications use
valuable resources to prepare and carry out, they are costly. In the worst case, commu-
nications are synchronous, requiring tasks to synchronize to negotiate communication
parameters before transmissions. Embarrassingly parallel problems require little to no
communication between tasks.
The ratio of computation to communication, granularity, must be chosen according
to the algorithm and the hardware architecture at hand. Since the overhead costs
(e.g. latency, synchronization) of communications are typically high relative to com-
munication costs (bandwidth), high granularity (coarse-grain parallelism) is usually
preferable to low granularity (fine-grain parallelism), at the risk of load imbalance.
2.1.1.4. Synchronization
Synchronization is the coordination of parallel tasks, very often associated with com-
munications. This is typically done by establishing a synchronization point, a barrier,
which all participating tasks must reach before any may proceed. Since at
least one task will have to wait at the barrier, synchronization increases a program's
execution time.
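A hedged OpenMP sketch of an explicit barrier (the two phase functions are hypothetical):
#pragma omp parallel
{
    compute_phase_one();  // hypothetical first phase, executed by every thread
    #pragma omp barrier   // each thread waits here until all threads arrive
    compute_phase_two();  // second phase, may safely read phase-one results
}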
2.1.1.5. Load Balancing
Load balancing involves distributing work to each compute resource so as to avoid
idleness. If all processors perform equally and the workload is deterministic, a simple
load balancing scheme involves distributing the workload evenly across processors.
An alternative scheme consists of splitting the workload into batches which are then
queued. Once a processor finishes working on its batch, it receives a new one. This
minimizes the time that faster processors or those with smaller workloads stay idle.
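In OpenMP, for instance, these two schemes correspond to static and dynamic loop scheduling (a hedged illustration; process() is a hypothetical task):
// Even, up-front split of iterations (assumes uniform work per iteration).
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; ++i) process(i);

// Batches of 64 iterations handed out from a queue as threads become free.
#pragma omp parallel for schedule(dynamic, 64)
for (int i = 0; i < n; ++i) process(i);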
2.1.2. Parallel Computers
The design of parallel algorithms must also take into account the architecture of the
parallel computer they will be executed on. For the purpose of this report, we classify
parallel computers according first to their number of concurrent instruction and data
streams (Flynn’s Taxonomy), and then according to their memory architecture.
2.1.2.1. Flynn’s Taxonomy
Flynn (1972) introduces a categorization of computer architectures, illustrated in
Figure 1 with figures from Wikipedia[2], based on the number of concurrent instructions
they can execute at each clock cycle (a single instruction, SI, or multiple instructions,
MI) and on the number of data streams they can operate on at each clock cycle (a
single data stream, SD, or multiple data streams, MD). "PU" denotes a processing
unit (or processing element).

[2] Wikipedia (2015b). Flynn's taxonomy — Wikipedia, The Free Encyclopedia. [Online; accessed 10-May-2015]. URL: http://en.wikipedia.org/w/index.php?title=Flynn’s_taxonomy&oldid=661352526
Figure 1: Flynn's Taxonomy
(a) SISD: A single processing element executes a single instruction on a single stream of data.
(b) SIMD: Multiple processing elements execute a single instruction on multiple streams of data.
(c) MISD: Multiple processing elements execute multiple instructions on a single stream of data.
(d) MIMD: Multiple processing elements execute multiple instructions on multiple streams of data.
The two dominating concepts in modern computing, both personal and high per-
formance, are SIMD and MIMD. SIMD computers include vector processors (which
operate on vectors instead of scalars), the vector units of modern CPUs and GPUs
(Hager and Wellein, 2010). Multi-core processors have MIMD architectures, in addi-
tion to often having SIMD capabilities at the core level.
2.1.2.2. Memory Distribution
In addition to Flynn’s Taxonomy, we further classify parallel computers based on
their memory architecture.
Shared memory computers share (logically and often physically) common mem-
ory across all processors. We speak of uniform memory access (UMA) if the access
time to the shared memory is the same for all processors, and of non-uniform memory
access (NUMA) otherwise. All processors share the same view of memory.
Distributed memory computers have (logically and often physically) separate
memory for each processor. Processors communicate by transmitting messages through
the network that connects them.
Hybrid distributed-shared memory computers combine elements of both shared
and distributed memory computers, e.g. a set of interconnected shared-memory sys-
tems.
2.1.3. Supercomputers
Supercomputers are very high performance computers used for the most challenging
scientific and industrial applications, and as such are at the frontier of high per-
formance computing. Most modern supercomputers are hybrid distributed-shared
memory MIMD systems consisting of a cluster[3] of closely interconnected computers
(or nodes), each with multiple multi-core CPUs. For example, the world's most pow-
erful supercomputer, the Chinese Tianhe-2, has 16 000 nodes, each equipped with 2x
12-core processors, 3x 57-core "coprocessors" and 88 gigabytes of memory[4].
Coprocessors, such as the Intel Xeon Phi installed in the Tianhe-2, are a special
type of hardware accelerators. Hardware accelerators are closely controlled by the
CPU to execute compute-intensive code, with examples including field-programmable
gate arrays (FPGAs) and application specific integrated circuits (ASICs) (Hager and
Wellein, 2010). The most common supercomputer accelerators are graphics
processing units (GPUs), which are the main focus of this work. Since 2010, there
has been a clear trend towards the use of accelerators, and in particular GPUs, in
supercomputers. In fact, out of the 500 most powerful commercially available com-
puter systems in November 2014[5], 75 (53) were equipped with accelerators (GPUs),
including 11 (7) of the top 25. According to a study by the International Data Cor-
poration[6], the use of accelerators is growing fast, with 76.9% of high performance
computing sites being equipped with them in 2013. GPUs are now therefore clearly
part of the high performance computing landscape.

[3] In November 2014, over 85% of the world's 500 most powerful supercomputers were based on the cluster architecture. TOP500.org. TOP500 List - November 2014. [Online; accessed 26-Apr-2015]. URL: http://www.top500.org/list/2014/11/
[4] TOP500.org. TIANHE-2 (MILKYWAY-2): NATIONAL UNIVERSITY OF DEFENSE TECHNOLOGY. [Online; accessed 11-May-2015]. URL: http://www.top500.org/featured/top-systems/tianhe-2-milkyway-2-national-university-of-defense/
[5] TOP500.org. TOP500 List - November 2014. [Online; accessed 26-Apr-2015]. URL: http://www.top500.org/list/2014/11/
[6] International Data Corporation (IDC). IDC SC13 Breakfast Briefing. [Online; accessed 11-May-2015]. URL: http://www.idc.com/downloads/idc_at_sc13_%2011-19-2013_final.pdf
2.2. Graphics Processing Unit
The graphics processing unit, or GPU, is a key component of any modern computer.
Its earliest ancestors were dedicated hardware accelerators developed to accelerate
graphics processing, mainly in the areas of computer-aided design and computer
games. For example, GPUs are used to render a two-dimensional representation
of a three-dimensional scene through a process known as the graphics pipeline, illus-
trated in Figure 2 (Kaufman, Fan, and Petkov, 2009). The graphics pipeline operates
on the points (vertices) defining the geometric primitives which describe the scene
(e.g. triangles), along with additional attributes such as texture, color and lighting,
to produce a raster image representation consisting of a rectangular array of discrete
picture elements or pixels (Blythe, 2008).
Figure 2: Simplified Graphics Pipeline
2.2.1. Principles
The graphics pipeline involves performing mathematical operations on a stream of
a very large number of different objects at every step of the pipeline, and is there-
fore inherently a high throughput process. In order to understand the GPU and its
applications, we must first understand the fundamental principles which underly its
design. This subsection summarizes part of Blythe’s (2008) excellent overview of the
GPU.
2.2.1.1. Exploiting Parallelism
There are inherently two levels of parallelism within the graphics pipeline:
• data parallelism, wherein the very large number of objects (e.g. geometric
primitives, vertices, pixel fragments) at each stage of the graphics pipeline can
be processed in parallel, and
• task parallelism, wherein the different steps of the graphics pipeline can be
performed concurrently on different objects through pipelining.
Consequently, the GPU is designed to be a highly parallel computer. A modern GPU
contains multiple processing units, known as streaming multiprocessors in NVIDIA[7]
GPUs, which can each process a different task in parallel to leverage task parallelism.
Each streaming multiprocessor can further operate on multiple objects in parallel to
leverage data parallelism. For example, the NVIDIA Kepler GK110 GPU contains 15
streaming multiprocessors, each of which contains 192 cores, for a total of 2880 cores.

[7] The NVIDIA Corporation, a publicly traded company based in California (United States), is one of the leading manufacturers of graphics processing units.
2.2.1.2. Exploiting Coherence
Each streaming multiprocessor within a modern GPU essentially operates in SIMD
fashion, executing instructions on multiple data elements (lanes) concurrently. Since
only a single instruction can be executed at a time within a group of lanes, a condi-
tional statement (such as an if statement) that branches the execution flow will
cause some lanes to diverge, and each branch will be executed sequentially by the
subset of lanes on that branch. GPUs therefore achieve maximum efficiency when
branch divergence is avoided entirely, i.e. when all lanes execute an identical instruc-
tion stream (this is called computational coherence).
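As a hedged sketch (a hypothetical kernel fragment, not from the thesis code), the following condition splits the lanes of a group into two subsets whose branches must execute one after the other:
if (idx % 2 == 0)
    y[idx] = a*x[idx];          // even lanes execute; odd lanes sit idle
else
    y[idx] = x[idx]*x[idx];     // odd lanes execute; even lanes sit idle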
Furthermore, the GPU is designed for high throughput processing; easily dozens to
hundreds of millions of objects per second at every step of the graphics pipeline. With
each object requiring a certain amount of memory, the coherence of memory accesses is
also of very high importance. Since memory bandwidth is precious, memory accesses
should be grouped into as few requests as possible and use the entire block of data
fetched.
2.2.1.3. Hiding Memory Latency
The amount of unique, non-reusable data which is typically used to render a scene
(e.g. tens or hundreds of megabytes just for texture maps) is so large that the CPU’s
conventional strategy of using memory caches to hide memory latency is ineffective.
Instead, the GPU uses smaller memory caches to optimize bandwidth for SIMD mem-
ory requests and hides latency using multithreading. This technique involves main-
taining a pool of threads available for execution which can substitute a thread stalled
by the latency of a memory request. Since the cost of switching between threads is
essentially zero, memory latency can thus be completely hidden as long as the pool
contains enough threads.
2.2.1.4. CPU - GPU Connection
The typical modern high-performance GPU housed in a server or workstation lies
on a graphics card which connects to the motherboard through an expansion slot
such as PCI Express. The GPU is controlled by the CPU but is logically separate
and executes independently. The graphics card features multiple gigabytes of high-
bandwidth random-access memory which is also separate from the system's main
memory, both logically and physically[8]. The peak bandwidth between the GPU and
the device memory is significantly greater than between the device memory and the
system memory. On an NVIDIA Tesla K20c for example, a high-end professional
graphics card targeted at the high performance computing market, the former is
208 GB/s while the latter is constrained to 8 GB/s by the PCI Express expansion
slot[9].

[8] Throughout this work, to simplify, we will refer to the CPU and its main memory as "CPU", and the GPU and the graphics card's memory as "GPU".
[9] The latest graphics cards, such as the NVIDIA Tesla K80, support the 3rd generation of PCI Express, which doubles the peak bandwidth with respect to the 2nd generation.
2.2.1.5. Scalability
The array of streaming multiprocessors which GPUs are composed of can be scaled to
provide more or less computing power depending on the target application. The mod-
ern programming models used for general-purpose computations on GPUs, which we
shall briefly cover in a later section, are designed to allow applications to transparently
scale to greater numbers of processing elements with little effort from the programmer.
2.2.2. General-Purpose Computing
2.2.2.1. From Fixed-Function to Programmable Shaders
In the early 2000s, GPU manufacturers such as NVIDIA started exposing programma-
bility in the graphics pipeline. The transition from so-called fixed-function pipelines,
which performed fixed - albeit to some extent parameterizable - functions at every
stage of the pipeline, provided application programmers with much more flexibility.
This gave them the ability to write shaders, programs used to process one stage of
the pipeline[10], to specify custom and more sophisticated rendering behavior (Blythe,
2008). These shaders are executed on every object (e.g. pixel fragments or vertices)
at a step of the pipeline, and can therefore be likened to programs invoked on each
point in an input space (Ostrovsky, 2010). For example, the basic linear algebra
floating-point computation SAXPY, consisting of a multiplication of a vector x of
length n by a scalar a and the addition of the result to the vector y , could be
programmed as follows on the CPU:
for (int i=0; i<n; ++i){
y[i] = a*x[i] + y[i];
}
An equivalent shader, executed on every element in the vector y on the GPU, would
simply consist of the following code:
y[idx] = a*x[idx] + y[idx];
where idx is the index of each of the n shading processors executing the shader.
While each shading processor performs a scalar operation on a single element of the
vector (the input space), the n processors taken together perform a vector operation.
One of the first documented uses of the GPU by the scientific research commu-
nity was Larsen and McAllister’s (2001) parallel matrix multiplication algorithm (Du
et al., 2012). The support for floating-point arithmetic in 2002-2003 opened up the
field of general-purpose computing on GPUs - or GPGPU - to a much wider range
of scientific applications requiring greater precision than what the GPU’s historical
integer support could provide. These first applications were developed using shading
languages such as NVIDIA’s Cg, Microsoft’s HLSL or OpenGL’s GLSL, which re-
mained within the realm of computer graphics and required application programmers
to remap their algorithms to graphics concepts[11]. The advent of NVIDIA's Compute
Unified Device Architecture, or CUDA, in 2006 provided developers with a software
environment allowing them to program GPUs using high-level general-purpose pro-
gramming languages such as C, C++ and Fortran, and bolstered the use of GPUs in
high performance computing.
[10] For example, pixel/fragment shaders compute the color values of pixel fragments in the fragment processing step of the pipeline (Ostrovsky, 2010).
[11] Simon Green (2008). CUDA 2.1 FAQ. [Online; accessed 26-Apr-2015]. URL: https://devtalk.nvidia.com/default/topic/402275/cuda-programming-and-performance/cuda-2-1-faq-please-read-before-posting/
2.2.2.2. CPU vs. GPU
As illustrated in Figure 3 by NVIDIA[12], GPUs devote a relatively greater share of
their chip area to processing elements (ALUs, arithmetic logic units) compared to
CPUs, at the expense of simpler flow control and fewer data caches.
Figure 3: CPU vs. GPU: Hardware
Furthermore, GPU cores typically run at a lower clock rate than CPU cores. For
example, each of the 2880 cores of the NVIDIA Kepler GK110 runs at 706 MHz,
versus 2500 MHz for each of the 16 cores of an AMD Opteron 6380 CPU. The GPU’s
focus is therefore not on single-thread performance, which is much poorer than on the
CPU, but rather on leveraging its massive number of cores to process many threads
at a time. The following interesting analogy was found on a wiki:
One way to visualize it is a CPU works like a small group of very smart
people who can quickly do any task given to them. A GPU is a large
group of relatively dumb people who aren’t individually very fast or smart,
but who can be trained to do repetitive tasks, and collectively can be more
productive just due to the sheer number of people.[13]
Figure 4 illustrates the evolution of the performance of NVIDIA GPUs versus that of
Intel CPUs, measured in terms of their number of floating-point operations per second
(FLOPS), as reported by NVIDIA[14]. It clearly shows that the GPU has significantly
more theoretical computing power than the CPU, with the divide increasing with
time.
Taken in isolation, the GPU’s substantial performance is a necessary but no longer
sufficient condition for consideration by the high performance computing community.
As the operational costs of supercomputers are now commensurate with acquisition
costs, FLOPS per watt has replaced absolute FLOPS as a central measure of system
performance (Farber, 2015). In addition to decreasing execution time, GPUs have
been praised as more energy efficient than CPUs in many situations (Mittal and
Vetter, 2014). For example, thanks to its hybrid CPU-GPU architecture, the Swiss
National Supercomputing Center’s (CSCS) Piz Daint supercomputer was able to
divide the energy consumed by climate simulations by seven relative to its pure CPU
predecessor15
.
[12] NVIDIA Inc. CUDA C Programming Guide. Version 7.0. URL: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[13] Why a GPU mines faster than a CPU. [Online; accessed 05-May-2015]. URL: https://en.bitcoin.it/wiki/Why_a_GPU_mines_faster_than_a_CPU
[14] Image from NVIDIA Inc. CUDA C Programming Guide. Version 7.0. URL: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[15] Swiss National Supercomputing Centre (2014). Fact Sheet - Piz Daint, the first supercomputer with sustained petaflops-scale performance in Switzerland. [Online; accessed 28-Apr-2015]. URL: http://www.cscs.ch/uploads/tx_factsheet/FSPizDaint_2014_E.pdf
Figure 4: CPU vs. GPU: Performance
The GPU’s better energy efficiency on well suited applications stems from the fact
that it requires much less time to finish a given task, leading to a lower energy con-
sumption despite a higher power consumption. Indeed, GPUs consume large amounts
of power; for example, the NVIDIA Tesla K20c draws up to 225W of power[16] com-
pared to 115W for an AMD Opteron 6380 CPU. Nonetheless, GPU manufacturers
are paying close attention to the power consumption of their architectures and deliv-
ering substantial improvements with each new generation. The energy efficiency of
NVIDIA’s latest GPU architecture, Maxwell, is double that of the previous generation
Kepler[17], itself triple that of its predecessor Fermi[18].

[16] NVIDIA Inc. Tesla K20 GPU Accelerator - Board Specification. [Online; accessed 27-Apr-2015]. URL: http://international.download.nvidia.com/tesla/pdf/tesla-k20-passive-board-spec.pdf
[17] NVIDIA Inc. Whitepaper - NVIDIA GeForce GTX 750 Ti. [Online; accessed 26-Apr-2015]. URL: http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce-GTX-750-Ti-Whitepaper.pdf
[18] NVIDIA Inc. Whitepaper - NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110. [Online; accessed 27-Apr-2015]. URL: http://www.nvidia.com/content/PDF/kepler/NVIDIA-kepler-GK110-Architecture-Whitepaper.pdf
2.2.2.3. Tools
The ongoing democratization of GPUs in high performance scientific computing would
not have been possible without the emergence of tools allowing non-specialists to make
use of them. These tools fall into three categories[19]:
• directive-based application programming interfaces (APIs) such as OpenACC
and OpenMP[20],
• algorithm libraries such as Thrust, NVIDIA cuBLAS and NVIDIA cuFFT, and
• programming language extensions such as CUDA and OpenCL.

[19] NVIDIA Inc. Learn Parallel Programming. [Online; accessed 03-May-2015]. URL: http://www.nvidia.com/object/learn-parallel-programming.html
[20] OpenMP supports accelerators since Version 4.0. OpenMP ARB Corporation. Frequently Asked Questions on OpenMP. [Online; accessed 03-May-2015]. URL: http://openmp.org/openmp-faq.html#Accel
Based on the premise that scientific researchers outside the field of computer science
- and in particular in economics - will wish to leverage the GPU either in existing
applications, or with a minimal learning curve and within a software framework they
are comfortable with, this study focuses on the OpenACC directive-based API and
the Thrust parallel algorithms library.
In the rest of this subsection, we briefly present OpenACC and Thrust, as well as
CUDA, on which both are based and which occupies a central role in GPGPU.
OpenACC The OpenACC API - for Open Accelerators - consists of compiler di-
rectives, library routines and environment variables which collectively enable offload-
ing C, C++ and Fortran-based programs from a host CPU to an attached accelerator
device (OpenACC.org, June 2013). Programmers guide[21] the OpenACC-compatible
compiler on how to offload sections of code to the GPU by inserting recognized
directives ( #pragma in C and C++ ). For example, our SAXPY
example can easily be parallelized as follows:
#pragma acc parallel
#pragma acc loop
for(int i=0; i<n; ++i){
y[i] = a*x[i] + y[i];
}

[21] Since directives are not prescriptive but rather provide the compiler with hints on desirable behavior, pragma-based programming can be thought of as a negotiation between the developer and the compiler. Rob Farber. Easy GPU Parallelism with OpenACC. [Online; accessed 03-May-2015]. URL: http://www.drdobbs.com/parallel/easy-gpu-parallelism-with-openacc/240001776
The parallel and loop constructs are fundamental to OpenACC and deserve a brief
explanation:
• The parallel construct begins parallel, redundant and independent execution
of the code within the parallel region by each “execution unit”. It may also
specify what data should be copied in and out of the GPU, which data should
be created and deleted, and which data is already present on the GPU.
• The loop construct describes what type of parallelism should be used to execute
the loop. In particular, it allows partitioning the work across “execution units”
to speed up execution, instead of having each “execution unit” executing the
entire workload redundantly.
We have put "execution unit" in quotation marks since OpenACC hides the under-
lying hardware from the programmer, providing him or her instead with an abstract
execution model based on three levels of parallelism. This allows the mapping to a
generic architecture consisting of a set of multi-threaded vector processors, such as
GPUs and their array of streaming multiprocessors. The first level of parallelism,
gangs, provides coarse-grain parallelism, with each gang consisting of one or more
workers, the second level. Finally, each worker may also consist of one or more vector
lanes, the third level, which provide fine-grain SIMD parallelism. The
parallel construct spawns one or more gangs, while the loop construct distributes
work across gang, worker and vector parallelism.
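As a hedged sketch (the clause values are illustrative, not tuned), the SAXPY loop can explicitly request a work distribution across these levels:
#pragma acc parallel num_gangs(256) vector_length(128)
#pragma acc loop gang vector
for(int i=0; i<n; ++i){
    y[i] = a*x[i] + y[i];
}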
OpenACC-compatible compilers can spare the programmer from explicitly man-
aging data transfers between CPU and GPU memory by automatically moving data
where necessary, while allowing the programmer to override its default behavior. In
the SAXPY example above, the compiler ensures that the scalar a and the arrays
y and x are copied to and from the accelerator. Data directives can be used to
explicitly specify that only the result vector y should be copied in and out of the
GPU, while a and x should only be copied in:
#pragma acc parallel copy(y[0:n]) copyin(x[0:n],a)
#pragma acc loop
for(int i=0; i<n; ++i){
y[i] = a*x[i] + y[i];
}
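When several offloaded loops reuse the same arrays, a structured data region can keep them on the GPU between kernels instead of transferring them for each loop (a hedged sketch; the second loop is hypothetical):
#pragma acc data copy(y[0:n]) copyin(x[0:n],a)
{
    #pragma acc parallel loop
    for(int i=0; i<n; ++i){
        y[i] = a*x[i] + y[i];
    }
    #pragma acc parallel loop
    for(int i=0; i<n; ++i){
        y[i] = y[i]*y[i];    // a second, hypothetical pass over y
    }
}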
By requiring only minor modifications to the codebase, without the need for the
programmer to manage low-level details such as data locality, OpenACC arguably
provides the simplest way for scientists to leverage the GPU. Finally, since it is
based on directives which are simply ignored by incompatible compilers, OpenACC
also allows applications to be ported to systems without accelerators or without an
OpenACC-compatible compiler.
Enabling OpenACC in the PGI compiler on a suitably equipped and configured
system simply involves compiling the program with the -acc flag. OpenACC doesn’t
actually specify how its directives and programming model should be implemented to
offload work from the host to the device. Rather, this is managed by the OpenACC
API-enabled compiler and runtime environments. The PGI compiler uses CUDA for
NVIDIA GPUs, and AMD OpenCL for AMD GPUs[22].

[22] NVIDIA Inc. PGI Compiler User's Guide. Version 2015. URL: http://www.pgroup.com/doc/pgiug.pdf
Thrust Thrust is a library of parallel algorithms and data structures based on
the C++ Standard Template Library (STL). It provides high performance applications
with a high-level interface to technologies such as C++ , CUDA, OpenMP and TBB to
accelerate sorting, transformations, reductions, scans, etc. on parallel architectures
such as GPUs, multiprocessor and multi-core processor systems[23]. Thrust observes
a practically identical syntax to that of STL algorithms, and can therefore be easily
deployed in applications already leveraging STL algorithms with minor modifications.

[23] Thrust Wiki. [Online; accessed 03-May-2015]. URL: https://github.com/thrust/thrust/wiki
For example, the SAXPY example can be performed with the std::transform algo-
rithm and a custom functor saxpy as follows:
std::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy(a));
22
NVIDIA Inc. PGI Compiler User’s Guide. Version 2015. url: http://www.pgroup.com/doc/
pgiug.pdf
23
Thrust Wiki. [Online; accessed 03-May-2015]. url: https://github.com/thrust/thrust/wiki
13
With minor changes to the functor saxpy and the types of the vectors x and y ,
Thrust provides a corresponding thrust::transform algorithm which functions iden-
tically:
thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy(a));
Thrust algorithms are essentially parallelized versions of the corresponding STL
algorithms[24]. For example, Thrust "delivers 5x to 100x faster sorting performance
than STL"[25] and therefore offers an easy way to accelerate an existing application.

[24] The work complexity of Thrust algorithms is the same as that of the STL algorithms, according to Jared Hoberock (creator of the Thrust library) in an email conversation with the author of this thesis.
[25] Thrust. [Online; accessed 15-Apr-2015]. URL: https://developer.nvidia.com/Thrust
Executing Thrust algorithms on an NVIDIA GPU on a suitably equipped and
configured system requires compiling the code with nvcc and including the relevant
Thrust header files.
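As a hedged illustration of these "minor changes" (a sketch of ours, not the thesis code), a functor usable with both calls might look as follows; the __host__ __device__ annotation is what lets Thrust invoke it on the GPU when the code is compiled with nvcc:
struct saxpy
{
    const float a;
    saxpy(float a_) : a(a_) {}
    __host__ __device__
    float operator()(const float& x, const float& y) const
    {
        return a*x + y;   // one SAXPY element: a*x[i] + y[i]
    }
};

With std::vector and std::transform this functor runs on the CPU; switching the containers to thrust::device_vector and the call to thrust::transform moves the same computation to the GPU.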
CUDA While we do not use it directly in this thesis, CUDA underlies both Ope-
nACC and Thrust. Furthermore, it seemingly dominates the current high perfor-
mance computing landscape[26] and therefore deserves mentioning.

[26] Recent statistics by the high performance computing on graphics processing units website suggest that NVIDIA and CUDA are the dominant GPU hardware platform and language used by the HPC community. URL: http://hgpu.org/?page_id=3529
The core of CUDA consists of extensions to the C language which allow func-
tions, called kernels, to be executed in parallel on the GPU by numerous threads.
Kernels are functions defined using the __global__ specifier and are launched with
the <<<...>>> execution configuration syntax. Much like OpenACC, CUDA defines
a three-level hierarchy for threads. A kernel is executed as a grid of one or more
thread blocks, the first level of parallelism, which partition the problem into coarse
sub-problems. Each thread block consists of one or more warps, the second level,
which consist of 32 threads operating in SIMD lockstep[27], the third level. Concep-
tually, CUDA thread blocks map to OpenACC gangs, warps to workers and threads
to vector lanes[28]. The execution configuration syntax specifies the grid and block
dimensions. For example, function<<<10,128>>>(...) launches the kernel function
on a grid of 10 blocks of 128 threads each (128/32 = 4 warps per block).

[27] NVIDIA actually refers to their SIMD architecture as Single Instruction, Multiple Threads (SIMT). This is due to the fact that the SIMD operation is not exposed to the programmer, who focuses on the scalar behavior of individual threads rather than reasoning in terms of vector instructions. This allows her for example to specify branch divergence at the thread level, while ignoring the SIMD behavior.
[28] In practice, the PGI compiler maps OpenACC gangs to 2-dimensional CUDA blocks, workers are mapped to the second dimension of the blocks and vector lanes to the first. (Mathew Colgrove, PGI User Forum, URL: http://www.pgroup.com/userforum/viewtopic.php?t=4716)
Our SAXPY example could take the following form in CUDA29
:
__global__ void saxpy(int n, float a, float *x, float *y)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < n)  // guard threads beyond the end of the arrays
        y[idx] = a*x[idx] + y[idx];
}
...
int n = 1048576; // 4096*256
float *d_x, *d_y;
cudaMalloc(&d_x, n*sizeof(float));
cudaMalloc(&d_y, n*sizeof(float));
cudaMemcpy(d_x, x, n*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, n*sizeof(float), cudaMemcpyHostToDevice);
saxpy<<<4096,256>>>(n, 2.0f, d_x, d_y);
cudaMemcpy(y, d_y, n*sizeof(float), cudaMemcpyDeviceToHost);

After allocating device arrays d_x and d_y and copying the host arrays x and y to
them through calls to CUDA's cudaMemcpy function, the saxpy kernel is executed on
a grid of 4096 blocks of 256 threads each (256/32 = 8 warps per block). Each thread
is assigned an index idx and operates on a single element of the result vector. As the
kernel is essentially a custom shader, invoked on all points in the input space (all
elements of the vectors x and y), the code above is essentially the same as the SAXPY
shader seen in paragraph 2.2.2.1.

24 The work complexity of Thrust algorithms is the same as those of the STL algorithms, according to Jared Hoberock (creator of the Thrust library) in an email conversation with the author of this thesis.
25 Thrust. [Online; accessed 15-Apr-2015]. url: https://developer.nvidia.com/Thrust
26 Recent statistics by the high performance computing on graphics processing units website suggest that NVIDIA and CUDA are the dominant GPU hardware platform and language used by the HPC community. URL: http://hgpu.org/?page_id=3529
27 NVIDIA actually refers to their SIMD architecture as Single Instruction, Multiple Threads (SIMT). This is due to the fact that the SIMD operation is not exposed to the programmer, who focuses on the scalar behavior of individual threads rather than reasoning in terms of vector instructions. This allows her for example to specify branch divergence at the thread level, while ignoring the SIMD behavior.
28 In practice, the PGI compiler maps OpenACC gangs to 2-dimensional CUDA blocks, workers are mapped to the second dimension of the blocks and vector lanes to the first. (Mathew Colgrove, PGI User Forum, URL: http://www.pgroup.com/userforum/viewtopic.php?t=4716)
29 Mark Harris. Six Ways to SAXPY. [Online; accessed 04-May-2015]. url: http://devblogs.nvidia.com/parallelforall/six-ways-saxpy/
2.2.3. Applications
Houston (n.d.) gives a few criteria for determining the suitability of an application to
GPU computing:
1. Large data sets
2. High parallelism
3. Minimal dependencies between data elements
4. High ratio of arithmetic to memory operations (arithmetic intensity)
5. Large amount of work without CPU intervention
In summary, applications which are data parallel and throughput intensive are well
suited for processing on the GPU.
Popular areas for GPU computing include medical imaging, computational fluid
dynamics, environmental science and computational finance, among many others.
However, relatively few applications have been reported in the field of economics
(Aldrich, 2014). Much of the scarce literature has been written in the areas of
likelihood estimation (Creel, Kristensen, et al. (2011), Creal (2012)) and Monte-Carlo
simulation (A. Lee et al. (2009), Geweke and Durham (2011), Durham and Geweke
(2013)). The two most direct applications of GPGPU in computational economics
are Aldrich et al.'s (2011) real business cycle model with value function iteration,
and Aldrich's (2012) general equilibrium asset pricing model with heterogeneous
beliefs. One of the goals of this paper is to provide another application of the GPU in
computational economics, focusing in particular on an agent-based model.
2.2.3.1. Agent-Based Models
Agent-based models are used to study systems which:
1. consist of interacting agents; and
2. exhibit emergent properties.
Emergent properties arise from agents’ interactions and cannot be deduced by simply
aggregating the individual agents’ properties (Axelrod and Tesfatsion, n.d.). In many
circumstances, such as when agents adapt their behavior in reaction to endogenous
or exogenous perturbations, the dynamics and emergent properties of these systems
cannot be (fully) described by mathematical analysis. In this case, simulation through
agent-based models may be the only viable and practical method.
Agents are generic entities such as traders, insects, governments, stem cells or
companies. Their behavior is based on rules which can range from the simple (a
discrete set of if-then behaviors) to the complex (behaviors which are state-dependent,
anticipatory, intertemporal, etc.). Due to the system’s emergent nature, even simple
behavioral rules can lead to unexpectedly rich global system behaviors (Borrill and
Tesfatsion, 2011).
Agent-based models are used in different fields of research such as biology, sociology
and economics. They are well suited to simulate economic systems such as financial
markets or entire economies, where researchers seek to understand how individuals
behave and interact with one another, as well as the impact of these interactions on the
system as a whole. These models can provide more complex and realistic dynamics
than the classical Walrasian equilibrium model, at the cost of analytical tractability
(Tesfatsion, 2006).
This paper focuses on Raberto et al.'s (2001) agent-based financial market, in which
agents are traders randomly buying or selling a single asset. At every discrete time
step, agents submit orders constrained by their available cash and asset holdings. A
herding phenomenon is introduced to model agent aggregation as the clustering of
nodes in a random graph. The model's key ingredient is its market clearing
mechanism, which specifies how demand and supply are matched through a clearing
price. The resulting price process exhibits some stylized facts of financial time series
such as fat tailed returns and volatility clustering.
Section 3 defines the model in detail.
3. Genoa Artificial Stock Market
Raberto et al., 2001 introduce an artificial stock market model called the Genoa
Artificial Stock Market (GASM). Their model consists of many heterogeneous agents
trading a single stock30 randomly against cash in discrete time. The aggregate amount
of cash in the market is constant and is set exogenously at the beginning of the
simulation. A market clearing mechanism specifies how supply and demand are
matched through a clearing price. Finally, although the authors introduce a clustering
phenomenon to model agent herding, this feature is not implemented in this thesis31.
The model is able to reproduce stylized facts such as fat tailed returns, zero
autocorrelation of returns and serial autocorrelation of volatility.
3.1. Order Generation
At each time step (or tick) t = 1, . . . , T, agent i = 1, 2, . . . , N holds Ci(t) and Ai(t)
units of cash and stock, respectively (the starting values Ci(0) and Ai(0) ∀i are model
parameters). With probability Pi, agent i submits a buy order, and with probability
1 − Pi a sell order. An order consists of a limit price pi > 0 and a quantity qi, which
is positive for buy orders and negative for sell orders.

Table 1: Order Generation Summary

        Buy order                    Sell order
pi      p(t − 1) ∗ Ni(µ, σi)         p(t − 1) / Ni(µ, σi)
qi      ri ∗ Ci(t − 1) / pi          −ri ∗ Ai(t − 1)
3.1.1. Limit Price
The limit price pi is obtained by multiplying (buy order) or dividing (sell order)
the prevailing asset price p(t − 1) by a normally-distributed random32 scaling
factor Ni(µ, σi) ∼ N(µ, σi), and is the maximum (buy order) or minimum (sell order)
acceptable price at which the order can be matched.
The standard deviation σi of the random scaling factor is given by σi := k ∗ σ(Ti),
where k is a positive constant and σ(Ti) (the agent sigma) is the standard deviation
of log returns in the agent's time window Ti (e.g. 20 steps), defined precisely in
Appendix subsection A.1. This introduces a link between asset prices – specifically
volatility – and the distribution of limit prices33. Furthermore, since Ti is agent-specific,
this also introduces heterogeneity across agents and allows for more or less
myopic behavior within the market. For µ > 1, the limit prices are on average greater
than (buy orders) or less than (sell orders) the prevailing asset price, reflecting the
will of trading agents for their orders to be cleared.

30 Despite the name, the model does not actually focus exclusively on stock markets. In fact, the stock in the model can be understood to be any asset.
31 Since the focus of this thesis lies on computation rather than economic modeling, and given the time constraints that were imposed on the author, this was done for the sake of time. It is not clear to what extent abandoning agent herding weakens the model's ability to exhibit stylized facts such as volatility clustering or fat tailed returns. In any case, setting aside the clustering phenomenon does not detract substantially from this thesis' ability to investigate the impact of GPU computing on an agent-based financial market simulation.
32 All random variables are independent from each other across both agents and time. We assume a standard filtered probability space (Ω, F, P).
We define the set P of limit prices in the market as follows:
P = {pi, i = 1, . . . , N}. (1)
3.1.2. Quantity
The quantity qi of a buy order is a random fraction ri of the number of stocks which
could be bought by the agent at her price pi given her cash holdings Ci(t − 1). The
quantity qi of a sell order is a random fraction ri of her stock holdings Ai(t − 1).
3.2. Market Clearing
Once the order book has been generated, the market must be cleared before trade
can occur. This typically involves finding a price, the clearing (or equilibrium) price,
at which demand and supply are equal, i.e. at which the total amount of stock sold
is equal to the amount of stock bought. The demand ft(p) (supply gt(p)) function at
a price p is defined as the sum of all positive (negative) order quantities with limit
prices greater (less) or equal to p (see subsection A.2 in the Appendix).
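In symbols (our rendering of the definitions spelled out in Appendix subsection A.2, using the order notation of subsection 3.1):

ft(p) = Σ_{i : qi > 0, pi ≥ p} qi,        gt(p) = − Σ_{i : qi < 0, pi ≤ p} qi.

Since sell quantities qi are negative, the leading minus sign makes the supply gt(p) a positive number.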
3.2.1. Clearing Price
Ideally, we would like to find a price p∗ ∈ P such that ft(p∗) = gt(p∗), that is, at
which demand and supply are equal. However, since ft and gt are step functions, such
a price generally does not exist.
The authors specify a simple market clearing mechanism in which the clearing
price p∗ is set to be "the price [...] at which the two functions [ft(p) and gt(p)] cross"
(Raberto et al., 2001). While the authors do not precisely define this mechanism
mathematically, we consider below the various scenarios that can occur in the
order book and provide a definition of the clearing price in each. Note that precise
mathematical definitions can be found in the Appendix, subsection A.3.
3.2.1.1. Standard Order Books
Let us consider the two standard order books illustrated in Figure 5, where the supply
and demand functions “cross” at a single point. Since there is no price p ∈ R at which
demand and supply are equal, we define the clearing price as the limit price at which
the supply function “jumps over” the demand function (Standard 1) or vice versa
(Standard 2).
33 As noted by Raberto et al., this is a consequence of trading psychology, and the impact of high volatility on uncertainty as to the true price.
Figure 5: Standard Order Books. [(a) Standard 1 and (b) Standard 2: plots of the demand ft(p) and supply gt(p) functions against price, with the equilibrium point marked.]
3.2.1.2. Pathological Order Books
In certain rare cases34, two types of pathological order books may arise: the first (of
which there are four variants) in which the demand and supply functions are equal
for certain prices, and the second in which there exists no equilibrium price.
Demand Equals Supply  Figure 6 illustrates the four variants of the pathological
case in which the demand and supply functions share a common horizontal segment
with abscissas p∗1 and p∗2. In this case, there exists an interval of prices p ∈ R for
which demand and supply are equal35, and the authors define the equilibrium price
as the segment's midpoint:

p∗ = (p∗1 + p∗2) / 2.    (2)

Note that in this case, p∗ ∉ P.
No Equilibrium  An additional pathological case may arise in which no equilibrium
exists, that is, in which the supply and demand functions simply do not cross.
This may occur if the demand at the highest limit price is greater than the supply at
that price, or if the demand at the lowest limit price is less than the supply at that
price. In this case, the tick is discarded and a new tick begins.
3.2.2. Discarding Excess Demand
Once the clearing price is found, the excess demand ft(p∗) − gt(p∗) at that price is
discarded, ensuring that the total number of stocks in the market remains constant.
The authors do this by randomly choosing and removing ft(p∗) − gt(p∗) stocks from
cleared buy orders (pi > p∗) in case of excess demand ft(p∗) > gt(p∗), or from cleared
sell orders (pi < p∗) in case of excess supply ft(p∗) < gt(p∗).
34 On a trial run of 1 000 000 ticks of the GASM, pathological order books occurred in less than 0.5% of ticks.
35 However, it is possible that no limit price p ∈ P at which demand and supply are equal exists, such as in Pathological 4.
Figure 6: Pathological Order Books. [(a) Pathological 1, (b) Pathological 2, (c) Pathological 3 and (d) Pathological 4: plots of the demand ft(p) and supply gt(p) functions against price, with the equilibrium marked.]
3.3. Trade Settling
Once the market has been cleared, i.e. a clearing price p∗ = p(t) has been found and
the excess demand or supply at that price has been discarded, trades can actually be
settled. This process involves exchanging cash for shares between the agents whose
orders have been cleared for trade at the clearing price. We start by nullifying all
uncleared orders:

qi = { qi   if qi > 0, pi ≥ p(t)  or  qi < 0, pi ≤ p(t);
       0    otherwise },                i = 1, . . . , N.    (3)

The cash holdings Ci(t) and share holdings Ai(t) of an agent i are then defined as
follows:

Ci(t) = Ci(t − 1) − qi ∗ p(t),    (4)
Ai(t) = Ai(t − 1) + qi.    (5)
Since the order quantity qi is positive for buy orders, negative for sell orders and
null for uncleared orders, a buying agent has cash subtracted from her cash holdings
and shares added to her share holdings, and vice-versa for a selling agent, while the
holdings of an uncleared agent remain unchanged.
4. Serial Implementation
The GASM is relatively straightforward to implement. While a small amount of work
must be done once at the beginning of the model, we focus on the core of the model
which performs the actual simulation. This core consists of a few tasks performed at
each tick, and is described in pseudocode below.
Version 0 GASM core
 1: for all ticks do
 2:   Calculate all agent sigmas
 3:   while p∗ not found do
 4:     Generate U(0, 1) and N(µ, σi) pseudorandom numbers
 5:     Generate order book {(qi, pi), i = 1, . . . , N}
 6:     Find market clearing price p∗ and remove excess demand kt(p∗)
 7:       based on excess demand, demand and supply function evaluations
 8:   end while
 9:   Nullify uncleared orders
10:   Modify agents' cash and share holdings
11:   Calculate and store output variables (price, log return and volume)
12: end for
In this section, we explain how the core of the model is implemented for sequential
execution on the CPU using the C++ programming language. In particular, we focus
on the key tasks which will later be ported to the GPU.
4.1. Order Book Generation
At the beginning of each tick, agents must generate and submit orders which consti-
tute the order book. For each agent, this process involves three functions:
1. Calculate the agent sigma σi, i.e. the standard deviation of log returns over the
agent’s time window Ti,
2. Generate one normally distributed N(µ, σi) and two standard uniformly dis-
tributed U(0, 1) pseudorandom numbers,
3. Calculate and submit the order’s quantity qi and limit price pi.
4.1.1. Agent Sigmas
The calculation of agent sigmas consists in a loop over all unique time windows
Ti. Since there might be fewer unique time windows Ti than agents (such as in
(Raberto et al., 2001) where all agents have the same time window), we calculate the
standard deviation of log returns for every unique time window to avoid redundant
calculations. The constant vector of unique time windows UNIQWIN is computed once
at the beginning of the simulation.
// Calculate standard deviations for all unique time windows
for(int i=0; i<nUniques; ++i){
    int window = UNIQWIN[i];
    // Look as far back as possible
    // (might be less than time window at beginning of simulation)
    if (tick <= maxWindow+2){
        window = (window < tick-1) ? window : tick-1;
    }
    UNIQSIG[i] = k*standard_deviation(LOGRET, tick-1-window, tick-1);
}
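The standard_deviation helper is not listed in the text; the following is a hypothetical sketch consistent with how it is called above, computing the sample standard deviation of the log returns LOGRET[from], . . . , LOGRET[to−1] (whether the thesis uses the sample or population estimator is not stated):

#include <cmath>

// Assumed helper: sample standard deviation over a half-open index range.
float standard_deviation(const float *LOGRET, int from, int to){
    int n = to - from;
    if (n < 2) return 0.0f;
    float mean = 0.0f;
    for (int i = from; i < to; ++i) mean += LOGRET[i];
    mean /= n;
    float ss = 0.0f;
    for (int i = from; i < to; ++i) ss += (LOGRET[i] - mean)*(LOGRET[i] - mean);
    return std::sqrt(ss / (n - 1));
}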
4.1.2. Pseudorandom Number Generation
Once all unique agent sigmas have been calculated, we can proceed to generating
pseudorandom numbers (PRNs) for the order generation. For each agent, we generate
three standard uniform PRNs using the rand() function from the C standard library.
Two are used as is for the random fraction of cash/asset holdings ri and to determine
whether the agent is a buyer or a seller, while the third is transformed into a standard
normal N(0, 1) PRN using the Marsaglia polar method36 for the random scaling factor
N(µ, σi).
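For reference, a minimal sketch of the Marsaglia polar method (ours, not the thesis' exact code) is given below; note that the rejection loop may consume more than one pair of uniforms per pair of normals:

#include <cstdlib>
#include <cmath>

// Marsaglia polar method sketch: draws a pair of independent standard normal
// variates from uniform variates, rejecting points outside the unit circle.
void marsaglia_polar(float& n1, float& n2){
    float u, v, s;
    do {
        u = 2.0f*rand()/(float)RAND_MAX - 1.0f; // uniform on (-1, 1)
        v = 2.0f*rand()/(float)RAND_MAX - 1.0f;
        s = u*u + v*v;
    } while (s >= 1.0f || s == 0.0f);
    const float factor = std::sqrt(-2.0f*std::log(s)/s);
    n1 = u*factor; // n1 and n2 are independent N(0, 1) variates
    n2 = v*factor;
}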
4.1.3. Limit Price and Quantities
We are now ready to calculate limit prices and quantities for all agents. Note that
since the normal distribution has support x ∈ R, the random scaling factor Ni(µ, σi)
and therefore the limit price pi may be negative. While Raberto et al. do not specify
the model's behavior in this case, we decide to switch the agent type if the random
scaling factor is negative (i.e. a seller with a negative price would switch to a buyer
and vice-versa). This, however, implies a skew in the agent type distribution with
respect to the model parameters, i.e. P[Agent i is a buyer] ≠ Pi.
for(int i=0; i<nAgents; ++i){
    float price, scaling;
    int quantity;
    // UNIQIDX[i] is the index of the agent's sigma in UNIQSIG
    scaling = mu + sqrt( UNIQSIG[ UNIQIDX[i] ] )*RANDN[i];
    // Generate order limit prices and quantities
    if ( RANDU[2*i] <= PROBAS[i] ){
        if ( scaling > 0 ){
            // Agent is buyer
            price = lastPrice*scaling;
            quantity = floor( RANDU[2*i+1]*CASH[i]/price );
        } else {
            // Negative scaling factor: agent switches to seller
            price = -lastPrice/scaling;
            quantity = -floor( RANDU[2*i+1]*SHARES[i] );
        }
    } else {
        if ( scaling > 0 ){
            // Agent is seller
            price = lastPrice/scaling;
            quantity = -floor( RANDU[2*i+1]*SHARES[i] );
        } else {
            // Negative scaling factor: agent switches to buyer
            price = -lastPrice*scaling;
            quantity = floor( RANDU[2*i+1]*CASH[i]/price );
        }
    }
    // Submit orders with nonzero quantities
    if ( quantity != 0 ){
        ORDERSP[i] = price;
        ORDERSQ[i] = quantity;
    } else {
        ORDERSP[i] = 0;
        ORDERSQ[i] = 0;
    }
    INDICES[i] = i;
}

36 This method generates a pair of standard normal random variables using a pair of standard uniform random variables, and is therefore performed on agents pairwise.
The orders are then submitted to (stored in) the order book consisting of two
arrays ORDERSQ and ORDERSP of length nAgents , in the order of agents’ indices; row
0 for agent 0, row 1 for agent 1, etc. (we refer to this as being in “agent order”).
Each agent’s index i is also stored in the INDICES array, which we use later in the
implementation of the market clearing mechanism.
4.2. Market Clearing
Once all orders have been generated and stored, the market clearing price p∗ must be
found (if it exists). We have seen in subsection 3.2 that this isn't entirely
straightforward to do since we are looking for the point at which two step functions "cross". To
simplify the problem to a single dimension, let us define the excess demand function
kt(p) as the difference between demand and supply:

kt(p) := ft(p) − gt(p).    (6)

We are trying to find the limit price at which the demand function crosses under the
supply function, i.e. at which the excess demand kt jumps from positive to negative
(or to zero in case of a pathological order book in which demand equals supply). Let
us consider the smallest limit price p0 ∈ P at which excess demand is nonpositive37:

p0 := min{p ∈ P : kt(p) ≤ 0}.    (7)
37 Note that this limit price p0 ∈ P is not necessarily the first price p ∈ R at which excess demand is nonpositive. In Pathological 3, for example, the first limit price at which excess demand is nonpositive is ≈ 86.3, while the first price at which it is nonpositive is the limit as we approach ≈ 85.9 from the right.
Let us also define the first limit prices smaller and greater than p0, denoted p0^before
and p0^after respectively, as well as the midpoints between p0 and these two limit prices:

p0^before := max{p ∈ P : p < p0},    (8)
p0^after := min{p ∈ P : p > p0},    (9)
p0,mid^before := (p0^before + p0) / 2,    (10)
p0,mid^after := (p0 + p0^after) / 2.    (11)
It can be shown that we can uniquely determine the market clearing price
p∗ ∈ {p0^before, p0,mid^before, p0, p0,mid^after} by evaluating the sign of the excess demand
function kt(p) at just the two prices p0,mid^before and p0 (see Appendix B). Finding the
clearing price p∗ hence boils down to the following computational problem:
1. Find p0,
2. Find p0^before and p0^after and calculate p0,mid^before and p0,mid^after,
3. Evaluate the excess demands kt(p0,mid^before) and kt(p0),
4. Set p∗ according to the signs of kt(p0,mid^before) and kt(p0) and Table 12.
While steps 3 and 4 are straightforward to implement, the search problems in steps
1 and 2 are nontrivial and can be solved in multiple ways. Before detailing the three
Algorithms considered to solve steps 1 and 2, let us first begin by implementing the
excess demand function kt(p).
4.2.1. Excess Demand
The excess demand function consists of a for loop over all order quantities, summing
those with positive quantities and limit prices greater than or equal to the input price,
or with negative quantities and limit prices less than or equal to price:

int excessdemand(const float *ORDERSP, const int *ORDERSQ, int nAgents,
                 const float &price){
    int sum = 0;
    for (int i=0; i<nAgents; ++i){
        if ( (ORDERSQ[i] > 0 && ORDERSP[i] >= price) ||
             (ORDERSQ[i] < 0 && ORDERSP[i] <= price) ){
            sum += ORDERSQ[i];
        }
    }
    return sum;
}
Since buy orders (demand) have positive quantities and sell orders (supply) have
negative quantities, we obtain the difference between the two, the excess demand, by
simply summing the positive quantities of all compatible buy orders with the negative
quantities of all compatible sell orders38. We also define demand and supply functions,
used to check the existence of a market clearing price, which differ from the above
only in their if statement condition.

38 By definition of the limit price, a buy/sell order is compatible with a given price if its limit price is greater/lower than this price.
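For illustration, a sketch of the corresponding demand function consistent with these definitions might look as follows (the supply function is symmetric); this is our rendering, not a listing from the thesis:

// Sketch of a demand function: sums buy orders only, i.e. positive
// quantities whose limit prices are greater than or equal to the input price.
int demand(const float *ORDERSP, const int *ORDERSQ, int nAgents,
           const float &price){
    int sum = 0;
    for (int i=0; i<nAgents; ++i){
        if ( ORDERSQ[i] > 0 && ORDERSP[i] >= price ){
            sum += ORDERSQ[i];
        }
    }
    return sum;
}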
4.2.2. Algorithm A
Algorithm A is conceptually the simplest. It consists of the following steps:
1. Evaluate the excess demand function at every limit price and store the results
   in an array EXDEM,
2. Perform a linear search on EXDEM for the largest nonpositive value kt(p0) and
   return the corresponding limit price p0 from ORDERSP,
3. Perform two linear searches on EXDEM for
   • the smallest value greater than 0, i.e. kt(p0^before), and
   • the largest value smaller than kt(p0), i.e. kt(p0^after),
   and return the corresponding limit prices p0^before and p0^after.
We use two algorithms from the C++ Standard Library for the linear searches:
std::max_element to find kt(p0) and kt(p0^after), and std::min_element to find kt(p0^before).
Since we must be able to evaluate an order's limit price in case two orders have the
same excess demand, the searches are all performed "on" the INDICES array. For
example, the following code returns an iterator iter1 to the index of p0 in INDICES:

// Find greatest limit price at which the excess demand is less or equal
// to zero, using a full enumeration algorithm
std::vector<int>::iterator iter1 = std::max_element(INDICES,
    INDICES+nAgents, cmp_max(EXDEM, ORDERSP, 0, 0));
We supply custom comparison functions (in the above example cmp_max) to:
1. compare the excess demands (and limit prices when necessary) at two given
   indices and indicate which of the two is larger, and
2. ensure that these algorithms return the greatest value smaller than a given value
   (e.g. 0 for kt(p0), or kt(p0) for kt(p0^after)), or the smallest value greater than a
   certain value (e.g. 0 for kt(p0^before)). Otherwise, these algorithms would simply
   return the absolute greatest or smallest value. A sketch of what such a
   comparison functor might look like is given below.
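The thesis does not list the body of cmp_max; the following is a hypothetical sketch consistent with the description above. In particular, the roles of the two trailing integer arguments in the call cmp_max(EXDEM, ORDERSP, 0, 0) are our assumption (a bound and an unused placeholder), and the tie-breaking direction on limit prices is a guess.

// Hypothetical cmp_max functor (not the thesis' actual code): orders indices
// so that std::max_element returns the index of the largest nonpositive
// excess demand, breaking ties on the limit price.
struct cmp_max {
    const int   *EXDEM;
    const float *ORDERSP;
    int bound;   // values above this bound are excluded from the maximum
    int unused;  // placeholder matching the call cmp_max(EXDEM,ORDERSP,0,0)
    cmp_max(const int *e, const float *p, int b, int u)
        : EXDEM(e), ORDERSP(p), bound(b), unused(u) {}
    bool operator()(int i, int j) const {
        const bool vi = EXDEM[i] <= bound; // candidates are nonpositive values
        const bool vj = EXDEM[j] <= bound;
        if (vi != vj) return !vi;          // non-candidates rank below candidates
        if (EXDEM[i] != EXDEM[j]) return EXDEM[i] < EXDEM[j];
        return ORDERSP[i] < ORDERSP[j];    // tie-break on the limit price
    }
};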
Figure 7 gives a numerical example of all arrays used in the market clearing mechanism
of Algorithm A in agent order. The prices of interest p0, p0^before and p0^after are scattered
across the ORDERSP array.
4.2.3. Algorithm B
Algorithm B avoids Algorithm A's two linear searches for p0^before and p0^after, and uses
a search algorithm for p0 that is faster than the linear search. However, this comes
at the cost of imposing some order on the input data. This Algorithm consists of the
following steps:
1. Sort INDICES such that it indexes limit prices in decreasing order,
Figure 7: Market Clearing Arrays

INDICES   ORDERSP   ORDERSQ   EXDEM
0         93.03     -204      4160
1         120.71    201       -5470
2         130.96    -21       -6289
...       ...       ...       ...
15        100.78    -129      -202     } p0^after
...       ...       ...       ...
54        100.42    186       188      } p0^before
...       ...       ...       ...
79        100.68    -75       -73      } p0
...       ...       ...       ...
98        103.50    -53       -1515
99        122.66    -12       -6082
2. Evaluate the excess demand function at each limit price indexed by INDICES
   and store the results in an array EXDEM,
3. Perform a lower bound search for kt(p0) in EXDEM,
4. The indices of p0^before and p0^after are those adjacent to the index of p0 in the sorted
   INDICES array (no further search needed).
The lower bound search algorithm, a variation of the famous binary search
algorithm, requires a sorted input array. Thanks to the monotonicity of the excess
demand function39, sorting ORDERSP in decreasing order and then calculating all
corresponding excess demand values in the EXDEM array guarantees that these are sorted
in non-decreasing order. We employ the std::sort function from the C++ Standard
Library, whose underlying algorithm is not specified by the library's standards40.

// Sort INDICES based on ORDERSP
std::sort(INDICES, INDICES+nAgents, cmp_sort(ORDERSP));
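The cmp_sort functor is not listed in the text; a minimal sketch consistent with its use above might look as follows:

// Assumed cmp_sort functor: orders agent indices by decreasing limit price.
struct cmp_sort {
    const float *ORDERSP;
    cmp_sort(const float *p) : ORDERSP(p) {}
    bool operator()(int i, int j) const {
        return ORDERSP[i] > ORDERSP[j]; // strict decreasing-price order
    }
};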
Again, we sort the INDICES array instead of the actual limit prices ORDERSP since it is
simpler to leave the other arrays untouched for the settling of trades (subsection 4.3).
As can be seen in Figure 8, sorting INDICES such that it indexes limit prices in
ORDERSP in decreasing order leads to the excess demands in EXDEM being sorted in
nondecreasing order.

39 The excess demand function is the difference between a càglàd ("left continuous with right limits") and monotonically decreasing function ft and a càdlàg ("right continuous with left limits") and monotonically increasing function gt. It is therefore interchangeably either càdlàg or càglàd at each point of its domain (càllàl, "continuous on one side, limit on the other side"), as can be seen in the excess demand plots of Figure 5 and Figure 6, and is monotonically decreasing: kt(p1) ≥ kt(p2) ∀p1, p2 ∈ R+, p1 ≤ p2, ∀t ∈ {0, 1, . . . , T}. Definitions from Wikipedia (2015a). Càdlàg — Wikipedia, The Free Encyclopedia. [Online; accessed 10-Jan-2015]. url: http://en.wikipedia.org/w/index.php?title=C%C3%83%C2%A0dl%C3%83%C2%A0g&oldid=610070675
40 Instead, the standards guarantee linearithmic time complexity (average case before C++11, worst case since C++11). cppreference.com (2015b). std::sort - cppreference.com. [Online; accessed 24-Jan-2015]. url: http://en.cppreference.com/w/cpp/algorithm/sort
Figure 8: Market Clearing Arrays – Sorted

INDICES   ORDERSP   ORDERSQ   EXDEM
11        133.95    105       -6336
9         131.02    47        -6289
2         130.96    -21       -6289
93        128.99    113       -6155
69        126.40    25        -6130
...       ...       ...       ...
15        100.78    -129      -202     } p0^after
79        100.68    -75       -73      } p0
54        100.42    186       188      } p0^before
...       ...       ...       ...
10        85.62     105       8241
35        85.09     -111      8241
83        83.90     -247      8352
23        83.57     -18       8599
8         70.75     -44       8617
34        68.04     51        8712

Indexed by INDICES, i.e. ORDERSP[INDICES[i]] and ORDERSQ[INDICES[i]] for i = 0, 1, 2, . . .
The lower bound search algorithm is based on the binary search algorithm, and
returns the first value in a list which does not compare less than the search value. It is
much more efficient than linear search on sorted arrays, making at most log2(N)+1
comparisons41 to find the target value instead of linear search's N comparisons. We
use the std::lower_bound function with a search value of 0. It returns the first element
in EXDEM – sorted in nondecreasing order – which does not compare less than or equal
to 0, i.e. the smallest positive value kt(p0^before).

// Find last limit price at which the excess demand is less or equal to zero
std::vector<int>::iterator low = std::lower_bound(EXDEM, EXDEM+nAgents, 0,
    std::less_equal<int>());
Note that since EXDEM has already been calculated with limit prices in decreasing
order, we do not need to take limit prices into consideration and therefore do not
need to apply the search on INDICES as in Algorithms A and C. Furthermore, once we
have found p0^before, no further search for p0 and p0^after is needed, as their index values
are the two before that of p0^before in INDICES.
In conclusion, Algorithm B allows us to perform only a single efficient search to
find the three values p0^before, p0 and p0^after, at the cost of having to sort the INDICES
array beforehand.
41 cppreference.com (2015a). std::lower bound - cppreference.com. [Online; accessed 25-Jan-2015]. url: http://en.cppreference.com/w/cpp/algorithm/lower_bound
4.2.4. Algorithm C
Our last Algorithm is a slight variation on Algorithm B, and stems from the fact that
the lower bound search need only evaluate up to log2(N)+1 excess demand values.
We can therefore speed up Algorithm B by modifying it such that the excess demand
values are not calculated at every limit price, but only the up to log2(N)+1 < N
values which are required.
This Algorithm consists of the following steps:
1. Sort INDICES such that it indexes limit prices in decreasing order,
2. Perform a lower bound search for the index of p0^before in ORDERSP by calculating
   the necessary excess demand values on the fly,
3. The indices of p0 and p0^after are the two before that of p0^before in the sorted INDICES
   array (no further search needed).
The lower bound search is performed "on" INDICES, since we must evaluate excess
demand at limit prices indexed by INDICES to ensure that these values are ordered
in non-decreasing order.

std::vector<int>::iterator iter = std::lower_bound(INDICES, INDICES+nAgents,
    0, cmp_lower_bound(ORDERSP, ORDERSQ, nAgents));

The on-the-fly calculation of the necessary excess demand values in the lower bound
search is performed within the cmp_lower_bound custom comparison function, sketched
below. The rest of the Algorithm is identical to Algorithm B.
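The thesis does not list the body of cmp_lower_bound; the following is a hypothetical sketch consistent with its use above, reusing the excessdemand function from subsubsection 4.2.1:

// Hypothetical cmp_lower_bound functor: evaluates the excess demand at an
// index's limit price on the fly, so only the values actually visited by
// std::lower_bound are ever computed.
struct cmp_lower_bound {
    const float *ORDERSP;
    const int   *ORDERSQ;
    int nAgents;
    cmp_lower_bound(const float *p, const int *q, int n)
        : ORDERSP(p), ORDERSQ(q), nAgents(n) {}
    // "element compares less than the search value (0)" <=> kt(p_idx) <= 0,
    // so std::lower_bound stops at the first positive excess demand.
    bool operator()(int idx, int value) const {
        return excessdemand(ORDERSP, ORDERSQ, nAgents, ORDERSP[idx]) <= value;
    }
};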
4.2.5. Algorithm Comparison
It isn't clear a priori which of the three Algorithms, summarized in Table 2, should
perform best. We compare the relative efficiency of the three Algorithms by evaluating
their time complexity expressed in big O notation.

Table 2: Market Clearing Algorithms: Summary

                             Algorithm A   Algorithm B   Algorithm C
Sort                         No            Yes           Yes
Excess demand evaluations    N             N             ≤ log2(N) + 1
Search for p0                Linear        Lower bound   Lower bound
Search for p0^before         Linear        −             −
Search for p0^after          Linear        −             −
4.2.5.1. Algorithm A
Algorithm A starts by evaluating the excess demand at each of the N orders. As
detailed in subsubsection 4.2.1, the excess demand function takes the form of a for
loop over the N elements of ORDERSP and ORDERSQ. Since this function is called N
times to calculate the N excess demand values, and by the additive property of the
big O notation42, the calculation of all excess demand values takes quadratic time
O(N²). We must then perform three linear searches, each of which consists of a for
loop over all N elements of EXDEM and therefore takes linear time O(N).
Overall, Algorithm A takes quadratic time:

O(N²) [excess demand evaluations] + O(N) [search for p0] + O(N) [search for p0^before] + O(N) [search for p0^after] = O(N²).
4.2.5.2. Algorithm B
Algorithm B begins by sorting the INDICES array in decreasing order. The std::sort
function used in our implementation takes linearithmic time O(N log N) according
to the C++ Standard Library documentation.
After the quadratic time excess demand function evaluations, the lower bound
algorithm takes logarithmic time43 O(log N).
Finally, finding the indices of p0 and p0^after amounts to looking up the two values
before the index of p0^before in INDICES and therefore takes constant time O(1).
Overall, since N² > N log N > log N (∀N > 0), Algorithm B takes quadratic time:

O(N log N) [sort] + O(N²) [excess demand evaluations] + O(log N) [search for p0] + O(1) [search for p0^before] + O(1) [search for p0^after] = O(N²).
4.2.5.3. Algorithm C
Finally, Algorithm C differs from Algorithm B only in its on-the-fly excess demand
evaluations in the lower bound algorithm. This algorithm performs at most
log2(N)+1 comparisons, with each comparison involving a linear time excess demand
evaluation. The overall time complexity of the excess demand evaluations is
therefore44:

Σ_{i=1}^{j} O(N) = O( Σ_{i=1}^{j} N ) ⊆ O( (log2(N) + 1) ∗ N ) ⊆ O(N log2 N + N) = O(N log N),   where j ≤ log2(N) + 1.

The rest of Algorithm C is identical to Algorithm B. The Algorithm takes
linearithmic time overall:

O(N log N) [sort] + O(N log N) [excess demand evaluations] + O(log N) [search for p0] + O(1) [search for p0^before] + O(1) [search for p0^after] = O(N log N).
42 It can be shown that O(f(n)) + O(g(n)) = O(f(n) + g(n)), such that Σ_{i=1}^{N} O(N) = O(Σ_{i=1}^{N} N) = O(N²).
43 cppreference.com (2015a). std::lower bound - cppreference.com. [Online; accessed 25-Jan-2015]. url: http://en.cppreference.com/w/cpp/algorithm/lower_bound
44 Note that the base of the log is irrelevant, since changing base involves multiplying by a constant factor which the big O notation ignores: log2 N = log N / log 2 ⇒ O(log2 N) = O((1 / log 2) ∗ log N) = O(log N).
4.2.5.4. Conclusion
Table 3 compares the worst-case sequential time complexities of each market clearing
Algorithm and the different functions they consist of.

Table 3: Market Clearing Algorithms: Serial Time Complexity

                             Algorithm A   Algorithm B   Algorithm C
Sort                         −             N log N       N log N
Excess demand evaluations    N²            N²            N log N
Search for p0                N             log N         log N
Search for p0^before         N             1             1
Search for p0^after          N             1             1
SUM                          N²            N²            N log N
Algorithm C has the lowest time complexity, and should therefore be the best
performer. While Algorithms A and B share the same time complexity expressed in
big O notation, one mustn't forget that this notation ignores constant multiples and
therefore doesn't penalize Algorithm A for its three linear searches. Setting aside the
sort and the excess demand evaluations, Algorithm B has a lower time complexity
than Algorithm A, taking logarithmic time for its search versus linear time. We
therefore expect Algorithm C to be faster than Algorithm B, itself faster than
Algorithm A.
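As a rough back-of-the-envelope illustration (our own arithmetic, not a benchmark result), for N = 10 000 orders the dominating terms compare as

N² = 10⁸   versus   N log2 N ≈ 10⁴ × 13.3 ≈ 1.3 × 10⁵,

i.e. roughly three orders of magnitude fewer excess demand operations for Algorithm C.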
4.2.6. Discarding Excess Demand
If the order book is of type Standard 1 or Standard 2, there is still positive or negative
excess demand at the market clearing price p∗ after it has been found. To remove the
excess demand kt(p∗) and thereby reach an equilibrium, we simply subtract the excess
demand from the quantity of the order whose limit price is p∗. While this is a deviation
from Raberto et al.'s mechanism (which randomly discards the excess demand from
compatible orders), our method is more realistic and easier to implement.
4.3. Trade Settling
The settling of trades first involves removing trades that aren't matched by setting
the order quantity ORDERSQ[i] to zero.

// Nullify trades that won't be carried out
for(int i=0; i<nAgents; ++i){
    if ( (ORDERSQ[i] > 0 && ORDERSP[i] < eqPrice) ||
         (ORDERSQ[i] < 0 && ORDERSP[i] > eqPrice) ){
        ORDERSQ[i] = 0;
    }
}
The product of each agent's order quantity ORDERSQ[i] and the market clearing price
eqPrice is then subtracted from her cash holdings CASH[i], and the order quantity
ORDERSQ[i] is added to her share holdings SHARES[i].

// Decrease/increase cash and share holdings of buyers/sellers
for(int i=0; i<nAgents; ++i){
    CASH[i] -= ORDERSQ[i]*eqPrice;
    SHARES[i] += ORDERSQ[i];
}
5. Parallel Implementation
The core of the GASM essentially consists of a loop over two dimensions: first over
time (ticks), then over agents.
Iterations of the time dimension, ticks, are not independent. Indeed, the equilibrium
price at a given tick t is a function of agents' orders, which depend (among other
things) on these agents' cash and share holdings at that tick t. These holdings are
themselves dependent on the equilibrium price and agents' orders at the previous tick
t − 1.
Agents, however, are logically identical but functionally independent; their
pseudorandom numbers, cash and share holdings, and order quantities and limit prices are
independent from those of all other agents. Each agent generates and submits trades
without any interaction with other agents45.
Therefore, the agent dimension of our code can be executed in parallel, while the
time dimension cannot. This entails that we cannot parallelize execution across ticks
but only within ticks. This section describes how various core functions performed at
every tick are ported to the GPU using OpenACC and Thrust.
Table 4: Algorithm Versions

Functions of the GASM core, listed in execution order, against Versions 0–6:
  Calculate agent sigmas
  Pseudorandom number generation
  Order generation
  Demand, supply & excess demand functions
  Market clearing
  Nullify uncleared orders
  Modify agents' holdings
  Calculate and store output variables

[In the original table, each cell's color indicates whether the function executes on the CPU (gray) or the GPU (blue) in the given Version; the coloring is not recoverable in this text version.]
Table 4 indicates which functions of the GASM’s core are executed on the CPU (in
gray) and the GPU (in blue), respectively, in each of the seven different Versions.
The functions are displayed in the order in which they are executed, except for the
demand function which is evaluated once to calculate an output variable. Version 0
is the purely serial implementation executed entirely on the CPU, while Version 6
ports the entire core of the model to the GPU for parallel execution.
Due to the fact that the CPU and the GPU do not share memory (in our workstation
and in most high performance systems), the shifting of computations between
the two in each Version has implications for data transfers. These transfers are very
expensive and are one of the reasons why our goal is to port the whole of the model's
core to the GPU: to minimize the amount of memory transfers. In particular, we
focus our attention on core data transfers, i.e. those performed at every tick of the
model due to a change in value of the underlying variables or arrays. Since these
transfers are repeated thousands of times during the simulation, they have a much
greater impact on the program's running time than non-core data transfers, i.e. those
performed only once outside the core of the model.

45 At least in our setting of the GASM without opinion propagation among agents through clusters.
5.1. Version 1
Version 1 ports the generation of order quantities and limit prices to the GPU. This
function consists of a for loop which is free of data dependencies and can therefore
be easily parallelized using an OpenACC loop directive within a parallel region.
The code within the loop is identical to that of the serial implementation described
in subsubsection 4.1.3.
#pragma acc parallel loop present(..., ORDERSP[0:nAgents], ORDERSQ[0:nAgents], ...)
for(int i=0; i<nAgentsCopy; ++i){
    ...
    ORDERSP[i] = price;
    ORDERSQ[i] = quantity;
    ...
}
Note that the loop trip count nAgents has been changed to the variable nAgentsCopy
= nAgents, defined within the same scope as the loop. This is necessary because
nAgents shares the same data type ( int ) as at least one array operated on in the
loop, which causes problems for the compiler46.
While the loop is free of dependencies and can therefore be parallelized, it suffers
from branch divergence due to the 3 if-else statements it contains. Each thread
executing the body of the loop can therefore take 2³ = 8 different execution paths:
two for buyers and sellers, two for positive and negative random scaling factors, and
two for zero and non-zero order quantities. Assuming the execution paths of the
32 threads within a warp are uniformly distributed across the 8 different branches,
the warp will pass through the body of the loop 8 times with only 32/8 = 4 threads
performing useful work at every pass. This results in a 4/32 = 1/8 = 12.5% utilization
of the processing elements.
Furthermore, this Version incurs the greatest amount of core data transfers. Indeed,
the generation of the two arrays ORDERSP and ORDERSQ requires synchronizing the
CPU-generated standard normal RANDN and standard uniform RANDU pseudorandom
numbers, the agent sigmas UNIQSIG, agents' holdings CASH and SHARES, and the last
market price lastPrice with the GPU at every tick. Table 5 summarizes the core
data transfers in and out of the GPU in Version 1; a sketch of how this per-tick
synchronization can be expressed is given below.
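The following is a minimal sketch (ours, not the thesis' exact code) of such per-tick transfers expressed with OpenACC update directives, assuming the arrays live in an enclosing #pragma acc data region; the array names follow the code above:

// Push host-generated inputs to the device before the order-generation loop
#pragma acc update device(RANDN[0:nAgents], RANDU[0:2*nAgents], \
                          UNIQSIG[0:nUniques], CASH[0:nAgents], SHARES[0:nAgents])
// ... the #pragma acc parallel loop generating ORDERSP and ORDERSQ runs here ...
// Pull the generated orders back to the host after the loop
#pragma acc update self(ORDERSP[0:nAgents], ORDERSQ[0:nAgents])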
46 According to Mathew Colgrove at PGI, for an example for loop over an integer array data[] of length nData: "The integer loop is not parallelizable when nData is a int and data is an int* because the compiler has no way of knowing if data points to nData, as such must presume that it does. Hence the loop is not countable and therefore not parallizable or vectorizable. The same issue occurs with host code. If data was a float (or other non-int), then this issue wouldnt occur." Rob Farber (2014). Pragma Puzzler - Ambiguous Loop Trip Count in OpenMP and OpenACC - TechEnablement. [Online; accessed 31-Jan-2015]. url: http://www.techenablement.com/pragma-puzzler-ambiguous-loop-trip-count-in-openmp-and-openacc/
module for grade 9 for distance learninglevieagacer
 
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsTransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsSérgio Sacani
 
Genetics and epigenetics of ADHD and comorbid conditions
Genetics and epigenetics of ADHD and comorbid conditionsGenetics and epigenetics of ADHD and comorbid conditions
Genetics and epigenetics of ADHD and comorbid conditionsbassianu17
 

Último (20)

Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
Cyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptxCyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptx
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.
 
LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsTransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
 
Genetics and epigenetics of ADHD and comorbid conditions
Genetics and epigenetics of ADHD and comorbid conditionsGenetics and epigenetics of ADHD and comorbid conditions
Genetics and epigenetics of ADHD and comorbid conditions
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

Accelerating economics: how GPUs can save you time and money

Executive Summary

Introduction

The increasing prevalence of graphics processing units (GPUs) in high performance computing makes them essential devices in modern computational research. However, despite the need for computational power in many economic problems and the numerous claims of the GPU's ability to significantly accelerate parallel applications, there have been scarce reports of their use in computational economics. Researchers now have access to a wealth of tools which have democratized the use of the GPU for general-purpose computations. Indeed, libraries and directive-based APIs such as Thrust and OpenACC provide a relatively effortless way to write or port code to the GPU.

This thesis aims at documenting how Raberto et al.'s (2001) Genoa Artificial Stock Market (GASM), an agent-based model in which heterogeneous agents randomly trade a single asset in discrete time, was first implemented for sequential execution on a single CPU core and then ported to the GPU for massively parallel execution. We demonstrate the extent to which this enables us to simulate the model faster (a speedup) or with greater numbers of agents than on the CPU (a sizeup), and under what conditions. With the goal of providing the reader with a basic introduction to general-purpose computing on GPUs, we briefly overview some basic theoretical parallel computing concepts, present the main principles of the GPU's architecture and emphasize its salient differences with that of the CPU.

Methodology

At each time step, all agents in the GASM submit random buy or sell orders based on their cash and asset holdings. A market clearing price is then found through a mechanism matching demand and supply, the key ingredient of the model. We consider three different algorithms to implement this mechanism, of which Raberto et al. provide only an abstract definition.

The market clearing mechanism essentially involves finding three particular values within an array of unsorted excess demand values; if the array is sorted, the three values are adjacent to each other. The first algorithm calculates all array values and then performs three linear searches. The second algorithm calculates all array values, sorts them in increasing order and then performs a single binary-type search to find one of the three values. The third algorithm sorts the inputs to the monotonic excess demand function, and then performs a single binary-type search while calculating array values on the fly where necessary using the sorted input array.

These algorithms differ mainly in their efficiency and their ability to be massively parallelized. In particular, the first two algorithms are relatively inefficient but contain large amounts of parallelism, making them inherently ill suited to sequential execution on the CPU but well suited to massively parallel execution on the GPU. The third algorithm, however, is very efficient but has meager amounts of parallelism and is therefore better suited to the CPU than the GPU. These differences allow us to highlight the importance of taking into account the characteristics of the target architecture in algorithm design.

The search and sort steps of all three algorithms are implemented sequentially on the CPU using functions from the C++ Standard Template Library, and are executed in parallel on the GPU using direct counterparts from the Thrust parallel algorithms library.
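To fix ideas, the following is a minimal C++ sketch of the second algorithm's structure; the excess_demand function and the candidate price grid are hypothetical placeholders standing in for the thesis's actual GASM definitions, not its real code:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Sketch only: compute every excess demand value up front, sort them
    // in increasing order, and locate one of the sought values with a
    // single binary-type search (here: the smallest non-negative excess
    // demand, i.e. the sorted value adjacent to the sign change).
    double second_algorithm(const std::vector<double>& prices,
                            double (*excess_demand)(double)) {
        std::vector<double> ed(prices.size());
        for (std::size_t i = 0; i < ed.size(); ++i)   // independent iterations
            ed[i] = excess_demand(prices[i]);
        std::sort(ed.begin(), ed.end());              // increasing order
        auto it = std::lower_bound(ed.begin(), ed.end(), 0.0);
        return it != ed.end() ? *it : ed.back();
    }

The evaluation loop and the sort are exactly the steps that map onto Thrust counterparts on the GPU, while the single binary search contributes almost no parallelism.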
The excess demand function consists solely of a loop whose iterations are independent, and which can therefore be easily parallelized on the GPU using OpenACC pragma directives. With the exception of the generation of pseudorandom numbers by the NVIDIA cuRAND library, the remaining functions within the model also consist of loops parallelized using OpenACC.

Results

Our results show that while the GPU can significantly accelerate the running time of our programs, this only holds under certain conditions. In particular, if the model contains fewer than 350 agents, as in the original specification of Raberto et al. (2001), no algorithm runs faster on the GPU than on the CPU. This is because none of our algorithms exhibit sufficient parallelism to execute efficiently on the GPU at such small problem sizes. In fact, given that it inherently has the lowest amount of parallelism, the third algorithm must be executed with at least 12 500 agents on the GPU for any speedup to be attained at all.

Given sufficient numbers of agents, however, the GPU can achieve speedups of up to 90× on the first two algorithms, and a more modest maximum speedup of 7× on the third algorithm. Considering that the first two algorithms are ill suited to the CPU, and that we are comparing the performance of a single CPU core against an entire GPU with thousands of cores, this casts doubt on the validity of other speedups of two to three orders of magnitude reported in the literature. Above 12 500 agents, when it is well suited and performs best on both the single CPU core and the GPU, the third algorithm provides a fairer comparison of the respective capabilities of the two devices. The 7× speedup is therefore probably a more accurate illustration of what can be expected from accelerating similar real-life sequential CPU applications with the GPU.

To isolate the impact of porting individual functions to the GPU, we consider seven different versions of our three algorithms, each of which incrementally parallelizes additional functions. We observe that most of the functions outside the market clearing mechanism generally perform worse on the GPU due to their low amounts of parallelism and low ratio of compute operations to memory operations (arithmetic intensity). These functions only execute efficiently on the GPU for tens or hundreds of thousands of agents, for which the model can only be simulated in feasible time spans with the third algorithm due to its superior efficiency. For any number of agents, however, the order generation loop cannot be performed faster on the GPU due to its excessive number of if-else statements. This results in a highly divergent control flow, which carries a significant performance penalty as the GPU's basic processing elements execute in SIMD fashion and must therefore sequentially execute each branch path. Although it might be suboptimal from a pure memory transfer point of view, it might often be best to offload only part of an application to the GPU while leaving the rest on the CPU, given the difference in the types of computations for which the two architectures are well suited.

Lastly, we analyze how our parallel applications can scale to greater numbers of processing elements and/or greater amounts of work. Unfortunately, since the Thrust algorithms used to port its main functions do not allow controlling the number of processing elements used, the third algorithm had to be excluded from this analysis.
Nonetheless, our OpenACC-compatible compiler offloads work to the GPU using CUDA, which gives us some control over the use of the GPU's processing elements and is explicitly designed to allow parallel applications to easily and transparently scale up or down. It achieves this by requiring that groups of threads execute independently and in any order, allowing more or fewer of them to compute in parallel depending on the available resources. Within certain hardware limits on the number of concurrent threads that can reside on the GPU, our results show that the first two algorithms scale extremely well. In fact, the achieved speedups correspond almost exactly to the theoretical maxima predicted by Amdahl's law (restated at the end of this summary for reference).

Conclusion

Two limitations of this study are the comparison of a single CPU core's performance against that of an entire GPU, and the use of unoptimized code on both the CPU and the GPU. The former limitation under-utilizes most of our 64-core CPU, and therefore prevents us from making more general statements on the merits of the CPU versus those of the GPU. Future work could address this by parallelizing loops and Thrust algorithms over all available CPU cores using the OpenMP system. The latter limitation is less straightforward to address, and would require rigorously evaluating the CPU and GPU versions of our code against architecture-specific optimization techniques. Nonetheless, since no optimization was applied to either platform, this at least does not invalidate our comparison of the two.

This study has shown that massively parallel execution on the GPU can successfully speed up one example of an economic application and allow it to solve a larger problem size, relative to a single CPU core's sequential execution. While the GPU is not able to accelerate every type of computation, it is very efficient for applications with high parallelism, high arithmetic intensity, large datasets and low control flow divergence. Given the growing number of computationally intensive economic problems, and the availability of tools making general-purpose computations on the GPU relatively easy and accessible, we expect the GPU to become commonplace in computational economics in the years to come.
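For reference, the Amdahl's law bound cited in the strong-scaling results above is the standard one: if a fraction p of a program's work can be parallelized, its speedup on N processing elements is bounded by

    S(N) = \frac{1}{(1 - p) + p/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p},

so even a highly parallel program (p close to 1) is ultimately limited by its serial fraction.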
Acknowledgments

I am extremely fortunate to have been accompanied by many outstanding people throughout the extraordinary journey that is the MSc UZH ETH in Quantitative Finance, which this Master's thesis concludes, and would like to take this opportunity to acknowledge their support.

I would first like to thank Prof. Dr. Karl Schmedders for supervising my thesis, giving me the opportunity to conduct this study and thereby hopefully contribute to the nascent use of GPUs in computational economics. His lecture on "Computational Economics and Finance" was the highlight of my Master's curriculum, if only for Karl's exuberant teaching style. More importantly, it was some of the most interesting time I have ever spent in a classroom.

Secondly, I would like to thank Dr. Gregor Reich who mentored me throughout this Master's thesis, and with whom I had great pleasure working. Not only is Gregor a very friendly person with whom I partook in many intellectually challenging discussions, but his knowledge, experience and enthusiasm were invaluable in guiding me through this work.

In addition, my time in Zurich would not have been the same without many people I am proud to call my friends. Thank you for being by my side through good times and bad. In particular, I would like to thank Markus, Pascal, Pavel, Rafaela, Ryan, Stefan and Tom; as well as my friends from the ETH Entrepreneur Club and especially Daniel, Jonathan and Wolf. There is no doubt in my mind that they will all excel in their respective endeavors.

To my dear friends and colleagues François and Pascal, with whom I have already embarked on my next challenge: thank you for your patience and support. I am excited about what the future will bring, and I think the best is yet to come.

To my girlfriend Hélène, whose support extends far beyond this Master's thesis, I address my deepest thanks. She has been my source of encouragement, inspiration and joy since the day I met her, and I am a better person for knowing her.

Finally, I would like to thank my family, and in particular my parents, without whom none of this would have been possible. Their unwavering support and encouragement since 1991 has enabled everything I have achieved and become.
Contents

1. Introduction
2. Parallel Computing
   2.1. Principles
        2.1.1. Parallel Algorithms
               2.1.1.1. Work Partitioning
               2.1.1.2. Dependencies
               2.1.1.3. Communication
               2.1.1.4. Synchronization
               2.1.1.5. Load Balancing
        2.1.2. Parallel Computers
               2.1.2.1. Flynn's Taxonomy
               2.1.2.2. Memory Distribution
        2.1.3. Supercomputers
   2.2. Graphics Processing Unit
        2.2.1. Principles
               2.2.1.1. Exploiting Parallelism
               2.2.1.2. Exploiting Coherence
               2.2.1.3. Hiding Memory Latency
               2.2.1.4. CPU - GPU Connection
               2.2.1.5. Scalability
        2.2.2. General-Purpose Computing
               2.2.2.1. From Fixed-Function to Programmable Shaders
               2.2.2.2. CPU vs. GPU
               2.2.2.3. Tools
        2.2.3. Applications
               2.2.3.1. Agent-Based Models
3. Genoa Artificial Stock Market
   3.1. Order Generation
        3.1.1. Limit Price
        3.1.2. Quantity
   3.2. Market Clearing
        3.2.1. Clearing Price
               3.2.1.1. Standard Order Books
               3.2.1.2. Pathological Order Books
        3.2.2. Discarding Excess Demand
   3.3. Trade Settling
4. Serial Implementation
   4.1. Order Book Generation
        4.1.1. Agent Sigmas
        4.1.2. Pseudorandom Number Generation
        4.1.3. Limit Price and Quantities
   4.2. Market Clearing
        4.2.1. Excess Demand
        4.2.2. Algorithm A
        4.2.3. Algorithm B
        4.2.4. Algorithm C
        4.2.5. Algorithm Comparison
               4.2.5.1. Algorithm A
               4.2.5.2. Algorithm B
               4.2.5.3. Algorithm C
               4.2.5.4. Conclusion
        4.2.6. Discarding Excess Demand
   4.3. Trade Settling
5. Parallel Implementation
   5.1. Version 1
   5.2. Version 2
   5.3. Version 3
        5.3.1. Excess Demand Evaluations
   5.4. Version 4
        5.4.1. Algorithm Comparison
               5.4.1.1. Algorithm A
               5.4.1.2. Algorithm B
               5.4.1.3. Algorithm C
               5.4.1.4. Conclusion
   5.5. Version 5
   5.6. Version 6
6. Results
   6.1. Baseline
   6.2. Increasing the Number of Agents
        6.2.1. Algorithms A and B
               6.2.1.1. Excess Demand Evaluations
               6.2.1.2. Other Functions
               6.2.1.3. Imbalance
        6.2.2. Algorithm C
   6.3. Speedup vs. Number of Agents
   6.4. Scaling
        6.4.1. Strong Scaling
               6.4.1.1. Versions 0-2
               6.4.1.2. Versions 3-6
        6.4.2. Weak Scaling
7. Conclusion
A. GASM Definitions
   A.1. Agent Sigmas
   A.2. Demand and Supply
   A.3. Market Clearing Price per Order Book Type
        A.3.1. Standard
        A.3.2. Pathological
B. Market Clearing Price per Order Book Type, Implementation

List of Figures

1. Flynn's Taxonomy
2. Simplified Graphics Pipeline
3. CPU vs. GPU: Hardware
4. CPU vs. GPU: Performance
5. Standard Order Books
6. Pathological Order Books
7. Market Clearing Arrays
8. Market Clearing Arrays - Sorted
9. Running Time (s), N = 100
10. Profiler Output
11. Running Time (s), N = 1000
12. Running Time (s), N = 20 000 and N = 100 000
13. Running Time (s) vs. Number of Agents
14. Speedup vs. Number of Agents
15. Strong Scaling
16. Weak Scaling

List of Tables

1. Order Generation Summary
2. Market Clearing Algorithms: Summary
3. Market Clearing Algorithms: Serial Time Complexity
4. Algorithm Versions
5. Core Data Transfers - Version 1
6. Core Data Transfers - Version 2
7. Core Data Transfers - Version 3
8. Core Data Transfers - Version 4
9. Market Clearing Algorithms: Optimal Parallel Time Complexity
10. Core Data Transfers - Version 5
11. Core Data Transfers - Version 6
12. Market Clearing Price and Excess Demands by Order Book Type
1. Introduction

Computational economics, a field at the intersection of economics and computation, allows the study of economic issues with far more complexity and realism than traditional models. Advances in numerical analysis and increasingly powerful computer hardware have given rise to entire collections of previously intractable economic problems (Judd, 1998). For example, the analytical modeling of financial markets consisting of heterogeneous agents is very difficult, and must typically resort to modeling an idealized representative agent (Humphreys, 2009) at the cost of losing economic believability and robustness (LeBaron, 2006). Computational models, on the other hand, can quite easily model such complex adaptive systems while reproducing puzzling empirical features (LeBaron, 2006).

Today's desktop computers provide economists with nearly an order of magnitude more computing power than a leading supercomputer of less than two decades ago. While performance increases were historically achieved by increasing processor transistor counts and clock speeds, this technique quickly yielded diminishing returns and generated excessive waste heat. In response, processor manufacturers began integrating additional processing elements (or cores) on a single chip to speed up execution with parallel computing (Creel and Goffe, 2008). This concept was borrowed from supercomputers, which since the mid-1960s have relied on parallelism in the form of multiple processing elements and vector processing to accelerate computations and solve larger problems. While parallel computing has become a fundamental paradigm in modern high performance computation, and ubiquitous even in personal computers, its uptake in economics has been relatively slow. This is particularly true for certain classes of parallel computer architectures such as graphics processing units (GPUs) (Aldrich, 2014). While the optimization of single-processor performance remains paramount (Hager and Wellein, 2010), the clear trend towards increased parallelism compels economists to have at least a rudimentary grasp of parallel computing. The returns can be high: some applications of GPUs in economics have been shown to speed up the simulation of economic models by up to 200× (Aldrich et al., 2011), although some have claimed that this is an unfair comparison due to the use of unoptimized code on the CPU and biased choices of hardware.

The aim of this thesis is to document how a computational economic model, Raberto et al.'s (2001) Genoa Artificial Stock Market, can make use of parallel computing in a hybrid CPU/GPU workstation, and with what results. In particular, we aim to answer the following questions:

1. How must the codebase of an existing serial implementation be adapted to utilize parallel programming techniques and the GPU?
2. How does incrementally parallelizing parts of the program with the GPU speed up the running time of the model?
3. How does the parallel speedup scale with the size of the problem (weak scaling)?
4. How does the parallel speedup scale with the number of processing units (strong scaling)?
The rest of this thesis is structured as follows: section 2 introduces some basic principles of parallel computing, graphics processing units and their use in high performance scientific computing; section 3 defines the Genoa Artificial Stock Market; section 4 documents the serial implementation of the model, and in particular the three different algorithms used to clear the market; section 5 describes how each program is incrementally parallelized and executed on the GPU; section 6 presents the timing and scaling results of our serial and parallel implementations; and finally section 7 draws a conclusion on the work, highlights some of its limitations and provides suggestions for future research.
2. Parallel Computing

This section provides a brief summary of Barney's [1] and Oberholzer's (2013) introductions to parallel computing. It aims at defining some basic notions of parallel computing used throughout this report, but does not aspire to provide a thorough introduction to this complex topic.

2.1. Principles

Parallel computing involves the use of multiple compute resources concurrently to solve a computational problem. This contrasts with serial computing, which involves only a single compute resource. Parallel computing allows solving problems faster or with larger input sizes, at the expense, however, of increased difficulty for the programmer.

2.1.1. Parallel Algorithms

In addition to single-processor performance concerns as in the classical serial case, the parallel algorithm designer must consider certain key issues specific to parallel computing.

2.1.1.1. Work Partitioning

Parallel algorithms distribute work across multiple compute resources in two ways:

1. Domain decomposition (or data parallelism) partitions the data associated with the problem, where each compute resource performs the same task on its subset of the data.
2. Functional decomposition (or task parallelism) partitions the tasks associated with the algorithm, where each compute resource performs a different task on different data.

Note that domain decomposition is a special case of functional decomposition.

2.1.1.2. Dependencies

Dependencies are the main inhibitors of parallelism, since they impose that the instructions (or tasks) which make up the program be executed in a specific order. Clearly, if an instruction must be performed after another – for example because it has as input the other instruction's output – then the two instructions cannot be executed in parallel. The sum of all chains of dependent instructions within a program constitutes a lower bound on the program's running time.

If there are few or no dependencies between instructions, the algorithm is said to be embarrassingly parallel. Inherently sequential algorithms, on the other hand, have dependencies at every instruction.

[1] Blaise Barney. Introduction to Parallel Computing. [Online; accessed 09-May-2015]. url: https://computing.llnl.gov/tutorials/parallel_comp/
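As an illustration (a fragment added here for exposition, not taken from the thesis), the first loop below has independent iterations while the second carries a dependency from each iteration to the next:

    #include <vector>

    // The first loop's iterations are independent and can run in any
    // order, or all at once. The second loop's iteration i needs the
    // result of iteration i-1 (a loop-carried dependency), so its
    // iterations cannot simply be distributed across processors.
    void dependencies(std::vector<double>& y, std::vector<double>& s,
                      const std::vector<double>& x, double a) {
        for (std::size_t i = 0; i < x.size(); ++i)
            y[i] = a * x[i];            // embarrassingly parallel
        for (std::size_t i = 1; i < x.size(); ++i)
            s[i] = s[i - 1] + x[i];     // inherently sequential as written
    }

A prefix sum like the second loop can in fact be parallelized, but only by restructuring it into a scan algorithm (such as those provided by Thrust), not by naively distributing its iterations.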
2.1.1.3. Communication

Communication between tasks may be required if, for example, a task must communicate its output to another so that the latter may proceed. Since communications use valuable resources to prepare and carry out, they are costly. In the worst case, communications are synchronous, requiring tasks to synchronize to negotiate communication parameters before transmissions. Embarrassingly parallel problems require little to no communication between tasks.

The ratio of computation to communication, the granularity, must be chosen according to the algorithm and the hardware architecture at hand. Since the overhead costs (e.g. latency, synchronization) of communications are typically high relative to communication costs (bandwidth), high granularity (coarse-grain parallelism) is usually preferable to low granularity (fine-grain parallelism), at the risk of load imbalance.

2.1.1.4. Synchronization

Synchronization is the coordination of parallel tasks, very often associated with communications. This is typically done by establishing a synchronization point, a barrier, which other tasks must reach before all participating threads may proceed. Since at least one task will have to wait at the barrier, synchronization increases a program's execution time.

2.1.1.5. Load Balancing

Load balancing involves distributing work to each compute resource so as to avoid idleness. If all processors perform equally and the workload is deterministic, a simple load balancing scheme involves distributing the workload evenly across processors. An alternative scheme consists of splitting the workload into batches which are then queued; once a processor finishes working on its batch, it receives a new one. This minimizes the time that faster processors, or those with smaller workloads, stay idle.

2.1.2. Parallel Computers

The design of parallel algorithms must also take into account the architecture of the parallel computer it will be executed on. For the purpose of this report, we classify parallel computers first according to their number of concurrent instruction and data streams (Flynn's Taxonomy), and then according to their memory architecture.

2.1.2.1. Flynn's Taxonomy

Flynn (1972) introduces a categorization of computer architectures, illustrated in Figure 1 with figures from Wikipedia [2], based on the number of concurrent instructions they can execute at each clock cycle – a single instruction (SI) or multiple instructions (MI) – and the number of data streams they can operate on at each clock cycle – a single data stream (SD) or multiple data streams (MD). "PU" denotes a processing unit (or processing element).

[2] Wikipedia (2015b). Flynn's taxonomy — Wikipedia, The Free Encyclopedia. [Online; accessed 10-May-2015]. url: http://en.wikipedia.org/w/index.php?title=Flynn's_taxonomy&oldid=661352526
Figure 1: Flynn's Taxonomy

(a) SISD: a single processing element executes a single instruction on a single stream of data.
(b) SIMD: multiple processing elements execute a single instruction on multiple streams of data.
(c) MISD: multiple processing elements execute multiple instructions on a single stream of data.
(d) MIMD: multiple processing elements execute multiple instructions on multiple streams of data.

The two dominating concepts in modern computing, both personal and high performance, are SIMD and MIMD. SIMD computers include vector processors (which operate on vectors instead of scalars) and the vector units of modern CPUs and GPUs (Hager and Wellein, 2010). Multi-core processors have MIMD architectures, in addition to often having SIMD capabilities at the core level.

2.1.2.2. Memory Distribution

In addition to Flynn's Taxonomy, we further classify parallel computers based on their memory architecture.

Shared memory computers share (logically and often physically) common memory across all processors. We speak of uniform memory access (UMA) if the access time to the shared memory is the same for all processors, and of non-uniform memory access (NUMA) otherwise. All processors share the same view of memory.
Distributed memory computers have (logically and often physically) separate memory for each processor. Processors communicate by transmitting messages through the network that connects them.

Hybrid distributed-shared memory computers combine elements of both shared and distributed memory computers, e.g. a set of interconnected shared-memory systems.

2.1.3. Supercomputers

Supercomputers are very high performance computers used for the most challenging scientific and industrial applications, and as such are at the frontier of high performance computing. Most modern supercomputers are hybrid distributed-shared memory MIMD systems consisting of a cluster [3] of closely interconnected computers (or nodes), each with multiple multi-core CPUs. For example, the world's most powerful supercomputer, the Chinese Tianhe-2, has 16 000 nodes, each equipped with 2x 12-core processors, 3x 57-core "coprocessors" and 88 gigabytes of memory [4].

Coprocessors, such as the Intel Xeon Phi installed in the Tianhe-2, are a special type of hardware accelerator. Hardware accelerators are closely controlled by the CPU to execute compute-intensive code, with examples including field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs) (Hager and Wellein, 2010). The most common type of supercomputer accelerator is the graphics processing unit (GPU), which is the main focus of this work. Since 2010, there has been a clear trend towards the use of accelerators, and in particular GPUs, in supercomputers. In fact, out of the 500 most powerful commercially available computer systems in November 2014 [5], 75 were equipped with accelerators (53 of them with GPUs), including 11 (7) of the top 25. According to a study by the International Data Corporation [6], the use of accelerators is growing fast, with 76.9% of high performance computing sites being equipped with them in 2013. GPUs are therefore now clearly part of the high performance computing landscape.

2.2. Graphics Processing Unit

The graphics processing unit, or GPU, is a key component of any modern computer. Its earliest ancestors were dedicated hardware accelerators developed to accelerate graphics processing, mainly in the areas of computer-aided design and computer games. For example, GPUs are used to render a two-dimensional representation of a three-dimensional scene through a process known as the graphics pipeline, illustrated in Figure 2 (Kaufman, Fan, and Petkov, 2009).

[3] In November 2014, over 85% of the world's 500 most powerful supercomputers were based on the cluster architecture. TOP500.org. TOP500 List - November 2014. [Online; accessed 26-Apr-2015]. url: http://www.top500.org/list/2014/11/
[4] TOP500.org. TIANHE-2 (MILKYWAY-2): NATIONAL UNIVERSITY OF DEFENSE TECHNOLOGY. [Online; accessed 11-May-2015]. url: http://www.top500.org/featured/top-systems/tianhe-2-milkyway-2-national-university-of-defense/
[5] TOP500.org. TOP500 List - November 2014. [Online; accessed 26-Apr-2015]. url: http://www.top500.org/list/2014/11/
[6] International Data Corporation (IDC). IDC SC13 Breakfast Briefing. [Online; accessed 11-May-2015]. url: http://www.idc.com/downloads/idc_at_sc13_%2011-19-2013_final.pdf
The graphics pipeline operates on the points (vertices) defining the geometric primitives which describe the scene (e.g. triangles), along with additional attributes such as texture, color and lighting, to produce a raster image representation consisting of a rectangular array of discrete picture elements, or pixels (Blythe, 2008).

Figure 2: Simplified Graphics Pipeline

2.2.1. Principles

The graphics pipeline involves performing mathematical operations on a stream of a very large number of different objects at every step of the pipeline, and is therefore inherently a high throughput process. In order to understand the GPU and its applications, we must first understand the fundamental principles which underlie its design. This subsection summarizes part of Blythe's (2008) excellent overview of the GPU.

2.2.1.1. Exploiting Parallelism

There are inherently two levels of parallelism within the graphics pipeline:

• data parallelism, wherein the very large number of objects (e.g. geometric primitives, vertices, pixel fragments) at each stage of the graphics pipeline can be processed in parallel, and
• task parallelism, wherein the different steps of the graphics pipeline can be performed concurrently on different objects through pipelining.

Consequently, the GPU is designed to be a highly parallel computer. A modern GPU contains multiple processing units, known as streaming multiprocessors in NVIDIA [7] GPUs, which can each process a different task in parallel to leverage task parallelism. Each streaming multiprocessor can further operate on multiple objects in parallel to leverage data parallelism. For example, the NVIDIA Kepler GK110 GPU contains 15 streaming multiprocessors, each of which contains 192 cores, for a total of 2880 cores.

2.2.1.2. Exploiting Coherence

Each streaming multiprocessor within a modern GPU essentially operates in SIMD fashion, executing instructions on multiple data elements (lanes) concurrently.

[7] The NVIDIA Corporation, a publicly traded company based in California (United States), is one of the leading manufacturers of graphics processing units.
Since only a single instruction can be executed at a time within a group of lanes, the branching of the execution flow by a conditional statement (such as an if statement) will cause some lanes to diverge, and each branch will be executed sequentially by the subset of lanes on that branch. GPUs therefore achieve maximum efficiency when branch divergence is avoided entirely, i.e. when all lanes execute an identical instruction stream (this is called computational coherence).

Furthermore, the GPU is designed for high throughput processing: easily dozens to hundreds of millions of objects per second at every step of the graphics pipeline. With each object requiring a certain amount of memory, the coherence of memory accesses is also of very high importance. Since memory bandwidth is precious, memory accesses should be grouped into as few requests as possible, and the entire block of data fetched should be used.

2.2.1.3. Hiding Memory Latency

The amount of unique, non-reusable data which is typically used to render a scene (e.g. tens or hundreds of megabytes just for texture maps) is so large that the CPU's conventional strategy of using memory caches to hide memory latency is ineffective. Instead, the GPU uses smaller memory caches to optimize bandwidth for SIMD memory requests and hides latency using multithreading. This technique involves maintaining a pool of threads available for execution which can substitute for a thread stalled by the latency of a memory request. Since the cost of switching between threads is essentially zero, memory latency can be completely hidden as long as the pool contains enough threads.

2.2.1.4. CPU - GPU Connection

The typical modern high-performance GPU housed in a server or workstation lies on a graphics card which connects to the motherboard through an expansion slot such as PCI Express. The GPU is controlled by the CPU but is logically separate and executes independently. The graphics card features multiple gigabytes of high-bandwidth random-access memory which is also separate from the system's main memory, both logically and physically [8]. The peak bandwidth between the GPU and the device memory is significantly greater than between the device memory and the system memory. On an NVIDIA Tesla K20c graphics card, for example, a high-end professional card targeted at the high performance computing market, the former is 208 GB/s while the latter is constrained to 8 GB/s by the PCI Express expansion slot [9].

2.2.1.5. Scalability

The array of streaming multiprocessors which GPUs are composed of can be scaled to provide more or less computing power depending on the target application. The modern programming models used for general-purpose computations on GPUs, which we shall briefly cover in a later section, are designed to allow applications to transparently scale to greater numbers of processing elements with great ease to the programmer.

[8] Throughout this work, to simplify, we will refer to the CPU and its main memory as "CPU", and the GPU and the graphics card's memory as "GPU".
[9] The latest graphics cards, such as the NVIDIA Tesla K80, support the 3rd generation of PCI Express, which doubles the peak bandwidth with respect to the 2nd generation.
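To make the branch divergence penalty from the coherence discussion above concrete, the following illustrative CUDA-style kernel (added here for exposition; it is not from the thesis) forces the lanes of each SIMD group down two different paths, which the hardware must then execute one after the other:

    // Even and odd threads within the same SIMD group (warp) take
    // different branches, so the hardware runs the two paths
    // sequentially, roughly halving throughput on this statement.
    __global__ void divergent(float* out, const float* in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (i % 2 == 0)
                out[i] = 2.0f * in[i];   // path taken by even lanes
            else
                out[i] = in[i] + 1.0f;   // path taken by odd lanes
        }
    }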
2.2.2. General-Purpose Computing

2.2.2.1. From Fixed-Function to Programmable Shaders

In the early 2000s, GPU manufacturers such as NVIDIA started exposing programmability in the graphics pipeline. The transition from so-called fixed-function pipelines, which performed fixed – albeit to some extent parameterizable – functions at every stage of the pipeline, provided application programmers with much more flexibility. This gave them the ability to write shaders, programs used to process one stage of the pipeline [10], to specify custom and more sophisticated rendering behavior (Blythe, 2008). These shaders are executed on every object (e.g. pixel fragments or vertices) at a step of the pipeline, and can therefore be likened to programs invoked on each point in an input space (Ostrovsky, 2010). For example, the basic linear algebra floating-point computation SAXPY, consisting of the multiplication of a vector x of length n by a scalar a and the addition of the result to the vector y, could be programmed as follows on the CPU:

    for (int i = 0; i < n; ++i) {
        y[i] = a*x[i] + y[i];
    }

An equivalent shader, executed on every element in the vector y on the GPU, would simply consist of the following code:

    y[idx] = a*x[idx] + y[idx];

where idx is the index of each of the n shading processors executing the shader. While each shading processor performs a scalar operation on a single element of the vector (the input space), the n processors taken together perform a vector operation.

One of the first documented uses of the GPU by the scientific research community was Larsen and McAllister's (2001) parallel matrix multiplication algorithm (Du et al., 2012). The support for floating-point arithmetic in 2002-2003 opened up the field of general-purpose computing on GPUs – or GPGPU – to a much wider range of scientific applications requiring greater precision than what the GPU's historical integer support could provide. These first applications were developed using shading languages such as NVIDIA's Cg, Microsoft's HLSL or OpenGL's GLSL, which remained within the realm of computer graphics and required application programmers to remap their algorithms to graphics concepts [11]. The advent of NVIDIA's Compute Unified Device Architecture, or CUDA, in 2006 provided developers with a software environment allowing them to program GPUs using high-level general-purpose programming languages such as C, C++ and Fortran, and bolstered the use of GPUs in high performance computing.

[10] For example, pixel/fragment shaders compute the color values of pixel fragments in the fragment processing step of the pipeline (Ostrovsky, 2010).
[11] Simon Green (2008). CUDA 2.1 FAQ. [Online; accessed 26-Apr-2015]. url: https://devtalk.nvidia.com/default/topic/402275/cuda-programming-and-performance/cuda-2-1-faq-please-read-before-posting/
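In the CUDA model just mentioned, the per-element shader pattern above maps directly onto a kernel. A minimal sketch for exposition only (the thesis's own GPU code uses OpenACC and Thrust rather than hand-written CUDA):

    // One thread computes one element of y; the guard handles the case
    // where more threads are launched than there are elements.
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Launched with enough 256-thread blocks to cover all n elements:
    // saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);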
2.2.2.2. CPU vs. GPU

As illustrated in Figure 3 by NVIDIA [12], GPUs devote a relatively greater share of their chip area to processing elements (ALUs, arithmetic logic units) compared to CPUs, at the expense of simpler flow control and fewer data caches.

Figure 3: CPU vs. GPU: Hardware

Furthermore, GPU cores typically run at a lower clock rate than CPU cores. For example, each of the 2880 cores of the NVIDIA Kepler GK110 runs at 706 MHz, versus 2500 MHz for each of the 16 cores of an AMD Opteron 6380 CPU. The GPU's focus is therefore not on single-thread performance, which is much poorer than on the CPU, but rather on leveraging its massive number of cores to process many threads at a time. The following interesting analogy was found on a wiki:

    One way to visualize it is a CPU works like a small group of very smart people who can quickly do any task given to them. A GPU is a large group of relatively dumb people who aren't individually very fast or smart, but who can be trained to do repetitive tasks, and collectively can be more productive just due to the sheer number of people. [13]

Figure 4 illustrates the evolution of the performance of NVIDIA GPUs versus that of Intel CPUs, measured in terms of their number of floating-point operations per second (FLOPS), as reported by NVIDIA [14]. It clearly shows that the GPU has significantly more theoretical computing power than the CPU, with the divide increasing with time.

Taken in isolation, the GPU's substantial performance is a necessary but no longer sufficient condition for consideration by the high performance computing community. As the operational costs of supercomputers are now commensurate with acquisition costs, FLOPS per watt has replaced absolute FLOPS as a central measure of system performance (Farber, 2015). In addition to decreasing execution time, GPUs have been praised as more energy efficient than CPUs in many situations (Mittal and Vetter, 2014). For example, thanks to its hybrid CPU-GPU architecture, the Swiss National Supercomputing Centre's (CSCS) Piz Daint supercomputer was able to divide the energy consumed by climate simulations by seven relative to its pure CPU predecessor [15].

[12] NVIDIA Inc. CUDA C Programming Guide. Version 7.0. url: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[13] Why a GPU mines faster than a CPU. [Online; accessed 05-May-2015]. url: https://en.bitcoin.it/wiki/Why_a_GPU_mines_faster_than_a_CPU
[14] Image from NVIDIA Inc. CUDA C Programming Guide. Version 7.0. url: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[15] Swiss National Supercomputing Centre (2014). Fact Sheet - Piz Daint, the first supercomputer with sustained petaflops-scale performance in Switzerland. [Online; accessed 28-Apr-2015]. url: http://www.cscs.ch/uploads/tx_factsheet/FSPizDaint_2014_E.pdf
Figure 4: CPU vs. GPU: Performance

The GPU's better energy efficiency on well suited applications stems from the fact that it requires much less time to finish a given task, leading to a lower energy consumption despite a higher power consumption. Indeed, GPUs consume large amounts of power; for example, the NVIDIA Tesla K20c draws up to 225 W of power [16], compared to 115 W for an AMD Opteron 6380 CPU. Nonetheless, GPU manufacturers are paying close attention to the power consumption of their architectures and delivering substantial improvements with each new generation. The energy efficiency of NVIDIA's latest GPU architecture, Maxwell, is double that of the previous generation, Kepler [17], itself triple that of its predecessor, Fermi [18].

2.2.2.3. Tools

The ongoing democratization of GPUs in high performance scientific computing would not have been possible without the emergence of tools allowing non-specialists to make use of them.

[16] NVIDIA Inc. Tesla K20 GPU Accelerator - Board Specification. [Online; accessed 27-Apr-2015]. url: http://international.download.nvidia.com/tesla/pdf/tesla-k20-passive-board-spec.pdf
[17] NVIDIA Inc. Whitepaper - NVIDIA GeForce GTX 750 Ti. [Online; accessed 26-Apr-2015]. url: http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce-GTX-750-Ti-Whitepaper.pdf
[18] NVIDIA Inc. Whitepaper - NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110. [Online; accessed 27-Apr-2015]. url: http://www.nvidia.com/content/PDF/kepler/NVIDIA-kepler-GK110-Architecture-Whitepaper.pdf
• directive-based application programming interfaces (APIs) such as OpenACC and OpenMP[20],
• algorithm libraries such as Thrust, NVIDIA cuBLAS and NVIDIA cuFFT, and
• programming language extensions such as CUDA and OpenCL.

Based on the premise that scientific researchers outside the field of computer science - and in particular in economics - will wish to leverage the GPU either in existing applications, or with a minimal learning curve and within a software framework they are comfortable with, this study focuses on the OpenACC directive-based API and the Thrust parallel algorithms library. In the rest of this subsection, we briefly present OpenACC and Thrust, as well as CUDA, on which both are based and which occupies a central role in GPGPU.

OpenACC

The OpenACC API - for Open Accelerators - consists of compiler directives, library routines and environment variables which collectively enable offloading parts of C, C++ and Fortran programs from a host CPU to an attached accelerator device (OpenACC.org, June 2013). Programmers guide[21] an OpenACC-compatible compiler on how to offload sections of code to the GPU by inserting recognized directives (#pragma in C and C++). For example, our SAXPY example can easily be parallelized as follows:

#pragma acc parallel
#pragma acc loop
for(int i=0; i<n; ++i){
    y[i] = a*x[i] + y[i];
}

The parallel and loop constructs are fundamental to OpenACC and deserve a brief explanation:

• The parallel construct begins parallel, redundant and independent execution of the code within the parallel region by each "execution unit". It may also specify which data should be copied in and out of the GPU, which data should be created and deleted, and which data is already present on the GPU.

• The loop construct describes what type of parallelism should be used to execute the loop. In particular, it allows partitioning the work across "execution units" to speed up execution, instead of having each "execution unit" execute the entire workload redundantly.

We have put "execution unit" in quotation marks because OpenACC hides the underlying hardware from the programmer, providing him or her instead with an abstract execution model based on three levels of parallelism. This allows the mapping to a

[20] OpenMP supports accelerators since Version 4.0. OpenMP ARB Corporation. Frequently Asked Questions on OpenMP. [Online; accessed 03-May-2015]. URL: http://openmp.org/openmp-faq.html#Accel
[21] Since directives are not prescriptive but rather provide the compiler with hints on desirable behavior, pragma-based programming can be thought of as a negotiation between the developer and the compiler. Rob Farber. Easy GPU Parallelism with OpenACC. [Online; accessed 03-May-2015]. URL: http://www.drdobbs.com/parallel/easy-gpu-parallelism-with-openacc/240001776
generic architecture consisting of a set of multi-threaded vector processors, such as GPUs and their arrays of streaming multiprocessors. The first level of parallelism, gangs, provides coarse-grain parallelism, with each gang consisting of one or more workers, the second level. Finally, each worker may consist of one or more vector lanes, the third level, which provide fine-grain SIMD parallelism. The parallel construct spawns one or more gangs, while the loop construct distributes work across gang, worker and vector parallelism.

OpenACC-compatible compilers can spare the programmer from explicitly managing data transfers between CPU and GPU memory by automatically moving data where necessary, while allowing the programmer to override this default behavior. In the SAXPY example above, the compiler ensures that the scalar a and the arrays y and x are copied to and from the accelerator. Data directives can be used to specify explicitly that only the result vector y should be copied both in and out of the GPU, while a and x should only be copied in:

#pragma acc parallel copy(y[0:n]) copyin(x[0:n],a)
#pragma acc loop
for(int i=0; i<n; ++i){
    y[i] = a*x[i] + y[i];
}

By requiring only minor modifications to the codebase, without the need for the programmer to manage low-level details such as data locality, OpenACC arguably provides the simplest way for scientists to leverage the GPU. Finally, since it is based on directives which are simply ignored by incompatible compilers, OpenACC also allows applications to be ported to systems without accelerators or without an OpenACC-compatible compiler. Enabling OpenACC in the PGI compiler on a suitably equipped and configured system simply involves compiling the program with the -acc flag.

OpenACC does not actually specify how its directives and programming model should be implemented to offload work from the host to the device. Rather, this is managed by the OpenACC-enabled compiler and runtime environment. The PGI compiler uses CUDA for NVIDIA GPUs, and OpenCL for AMD GPUs[22].

Thrust

Thrust is a library of parallel algorithms and data structures based on the C++ Standard Template Library (STL). It provides high performance applications with a high-level interface to technologies such as C++, CUDA, OpenMP and TBB to accelerate sorting, transformations, reductions, scans, etc. on parallel architectures such as GPUs and multiprocessor or multi-core systems[23]. Thrust follows a syntax practically identical to that of the STL algorithms, and can therefore be deployed with minor modifications in applications already leveraging them. For example, the SAXPY operation can be performed with the std::transform algorithm and a custom functor saxpy as follows:

std::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy(a));

[22] NVIDIA Inc. PGI Compiler User's Guide. Version 2015. URL: http://www.pgroup.com/doc/pgiug.pdf
[23] Thrust Wiki. [Online; accessed 03-May-2015]. URL: https://github.com/thrust/thrust/wiki
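The functor saxpy is not reproduced in the text; a minimal sketch of what it could look like for the STL version is given below (its exact form is an assumption on our part):

struct saxpy {
    float a;                                  // scale factor, fixed at construction
    explicit saxpy(float a_) : a(a_) {}
    // invoked element-wise by std::transform with the pair (x[i], y[i])
    float operator()(float x, float y) const {
        return a*x + y;
    }
};

For the Thrust version, operator() would additionally be annotated __host__ __device__ so that the functor can be compiled for, and executed on, the GPU.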
With minor changes to the functor saxpy and to the types of the vectors x and y, Thrust provides a corresponding thrust::transform algorithm which functions identically:

thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy(a));

Thrust algorithms are essentially parallelized versions of the corresponding STL algorithms[24]. For example, Thrust "delivers 5x to 100x faster sorting performance than STL"[25] and therefore offers an easy way to accelerate an existing application. Executing Thrust algorithms on an NVIDIA GPU on a suitably equipped and configured system requires compiling the code with nvcc and including the relevant Thrust header files.

CUDA

While we do not use it directly in this thesis, CUDA underlies both OpenACC and Thrust. Furthermore, it appears to dominate the current high performance computing landscape[26] and therefore deserves mentioning.

The core of CUDA consists of extensions to the C language which allow functions, called kernels, to be executed in parallel on the GPU by numerous threads. Kernels are defined using the __global__ specifier and are launched with the <<<...>>> execution configuration syntax. Much like OpenACC, CUDA defines a three-level hierarchy for threads. A kernel is executed as a grid of one or more thread blocks, the first level of parallelism, which partition the problem into coarse sub-problems. Each thread block consists of one or more warps, the second level, which in turn consist of 32 threads operating in SIMD lockstep[27], the third level. Conceptually, CUDA thread blocks map to OpenACC gangs, warps to workers and threads to vector lanes[28]. The execution configuration syntax specifies the grid and block dimensions. For example, function<<<10,128>>>(...) launches the kernel function on a grid of 10 blocks of 128 threads each (128 / 32 = 4 warps per block). Our SAXPY example could take the following form in CUDA[29]:

__global__ void saxpy(int n, float a, float *x, float *y)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < n){   // guard against threads beyond the end of the arrays
        y[idx] = a*x[idx] + y[idx];
    }
}
...

[24] The work complexity of Thrust algorithms is the same as that of the corresponding STL algorithms, according to Jared Hoberock (creator of the Thrust library) in an email conversation with the author of this thesis.
[25] Thrust. [Online; accessed 15-Apr-2015]. URL: https://developer.nvidia.com/Thrust
[26] Recent statistics by the high performance computing on graphics processing units website suggest that NVIDIA and CUDA are the dominant GPU hardware platform and language used by the HPC community. URL: http://hgpu.org/?page_id=3529
[27] NVIDIA actually refers to its SIMD architecture as Single Instruction, Multiple Threads (SIMT), because the SIMD operation is not exposed to the programmer, who reasons about the scalar behavior of individual threads rather than in terms of vector instructions. This allows her, for example, to specify branch divergence at the thread level while ignoring the SIMD behavior.
[28] In practice, the PGI compiler maps OpenACC gangs to 2-dimensional CUDA blocks, workers to the second dimension of the blocks and vector lanes to the first. (Mathew Colgrove, PGI User Forum, URL: http://www.pgroup.com/userforum/viewtopic.php?t=4716)
[29] Mark Harris. Six Ways to SAXPY. [Online; accessed 04-May-2015]. URL: http://devblogs.nvidia.com/parallelforall/six-ways-saxpy/
n = 1048576; // 4096*256
cudaMemcpy(d_x, x, n*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, n*sizeof(float), cudaMemcpyHostToDevice);
saxpy<<<4096,256>>>(n, 2.0, d_x, d_y);
cudaMemcpy(y, d_y, n*sizeof(float), cudaMemcpyDeviceToHost);

After copying the arrays x and y to the device through calls to CUDA's cudaMemcpy function, the saxpy kernel is executed on a grid of 4096 blocks of 256 threads each (256 / 32 = 8 warps per block). Each thread is assigned an index idx and operates on a single element of the result vector. As the kernel is essentially a custom shader, invoked on all points in the input space (all elements of the vectors x and y), the code above is essentially the same as the SAXPY shader seen in paragraph 2.2.2.1.

2.2.3. Applications

Houston (n.d.) gives a few criteria for determining the suitability of an application to GPU computing:

1. Large data sets
2. High parallelism
3. Minimal dependencies between data elements
4. High ratio of arithmetic to memory operations (arithmetic intensity)
5. Large amount of work without CPU intervention

In summary, applications which are data parallel and throughput intensive are well suited for processing on the GPU. Popular areas for GPU computing include medical imaging, computational fluid dynamics, environmental science and computational finance, among many others.

However, relatively few applications have been reported in the field of economics (Aldrich, 2014). Much of the scarce literature concerns likelihood estimation (Creel, Kristensen, et al. (2011), Creal (2012)) and Monte Carlo simulation (A. Lee et al. (2009), Geweke and Durham (2011), Durham and Geweke (2013)). The two most direct applications of GPGPU in computational economics are Aldrich et al.'s (2011) real business cycle model with value function iteration, and Aldrich's (2012) general equilibrium asset pricing model with heterogeneous beliefs. One of the goals of this paper is to provide another application of the GPU in computational economics, focusing in particular on an agent-based model.

2.2.3.1. Agent-Based Models

Agent-based models are used to study systems which:

1. consist of interacting agents; and
2. exhibit emergent properties.
Emergent properties arise from agents' interactions and cannot be deduced by simply aggregating the individual agents' properties (Axelrod and Tesfatsion, n.d.). In many circumstances, such as when agents adapt their behavior in reaction to endogenous or exogenous perturbations, the dynamics and emergent properties of these systems cannot be (fully) described by mathematical analysis. In such cases, simulation through agent-based models may be the only viable and practical method.

Agents are generic entities such as traders, insects, governments, stem cells or companies. Their behavior is based on rules which can range from the simple (a discrete set of if-then behaviors) to the complex (behaviors which are state-dependent, anticipatory, intertemporal, etc.). Due to the system's emergent nature, even simple behavioral rules can lead to unexpectedly rich global system behaviors (Borrill and Tesfatsion, 2011).

Agent-based models are used in fields of research as diverse as biology, sociology and economics. They are well suited to simulating economic systems such as financial markets or entire economies, where researchers seek to understand how individuals behave and interact with one another, as well as the impact of these interactions on the system as a whole. These models can provide more complex and realistic dynamics - often analytically intractable - than the classical Walrasian equilibrium model (Tesfatsion, 2006).

This paper focuses on Raberto et al.'s (2001) agent-based financial market, in which agents are traders randomly buying or selling a single asset. At every discrete time step, agents submit orders constrained by their available cash and asset holdings. A herding phenomenon is introduced to model agent aggregation as the clustering of nodes in a random graph. The model's key ingredient is its market clearing mechanism, which specifies how demand and supply are matched through a clearing price. The resulting price process exhibits some stylized facts of financial time series, such as fat-tailed returns and volatility clustering. Section 3 defines the model in detail.
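To fix ideas before the formal definition, note that an agent's state in this model ultimately reduces to a handful of per-agent values. A minimal sketch in C++, using the array names of the serial implementation of Section 4 (the types shown and the problem size are our assumptions):

#include <vector>

int main(){
    const int nAgents = 10000;            // hypothetical problem size

    // Per-agent state, stored in "agent order": element i belongs to agent i
    std::vector<float> CASH(nAgents);     // C_i(t): cash holdings
    std::vector<int>   SHARES(nAgents);   // A_i(t): stock holdings
    std::vector<float> PROBAS(nAgents);   // P_i: probability of a buy order

    return 0;
}

Keeping agent state in flat arrays rather than in an array of agent objects is what later allows loops over agents to be offloaded to the GPU with minimal changes.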
3. Genoa Artificial Stock Market

Raberto et al. (2001) introduce an artificial stock market model called the Genoa Artificial Stock Market (GASM). The model consists of many heterogeneous agents trading a single stock[30] randomly against cash in discrete time. The aggregate amount of cash in the market is constant and is set exogenously at the beginning of the simulation. A market clearing mechanism specifies how supply and demand are matched through a clearing price. Finally, although the authors introduce a clustering phenomenon to model agent herding, this feature is not implemented in this thesis[31]. The model is able to reproduce stylized facts such as fat-tailed returns, zero autocorrelation of returns and serial autocorrelation of volatility.

3.1. Order Generation

At each time step (or tick) t = 1, ..., T, agent i = 1, 2, ..., N holds C_i(t) and A_i(t) units of cash and stock, respectively (the starting values C_i(0) and A_i(0) for all i are model parameters). With probabilities P_i and 1 − P_i, agent i submits a buy order or a sell order, respectively. An order consists of a limit price p_i > 0 and a quantity q_i, which is positive for buy orders and negative for sell orders.

Table 1: Order Generation Summary

          Buy order                   Sell order
p_i       p(t−1) · N_i(µ, σ_i)        p(t−1) / N_i(µ, σ_i)
q_i       ⌊r_i · C_i(t−1) / p_i⌋      −⌊r_i · A_i(t−1)⌋

3.1.1. Limit Price

The limit price p_i is obtained by multiplying (buy order) or dividing (sell order) the prevailing asset price p(t−1) by a normally distributed random[32] scaling factor N_i(µ, σ_i) ∼ N(µ, σ_i); it is the maximum (buy order) or minimum (sell order) acceptable price at which the order can be matched. The standard deviation σ_i of the random scaling factor is given by

σ_i := k · σ(T_i),

where k is a positive constant and σ(T_i) (the agent sigma) is the standard deviation of log returns over the agent's time window T_i (e.g. 20 steps), defined precisely in the

[30] Despite the name, the model does not actually focus exclusively on stock markets; the stock in the model can be understood to be any asset.
[31] Since the focus of this thesis lies on computation rather than economic modeling, and given the time constraints imposed on the author, this feature was omitted to save time. It is not clear to what extent abandoning agent herding weakens the model's ability to exhibit stylized facts such as volatility clustering or fat-tailed returns. In any case, setting aside the clustering phenomenon does not detract substantially from this thesis' ability to investigate the impact of GPU computing on an agent-based financial market simulation.
[32] All random variables are independent of each other across both agents and time. We assume a standard filtered probability space (Ω, F, P).
Appendix subsection A.1. This introduces a link between asset prices - specifically volatility - and the distribution of limit prices[33]. Furthermore, since T_i is agent-specific, it also introduces heterogeneity across agents and allows for more or less myopic behavior within the market. For µ > 1, limit prices are on average greater than (buy orders) or less than (sell orders) the prevailing asset price, reflecting the desire of trading agents to have their orders cleared. We define the set P of limit prices in the market as follows:

P = {p_i, i = 1, ..., N}.   (1)

3.1.2. Quantity

The quantity q_i of a buy order is a random fraction r_i of the number of stocks which the agent could buy at her limit price p_i given her cash holdings C_i(t−1). The quantity q_i of a sell order is a random fraction r_i of her stock holdings A_i(t−1).

3.2. Market Clearing

Once the order book has been generated, the market must be cleared before trade can occur. This typically involves finding a price, the clearing (or equilibrium) price, at which demand and supply are equal, i.e. at which the total amount of stock sold is equal to the amount of stock bought. The demand function f_t(p) (supply function g_t(p)) at a price p is defined as the sum of all positive (negative) order quantities with limit prices greater (less) than or equal to p (see subsection A.2 in the Appendix).

3.2.1. Clearing Price

Ideally, we would like to find a price p* ∈ P such that f_t(p*) = g_t(p*), that is, at which demand and supply are equal. However, since f_t and g_t are step functions, such a price generally does not exist. The authors specify a simple market clearing mechanism in which the clearing price p* is set to be "the price [...] at which the two functions [f_t(p) and g_t(p)] cross" (Raberto et al., 2001). While the authors do not define this mechanism precisely in mathematical terms, we consider below the various scenarios that can occur in the order book and provide a definition of the clearing price in each. Precise mathematical definitions can be found in the Appendix, subsection A.3.

3.2.1.1. Standard Order Books

Let us consider the two standard order books illustrated in Figure 5, where the supply and demand functions "cross" at a single point. Since there is no price p ∈ R at which demand and supply are equal, we define the clearing price as the limit price at which the supply function "jumps over" the demand function (Standard 1) or vice versa (Standard 2).

[33] As noted by Raberto et al., this is a consequence of trading psychology, and of the impact of high volatility on uncertainty as to the true price.
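For reference, the demand and supply functions used above can be written compactly as follows; this is our reading of the verbal definition in subsection 3.2 (the authoritative definitions are in Appendix subsection A.2):

\[
f_t(p) = \sum_{i \,:\, q_i > 0,\; p_i \ge p} q_i,
\qquad
g_t(p) = -\sum_{i \,:\, q_i < 0,\; p_i \le p} q_i,
\]

where the minus sign makes g_t(p) positive, since sell quantities q_i are negative by convention.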
Figure 5: Standard Order Books. (a) Standard 1; (b) Standard 2. Each panel plots the demand function f_t(p) and the supply function g_t(p) against price, with the equilibrium marked.

3.2.1.2. Pathological Order Books

In certain rare cases[34], two types of pathological order books may arise: the first (of which there are four variants), in which the demand and supply functions are equal over a range of prices, and the second, in which no equilibrium price exists.

Demand Equals Supply

Figure 6 illustrates the four variants of the pathological case in which the demand and supply functions share a common horizontal segment with abscissas p*1 and p*2. In this case, there exists an interval of prices p ∈ R for which demand and supply are equal[35], and the authors define the equilibrium price as the segment's midpoint:

p* = (p*1 + p*2) / 2.   (2)

Note that in this case, p* does not in general belong to P.

No Equilibrium

An additional pathological case may arise in which no equilibrium exists, that is, in which the supply and demand functions simply do not cross. This may occur if the demand at the highest limit price is greater than the supply at that price, or if the demand at the lowest limit price is less than the supply at that price. In this case, the tick is discarded and a new tick begins.

3.2.2. Discarding Excess Demand

Once the clearing price is found, the excess demand f_t(p*) − g_t(p*) at that price is discarded, ensuring that the total number of stocks in the market remains constant. The authors do this by randomly choosing and removing f_t(p*) − g_t(p*) stocks from cleared buy orders (p_i > p*) in case of excess demand f_t(p*) > g_t(p*), or from cleared sell orders (p_i < p*) in case of excess supply f_t(p*) < g_t(p*).

[34] On a trial run of 1,000,000 ticks of the GASM, pathological order books occurred in less than 0.5% of ticks.
[35] However, it is still possible that no limit price p ∈ P at which demand and supply are equal exists, such as in Pathological 4.
Figure 6: Pathological Order Books. (a) Pathological 1; (b) Pathological 2; (c) Pathological 3; (d) Pathological 4. Each panel plots the demand function f_t(p) and the supply function g_t(p) against price, with the equilibrium marked.

3.3. Trade Settling

Once the market has been cleared, i.e. a clearing price p* = p(t) has been found and the excess demand or supply at that price has been discarded, trades can actually be settled. This process involves exchanging cash for shares between the agents whose orders have been cleared for trade at the clearing price. We start by nullifying all uncleared orders:

q_i = q_i  if q_i > 0, p_i ≥ p(t), or q_i < 0, p_i ≤ p(t);
q_i = 0    otherwise;        for i = 1, ..., N.   (3)

The cash holdings C_i(t) and share holdings A_i(t) of agent i are then defined as follows:

C_i(t) = C_i(t−1) − q_i · p(t),   (4)
A_i(t) = A_i(t−1) + q_i.          (5)

Since the order quantity q_i is positive for buy orders, negative for sell orders and zero for uncleared orders, a buying agent has cash subtracted from her cash holdings and shares added to her share holdings, and vice versa for a selling agent, while the holdings of an uncleared agent remain unchanged. For instance, a buyer whose order for q_i = 10 shares clears at p(t) = 100 pays 1000 in cash and receives 10 shares.
4. Serial Implementation

The GASM is relatively straightforward to implement. While a small amount of setup work must be done once at the beginning of the model, we focus on the core of the model, which performs the actual simulation. This core consists of a few tasks performed at each tick, and is described in pseudocode below.

Version 0: GASM core
 1: for all ticks do
 2:   Calculate all agent sigmas
 3:   while p* not found do
 4:     Generate U(0,1) and N(µ, σ_i) pseudorandom numbers
 5:     Generate order book {(q_i, p_i), i = 1, ..., N}
 6:     Find market clearing price p* and remove excess demand k_t(p*), based on
 7:       excess demand, demand and supply function evaluations
 8:   end while
 9:   Nullify uncleared orders
10:   Modify agents' cash and share holdings
11:   Calculate and store output variables (price, log return and volume)
12: end for

In this section, we explain how the core of the model is implemented for sequential execution on the CPU using the C++ programming language. In particular, we focus on the key tasks which will later be ported to the GPU.

4.1. Order Book Generation

At the beginning of each tick, agents must generate and submit the orders which constitute the order book. For each agent, this process involves three functions:

1. calculating the agent sigma σ_i, i.e. the standard deviation of log returns over the agent's time window T_i,
2. generating one normally distributed N(µ, σ_i) and two standard uniformly distributed U(0,1) pseudorandom numbers,
3. calculating and submitting the order's quantity q_i and limit price p_i.

4.1.1. Agent Sigmas

The calculation of agent sigmas consists of a loop over all unique time windows T_i. Since there might be fewer unique time windows than agents (as in Raberto et al. (2001), where all agents share the same time window), we calculate the standard deviation of log returns once per unique time window to avoid redundant calculations. The constant vector of unique time windows UNIQWIN is computed once at the beginning of the simulation.

// Calculate standard deviations for all unique time windows
for(int i=0; i<nUniques; ++i){
    int window = UNIQWIN[i];
    // Look as far back as possible
    // (might be less than the time window at the beginning of the simulation)
    if (tick <= maxWindow+2){
        window = (window < tick-1) ? window : tick-1;
    }
    UNIQSIG[i] = k*standard_deviation(LOGRET, tick-1-window, tick-1);
}

4.1.2. Pseudorandom Number Generation

Once all unique agent sigmas have been calculated, we can proceed to generating pseudorandom numbers (PRNs) for the order generation. For each agent, we generate three standard uniform PRNs using the rand() function from the C standard library. Two are used as-is, for the random fraction r_i of cash/asset holdings and to determine whether the agent is a buyer or a seller, while the third is transformed into a standard normal N(0,1) PRN using the Marsaglia polar method[36], for the random scaling factor N_i(µ, σ_i).

[36] This method generates a pair of standard normal random variables from a pair of standard uniform random variables, and is therefore applied to agents pairwise.

4.1.3. Limit Prices and Quantities

We are now ready to calculate limit prices and quantities for all agents. Note that since the normal distribution has support x ∈ R, the random scaling factor N_i(µ, σ_i), and therefore the limit price p_i, may be negative. While Raberto et al. do not specify the model's behavior in this case, we decide to switch the agent's type if the random scaling factor is negative (i.e. a seller with a negative price switches to a buyer and vice versa). This, however, implies a skew in the agent type distribution with respect to the model parameters, i.e. P[Agent i is a buyer] ≠ P_i.

for(int i=0; i<nAgents; ++i){
    float price, scaling;
    int quantity;

    // UNIQIDX[i] is the index of the agent's sigma in UNIQSIG
    scaling = mu + sqrt( UNIQSIG[ UNIQIDX[i] ] )*RANDN[i];

    // Generate order limit prices and quantities
    if ( RANDU[2*i] <= PROBAS[i] ){
        if ( scaling > 0 ){
            // Agent is buyer
            price = lastPrice*scaling;
            quantity = floor( RANDU[2*i+1]*CASH[i]/price );
        }
        else{
            // Agent is seller
            price = -lastPrice/scaling;
            quantity = -floor( RANDU[2*i+1]*SHARES[i] );
        }
    }
    else{
        if ( scaling > 0 ){
            // Agent is seller
            price = lastPrice/scaling;
            quantity = -floor( RANDU[2*i+1]*SHARES[i] );
        }
        else{
            // Agent is buyer
            price = -lastPrice*scaling;
            quantity = floor( RANDU[2*i+1]*CASH[i]/price );
        }
    }

    // Submit orders with nonzero quantities
    if ( quantity != 0 ){
        ORDERSP[i] = price;
        ORDERSQ[i] = quantity;
    }
    else{
        ORDERSP[i] = 0;
        ORDERSQ[i] = 0;
    }
    INDICES[i] = i;
}

The orders are then submitted to (stored in) the order book, which consists of two arrays ORDERSP and ORDERSQ of length nAgents, in the order of agents' indices: row 0 for agent 0, row 1 for agent 1, etc. (we refer to this as being in "agent order"). Each agent's index i is also stored in the INDICES array, which we use later in the implementation of the market clearing mechanism.

4.2. Market Clearing

Once all orders have been generated and stored, the market clearing price p* must be found (if it exists). We have seen in subsection 3.2 that this is not entirely straightforward, since we are looking for the point at which two step functions "cross". To reduce the problem to a single dimension, let us define the excess demand function k_t(p) as the difference between demand and supply:

k_t(p) := f_t(p) − g_t(p).   (6)

We are looking for the limit price at which the demand function crosses under the supply function, i.e. at which the excess demand k_t jumps from positive to negative (or to zero, in the case of a pathological order book in which demand equals supply). Let us consider the smallest limit price p_0 ∈ P at which excess demand is nonpositive[37]:

p_0 := min{p ∈ P : k_t(p) ≤ 0}.   (7)

[37] Note that this limit price p_0 ∈ P is not necessarily the first price p ∈ R at which excess demand is nonpositive. In Pathological 3, for example, the first limit price at which excess demand is nonpositive is ≈ 86.3, while the first price at which it is nonpositive is the limit as we approach ≈ 85.9 from the right.
Let us also define the first limit prices smaller and greater than p_0, denoted p_0^before and p_0^after respectively, as well as the midpoints between p_0 and these two limit prices:

p_0^before := max{p ∈ P : p < p_0},           (8)
p_0^after  := min{p ∈ P : p > p_0},           (9)
p_0,mid^before := (p_0^before + p_0) / 2,     (10)
p_0,mid^after  := (p_0 + p_0^after) / 2.      (11)

It can be shown that we can uniquely determine the market clearing price p* ∈ {p_0^before, p_0,mid^before, p_0, p_0,mid^after} by evaluating the sign of the excess demand function k_t(p) at just the two prices p_0,mid^before and p_0 (see Appendix B). Finding the clearing price p* hence boils down to the following computational problem:

1. Find p_0,
2. Find p_0^before and p_0^after and calculate p_0,mid^before and p_0,mid^after,
3. Evaluate the excess demands k_t(p_0,mid^before) and k_t(p_0),
4. Set p* according to the signs of k_t(p_0,mid^before) and k_t(p_0) and Table 12.

While steps 3 and 4 are straightforward to implement, the search problems in steps 1 and 2 are nontrivial and can be solved in multiple ways. Before detailing the three Algorithms considered to solve steps 1 and 2, let us first implement the excess demand function k_t(p).

4.2.1. Excess Demand

The excess demand function consists of a for loop over all order quantities, summing those with positive quantities and limit prices greater than or equal to the input price, or with negative quantities and limit prices less than or equal to price:

int excessdemand(const float *ORDERSP, const int *ORDERSQ,
                 int nAgents, const float &price){
    int sum = 0;
    for (int i=0; i<nAgents; ++i){
        if ( (ORDERSQ[i] > 0 && ORDERSP[i] >= price) ||
             (ORDERSQ[i] < 0 && ORDERSP[i] <= price) ){
            sum += ORDERSQ[i];
        }
    }
    return sum;
}

Since buy orders (demand) have positive quantities and sell orders (supply) have negative quantities, we obtain the difference between the two, the excess demand, by simply summing the positive quantities of all compatible buy orders with the negative quantities of all compatible sell orders[38]. We also define demand and supply functions,

[38] By definition of the limit price, a buy/sell order is compatible with a given price if its limit price is greater/lower than this price.
used to check the existence of a market clearing price, which differ from the above only in the condition of their if statement.

4.2.2. Algorithm A

Algorithm A is conceptually the simplest. It consists of the following steps:

1. Evaluate the excess demand function at every limit price and store the results in an array EXDEM.
2. Perform a linear search on EXDEM for the largest nonpositive value k_t(p_0) and return the corresponding limit price p_0 from ORDERSP.
3. Perform two linear searches on EXDEM for
   • the smallest value greater than 0, i.e. k_t(p_0^before), and
   • the largest value smaller than k_t(p_0), i.e. k_t(p_0^after),
   and return the corresponding limit prices p_0^before and p_0^after.

We use two algorithms from the C++ Standard Library for the linear searches: std::max_element to find k_t(p_0) and k_t(p_0^after), and std::min_element to find k_t(p_0^before). Since we must be able to evaluate an order's limit price in case two orders have the same excess demand, the searches are all performed "on" the INDICES array. For example, the following code returns an iterator iter1 to the index of p_0 in INDICES:

// Find the greatest limit price at which the excess demand is less than or
// equal to zero, using a full enumeration algorithm
std::vector<int>::iterator iter1 =
    std::max_element(INDICES, INDICES+nAgents, cmp_max(EXDEM,ORDERSP,0,0));

We supply custom comparison functions (in the above example, cmp_max) to:

1. compare the excess demands (and limit prices, when necessary) at two given indices and indicate which of the two is larger, and
2. ensure that these algorithms return the greatest value smaller than a given value (e.g. 0 for k_t(p_0), or k_t(p_0) for k_t(p_0^after)), or the smallest value greater than a given value (e.g. 0 for k_t(p_0^before)). Otherwise, these algorithms would simply return the absolute greatest or smallest value.

Figure 7 gives a numerical example of all arrays used in the market clearing mechanism of Algorithm A, in agent order. The prices of interest p_0, p_0^before and p_0^after are scattered across the ORDERSP array.

Figure 7: Market Clearing Arrays

INDICES   ORDERSP   ORDERSQ   EXDEM
 0         93.03     -204      4160
 1        120.71      201     -5470
 2        130.96      -21     -6289
 ...        ...       ...       ...
15        100.78     -129      -202   <- p_0^after
 ...        ...       ...       ...
54        100.42      186       188   <- p_0^before
 ...        ...       ...       ...
79        100.68      -75       -73   <- p_0
 ...        ...       ...       ...
98        103.50      -53     -1515
99        122.66      -12     -6082

4.2.3. Algorithm B

Algorithm B avoids Algorithm A's two linear searches for p_0^before and p_0^after, and uses a search algorithm for p_0 that is faster than the linear search. However, this comes at the cost of imposing some order on the input data. The Algorithm consists of the following steps:

1. Sort INDICES such that it indexes limit prices in decreasing order,
2. Evaluate the excess demand function at each limit price indexed by INDICES and store the results in an array EXDEM,
3. Perform a lower bound search for k_t(p_0) in EXDEM,
4. The indices of p_0^before and p_0^after are those adjacent to the index of p_0 in the sorted INDICES array (no further search needed).

The lower bound search algorithm, a variation of the well-known binary search algorithm, requires a sorted input array. Thanks to the monotonicity of the excess demand function[39], sorting ORDERSP in decreasing order and then calculating all corresponding excess demand values into the EXDEM array guarantees that the latter are sorted in nondecreasing order. We employ the std::sort function from the C++ Standard Library, whose underlying algorithm is not specified by the library's standard[40].

// Sort INDICES based on ORDERSP
std::sort(INDICES, INDICES+nAgents, cmp_sort(ORDERSP));

Again, we sort the INDICES array instead of the actual limit prices in ORDERSP, since it is simpler to leave the other arrays untouched for the settling of trades (subsection 4.3). As can be seen in Figure 8, sorting INDICES such that it indexes limit prices in

[39] The excess demand function is the difference between f_t, a càglàd ("left continuous with right limits") and monotonically decreasing function, and g_t, a càdlàg ("right continuous with left limits") and monotonically increasing function. It is therefore interchangeably either càdlàg or càglàd at each point of its domain (càllàl, "continuous on one side, limit on the other side"), as can be seen in the excess demand plots of Figure 5 and Figure 6, and it is monotonically decreasing: k_t(p_1) ≥ k_t(p_2) for all p_1, p_2 ∈ R+ with p_1 ≤ p_2, and for all t ∈ {0, 1, ..., T}. Definitions from Wikipedia (2015a). Càdlàg - Wikipedia, The Free Encyclopedia. [Online; accessed 10-Jan-2015]. URL: http://en.wikipedia.org/w/index.php?title=C%C3%A0dl%C3%A0g&oldid=610070675
[40] Instead, the standard guarantees linearithmic time complexity (average case before C++11, worst case since C++11). cppreference.com (2015b). std::sort - cppreference.com. [Online; accessed 24-Jan-2015]. URL: http://en.cppreference.com/w/cpp/algorithm/sort
ORDERSP in decreasing order leads to the excess demands in EXDEM being sorted in nondecreasing order.

Figure 8: Market Clearing Arrays - Sorted
(Indexed by INDICES, i.e. ORDERSP[INDICES[i]] and ORDERSQ[INDICES[i]] for i = 0, 1, 2, ...)

INDICES   ORDERSP   ORDERSQ   EXDEM
11        133.95      105     -6336
 9        131.02       47     -6289
 2        130.96      -21     -6289
93        128.99      113     -6155
69        126.40       25     -6130
 ...        ...       ...       ...
15        100.78     -129      -202   <- p_0^after
79        100.68      -75       -73   <- p_0
54        100.42      186       188   <- p_0^before
 ...        ...       ...       ...
10         85.62      105      8241
35         85.09     -111      8241
83         83.90     -247      8352
23         83.57      -18      8599
 8         70.75      -44      8617
34         68.04       51      8712

The lower bound search algorithm is based on the binary search algorithm and returns the first value in a list which does not compare less than the search value. It is much more efficient than linear search on sorted arrays, making at most log2(N)+1 comparisons[41] to find the target value, instead of linear search's N comparisons. We use the std::lower_bound function with a search value of 0. It returns the first element in EXDEM - sorted in nondecreasing order - which does not compare less than or equal to 0, i.e. the smallest positive value k_t(p_0^before).

// Find the last limit price at which the excess demand is less than or equal
// to zero
std::vector<int>::iterator low =
    std::lower_bound(EXDEM, EXDEM+nAgents, 0, std::less_equal<int>());

Note that since EXDEM has already been calculated with limit prices in decreasing order, we do not need to take limit prices into consideration and therefore do not need to apply the search on INDICES, as in Algorithm A. Furthermore, once we have found p_0^before, no further search for p_0 and p_0^after is needed, as their index values are the two before that of p_0^before in INDICES. In conclusion, Algorithm B allows us to perform only a single efficient search to find the three values p_0^before, p_0 and p_0^after, at the cost of having to sort the INDICES array beforehand.

[41] cppreference.com (2015a). std::lower_bound - cppreference.com. [Online; accessed 25-Jan-2015]. URL: http://en.cppreference.com/w/cpp/algorithm/lower_bound
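The comparator cmp_sort passed to std::sort above is not reproduced in the text; a minimal sketch of what it might look like is given below (its exact definition is an assumption on our part):

// Orders agent indices by decreasing limit price, so that INDICES indexes
// ORDERSP from the highest limit price down to the lowest
struct cmp_sort {
    const float *P;   // pointer to the ORDERSP array
    explicit cmp_sort(const float *ORDERSP) : P(ORDERSP) {}
    bool operator()(int i, int j) const {
        return P[i] > P[j];
    }
};

Because the comparator dereferences ORDERSP, sorting the lightweight INDICES array effectively sorts the limit prices without moving them.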
4.2.4. Algorithm C

Our last Algorithm is a slight variation on Algorithm B, and stems from the fact that the lower bound search need only evaluate up to log2(N)+1 excess demand values. We can therefore speed up Algorithm B by modifying it such that the excess demand values are not calculated at every limit price, but only at the up to log2(N)+1 < N values which are actually required. The Algorithm consists of the following steps:

1. Sort INDICES such that it indexes limit prices in decreasing order,
2. Perform a lower bound search for the index of p_0^before in ORDERSP, calculating the necessary excess demand values on the fly,
3. The indices of p_0 and p_0^after are the two before that of p_0^before in the sorted INDICES array (no further search needed).

The lower bound search is performed "on" INDICES, since we must evaluate excess demand at the limit prices indexed by INDICES to ensure that these values are ordered in nondecreasing order.

std::vector<int>::iterator iter =
    std::lower_bound(INDICES, INDICES+nAgents, 0,
                     cmp_lower_bound(ORDERSP,ORDERSQ,nAgents));

The on-the-fly calculation of the necessary excess demand values in the lower bound search is performed within the cmp_lower_bound custom comparison function (a possible form of this comparator is sketched at the end of this Algorithm Comparison). The rest of the Algorithm is identical to Algorithm B.

4.2.5. Algorithm Comparison

It is not clear a priori which of the three Algorithms, summarized in Table 2, should perform best. We compare the relative efficiency of the three Algorithms by evaluating their time complexity, expressed in big O notation.

Table 2: Market Clearing Algorithms: Summary

                              Algorithm A   Algorithm B    Algorithm C
Sort                          No            Yes            Yes
Excess demand evaluations     N             N              ≤ log2(N)+1
Search for p_0                Linear        Lower bound    Lower bound
Search for p_0^before         Linear        −              −
Search for p_0^after          Linear        −              −

4.2.5.1. Algorithm A

Algorithm A starts by evaluating the excess demand at each of the N orders. As detailed in subsubsection 4.2.1, the excess demand function takes the form of a for
loop over the N elements of ORDERSP and ORDERSQ. Since this function is called N times to calculate the N excess demand values, and by the additive property of big O notation[42], the calculation of all excess demand values takes quadratic time O(N^2). We must then perform three linear searches, each of which consists of a for loop over all N elements of EXDEM and therefore takes linear time O(N). Overall, Algorithm A takes quadratic time:

O(N^2) [excess demand evaluations] + O(N) [search for p_0] + O(N) [search for p_0^before] + O(N) [search for p_0^after] = O(N^2).

4.2.5.2. Algorithm B

Algorithm B begins by sorting the INDICES array in decreasing order. The std::sort function used in our implementation takes linearithmic time O(N log N) according to the C++ Standard Library documentation. After the quadratic-time excess demand evaluations, the lower bound algorithm takes logarithmic time[43] O(log N). Finally, finding the indices of p_0 and p_0^after amounts to looking up the two values before the index of p_0^before in INDICES, and therefore takes constant time O(1). Overall, since N^2 > N log N > log N for sufficiently large N, Algorithm B takes quadratic time:

O(N log N) [sort] + O(N^2) [excess demand evaluations] + O(log N) [search for p_0] + O(1) [search for p_0^before] + O(1) [search for p_0^after] = O(N^2).

4.2.5.3. Algorithm C

Finally, Algorithm C differs from Algorithm B only in its on-the-fly excess demand evaluations within the lower bound algorithm. This algorithm performs at most log2(N)+1 comparisons, each involving a linear-time excess demand evaluation. The overall time complexity of the excess demand evaluations is therefore[44]:

Σ_{i=1}^{⌊log2(N)⌋+1} O(N) = O( Σ_{i=1}^{⌊log2(N)⌋+1} N ) ⊆ O( (⌊log2(N)⌋+1) · N ) ⊆ O(N·log2(N) + N) = O(N log N).

The rest of Algorithm C is identical to Algorithm B. The Algorithm takes linearithmic time overall:

O(N log N) [sort] + O(N log N) [excess demand evaluations] + O(log N) [search for p_0] + O(1) [search for p_0^before] + O(1) [search for p_0^after] = O(N log N).

[42] It can be shown that O(f(n)) + O(g(n)) = O(f(n) + g(n)), such that Σ_{i=1}^{N} O(N) = O(Σ_{i=1}^{N} N) = O(N^2).
[43] cppreference.com (2015a). std::lower_bound - cppreference.com. [Online; accessed 25-Jan-2015]. URL: http://en.cppreference.com/w/cpp/algorithm/lower_bound
[44] Note that the base of the logarithm is irrelevant, since changing base amounts to multiplying by a constant factor, which big O notation ignores: log2(N) = log(N)/log(2), so O(log2(N)) = O((1/log 2) · log N) = O(log N).
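As announced in subsubsection 4.2.4, a possible form of the cmp_lower_bound comparator is sketched below. This is our own illustration of the on-the-fly evaluation, not the thesis code verbatim; in particular, tie-breaking on limit prices is omitted:

// Comparator passed to std::lower_bound over INDICES: for an agent index i,
// it computes the excess demand k_t at that agent's limit price on the fly
// and compares it to the search value (0).
struct cmp_lower_bound {
    const float *P;   // ORDERSP
    const int   *Q;   // ORDERSQ
    int          n;   // number of agents
    cmp_lower_bound(const float *ORDERSP, const int *ORDERSQ, int nAgents)
        : P(ORDERSP), Q(ORDERSQ), n(nAgents) {}
    bool operator()(int i, int value) const {
        int k = 0;    // k_t(P[i]), computed by one linear pass over all orders
        for (int j = 0; j < n; ++j){
            if ( (Q[j] > 0 && P[j] >= P[i]) ||
                 (Q[j] < 0 && P[j] <= P[i]) ){
                k += Q[j];
            }
        }
        return k <= value;   // excess demand still nonpositive: keep searching
    }
};

With this comparator, std::lower_bound returns the first index whose excess demand exceeds the search value 0, i.e. the index of p_0^before, with each of the at most log2(N)+1 comparisons costing one O(N) excess demand evaluation.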
4.2.5.4. Conclusion

Table 3 compares the worst-case sequential time complexities of each market clearing Algorithm and of the different functions they consist of.

Table 3: Market Clearing Algorithms: Serial Time Complexity

                              Algorithm A   Algorithm B   Algorithm C
Sort                          −             N log N       N log N
Excess demand evaluations     N^2           N^2           N log N
Search for p_0                N             log N         log N
Search for p_0^before         N             1             1
Search for p_0^after          N             1             1
SUM                           N^2           N^2           N log N

Algorithm C has the lowest time complexity, and should therefore be the best performer. While Algorithms A and B share the same overall time complexity in big O notation, one must not forget that this notation ignores constant multiples and therefore does not penalize Algorithm A for its three linear searches. Setting aside the sort and the excess demand evaluations, Algorithm B has a lower time complexity than Algorithm A, its search taking logarithmic rather than linear time. We therefore expect Algorithm C to be faster than Algorithm B, itself faster than Algorithm A.

4.2.6. Discarding Excess Demand

If the order book is of type Standard 1 or Standard 2, there is still negative or positive excess demand at the market clearing price p* once it has been found. To remove the excess demand k_t(p*) and thereby reach an equilibrium, we simply subtract it from the quantity of the order whose limit price is p*. While this is a deviation from Raberto et al.'s mechanism (which randomly discards the excess demand from compatible orders), our method is more realistic and easier to implement.

4.3. Trade Settling

The settling of trades first involves removing trades that are not matched, by setting the corresponding order quantities ORDERSQ[i] to zero.

// Nullify trades that won't be carried out
for(int i=0; i<nAgents; ++i){
    if ( (ORDERSQ[i] > 0 && ORDERSP[i] < eqPrice) ||
         (ORDERSQ[i] < 0 && ORDERSP[i] > eqPrice) ){
        ORDERSQ[i] = 0;
    }
}
The product of each agent's order quantity ORDERSQ[i] and the market clearing price eqPrice is then subtracted from her cash holdings CASH[i], and the order quantity ORDERSQ[i] is added to her share holdings SHARES[i].

// Decrease/increase the cash and share holdings of buyers/sellers
for(int i=0; i<nAgents; ++i){
    CASH[i]   -= ORDERSQ[i]*eqPrice;
    SHARES[i] += ORDERSQ[i];
}
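Before turning to the parallel implementation, we briefly illustrate the Marsaglia polar method mentioned in subsubsection 4.1.2. The sketch below is a generic textbook version, not the thesis's exact code; it produces a pair of independent N(0,1) variates from uniform draws, which is why the transformation is applied to agents pairwise:

#include <cstdlib>
#include <cmath>

// Marsaglia polar method: draw points uniformly in the square [-1,1]^2 until
// one falls strictly inside the unit circle, then map it to two independent
// standard normal values.
void marsaglia_polar(float &z0, float &z1){
    float u, v, s;
    do {
        u = 2.0f*rand()/RAND_MAX - 1.0f;   // U(-1,1)
        v = 2.0f*rand()/RAND_MAX - 1.0f;   // U(-1,1)
        s = u*u + v*v;
    } while (s >= 1.0f || s == 0.0f);      // reject points outside the circle
    float m = std::sqrt(-2.0f*std::log(s)/s);
    z0 = u*m;
    z1 = v*m;
}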
5. Parallel Implementation

The core of the GASM essentially consists of a loop over two dimensions: first over time (ticks), then over agents.

Iterations of the time dimension, ticks, are not independent. Indeed, the equilibrium price at a given tick t is a function of agents' orders, which depend (among other things) on these agents' cash and share holdings at that tick. These holdings are themselves dependent on the equilibrium price and agents' orders at the previous tick t−1.

Agents, however, are logically identical but functionally independent; their pseudorandom numbers, cash and share holdings, and order quantities and limit prices are independent of those of all other agents. Each agent generates and submits trades without any interaction with other agents[45]. Therefore, the agent dimension of our code can be executed in parallel, while the time dimension cannot. This entails that we cannot parallelize execution across ticks, but only within ticks.

This section describes how the various core functions performed at every tick are ported to the GPU using OpenACC and Thrust.

Table 4: Algorithm Versions
(Columns: Versions 0-6. Rows, in execution order: calculate agent sigmas; pseudorandom number generation; order generation; demand, supply & excess demand functions; market clearing; nullify uncleared orders; modify agents' holdings; calculate and store output variables. Cell shading in the original indicates CPU (gray) or GPU (blue) execution.)

Table 4 indicates which functions of the GASM's core are executed on the CPU (in gray) and on the GPU (in blue), respectively, in each of the seven different Versions. The functions are displayed in the order in which they are executed, except for the demand function, which is evaluated once to calculate an output variable. Version 0 is the purely serial implementation executed entirely on the CPU, while Version 6 ports the entire core of the model to the GPU for parallel execution.

Because the CPU and the GPU do not share memory (in our workstation, as in most high performance systems), the shifting of computations between the two in each Version has implications for data transfers. These transfers are very expensive and are one of the reasons why our goal is to port the whole of the model's core to the GPU: to minimize the amount of memory transfers. In particular, we focus our attention on core data transfers, i.e. those performed at every tick of the model due to a change in value of the underlying variables or arrays. Since these

[45] At least in our setting of the GASM, without opinion propagation among agents through clusters.
transfers are repeated thousands of times during the simulation, they have a much greater impact on the program's running time than non-core data transfers, i.e. those performed only once, outside the core of the model.

5.1. Version 1

Version 1 ports the generation of order quantities and limit prices to the GPU. This function consists of a for loop which is free of data dependencies and can therefore easily be parallelized using an OpenACC loop directive within a parallel region. The code within the loop is identical to that of the serial implementation described in subsubsection 4.1.3.

#pragma acc parallel loop present(...,ORDERSP[0:nAgents],ORDERSQ[0:nAgents],...)
for(int i=0; i<nAgentsCopy; ++i){
    ...
    ORDERSP[i] = price;
    ORDERSQ[i] = quantity;
    ...
}

Note that the loop trip count nAgents has been changed to the variable nAgentsCopy = nAgents, defined within the same scope as the loop. This is necessary because nAgents shares the same data type (int) as at least one array operated on in the loop, which causes problems for the compiler[46].

While the loop is free of dependencies and can therefore be parallelized, it suffers from branch divergence due to the three if-else statements it contains. Each thread executing the body of the loop can therefore take 2^3 = 8 different execution paths: two for buyers and sellers, two for positive and negative random scaling factors, and two for zero and nonzero order quantities. Assuming the execution paths of the 32 threads within a warp are uniformly distributed across the 8 different branches, the warp will pass through the body of the loop 8 times, with only 32 / 8 = 4 threads performing useful work at each pass. This results in a 4/32 = 1/8 = 12.5% utilization of the processing elements.

Furthermore, this Version incurs the greatest amount of core data transfers. Indeed, the generation of the two arrays ORDERSP and ORDERSQ requires synchronizing the CPU-generated standard normal (RANDN) and standard uniform (RANDU) pseudorandom numbers, the agent sigmas UNIQSIG, the agents' holdings CASH and SHARES, and the last market price lastPrice with the GPU at every tick. Table 5 summarizes the core data transfers in and out of the GPU in Version 1.

[46] According to Mathew Colgrove at PGI, for an example for loop over an integer array data[] of length nData: "The integer loop is not parallelizable when nData is an int and data is an int* because the compiler has no way of knowing if data points to nData, and as such must presume that it does. Hence the loop is not countable and therefore not parallelizable or vectorizable. The same issue occurs with host code. If data was a float (or other non-int), then this issue wouldn't occur." Rob Farber (2014). Pragma Puzzler - Ambiguous Loop Trip Count in OpenMP and OpenACC - TechEnablement. [Online; accessed 31-Jan-2015]. URL: http://www.techenablement.com/pragma-puzzler-ambiguous-loop-trip-count-in-openmp-and-openacc/
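To make the transfer discussion concrete, the sketch below shows one way a structured data region could keep per-agent arrays resident on the GPU across ticks, pushing only the freshly generated inputs each tick. This is an illustrative pattern consistent with the present() clauses used in Version 1's loop above, not the thesis's actual code; the array names, extents and loop structure are assumptions:

// Hypothetical data region around the tick loop: arrays listed in copyin()
// and create() stay resident on the GPU for the whole simulation.
#pragma acc data copyin(PROBAS[0:nAgents]) \
                 create(ORDERSP[0:nAgents],ORDERSQ[0:nAgents])
{
    for(int tick=1; tick<=nTicks; ++tick){
        // ... CPU-side work, e.g. pseudorandom number generation ...

        // Core data transfer: push only the inputs that changed this tick
        #pragma acc update device(RANDN[0:nAgents],RANDU[0:2*nAgents])

        #pragma acc parallel loop present(ORDERSP[0:nAgents],ORDERSQ[0:nAgents])
        for(int i=0; i<nAgentsCopy; ++i){
            // ... order generation as in Version 1 ...
        }
    }
}

With such a region, arrays named in present() clauses are found on the device rather than re-copied, which suggests how later Versions can reduce the core data transfers tabulated in Table 5.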