GPU Seminar Report Covers Graphics Processing

A seminar report on,
“GRAPHICS PROCESSING UNIT”
by:
Mr. Siddhartha V

ACKNOWLEDGEMENT:
I would like to thank respected Mr…….. and Mr. ……..for giving me such a wonderful
opportunity to expand my knowledge for my own branch and giving me guidelines to present a
seminar report. It helped me a lot to realize of what we study for.
Secondly, I would like to thank my parents who patiently helped me as i went through my work and
helped to modify and eliminate some of the irrelevant or un-necessary stuffs
.
Thirdly, I would like to thank my friends who helped me to make my work more organized and
well-stacked till the end.
Next, I would thank Microsoft for developing such a wonderful tool like MS Word. It helped my
work a lot to remain error-free
.
Last but clearly not the least, I would thank The Almighty for giving me strength to complete my
report on time.

PREFACE:
I have made this report file on the topic Graphics Processing Unit; I have tried my best to
elucidate all the relevant detail to the topic to be included in the report. While in the beginning I
have tried to give a general view about this topic.
My efforts and wholehearted co-corporation of each and everyone has ended on a successful
note. I express my sincere gratitude to …………..who assisting me throughout the preparation of
this topic. I thank him for providing me the reinforcement, confidence and most importantly the
track for the topic whenever I needed it.

CONTENTS:
1 .ABSTRACT
2. INTRODUCTION
3. WHAT’S A GPU ???
4. HISTORY AND STANDARDS
5 .COMPUTER GRAPHICS MILESTONE
6. GPU ARCHITECHTURE
7. MODERN GPU ARCHITECHTURE
8. PERIPHERAL COMPONENT INTERCONNECT
9. ACCELLERATED GRAPHICS PORT
10. HOW IS 3D ACCELLERATION DONE ?
11. COMPONENTS OF GPU
12. PERFORMANCE FACTOR OF GPU
13. TYPES OF GPU
14. GPU COMPUTING
15. TECHNICHES AND APPLICATIOS
16 .ALGORITHMS AND APPLICATIONS
17. TOP TEN PROBLEMS IN GPU
18. CONCLUTION

ABSTRACT :
The graphics processing unit (GPU) has become an Integral part of today’s mainstream
computing systems. Over The past six years, there has been a marked increase in the Performance
and capabilities of GPUs. The modern GPU is not Only a powerful graphics engine but also a highly
parallel Programmable processor featuring peak arithmetic and memory Bandwidth that
substantially outpaces its CPU counterpart. The GPU’s rapid increase in both programmability and
Capability has spawned a research community that has Successfully mapped a broad range of
computationally demanding, Complex problems to the GPU. This effort in general purpose
Computing on the GPU, also known as GPU computing, Has positioned the GPU as a compelling
alternative to Traditional microprocessors in high-performance computer Systems of the future.
We describe the background, hardware, And programming model for GPU computing, summarize
the State of the art in tools and techniques, and present four GPU Computing successes in game
physics and computational Biophysics that deliver order-of-magnitude performance gains Over
optimized CPU applications.
A Graphics Processing Unit (GPU) is a microprocessor that has been designed specifically
for the processing of 3D graphics. The processor is built with integrated transform, lighting, triangle
setup/clipping, and rendering engines, capable of handling millions of math-intensive processes per
second. GPUs form the heart of modern graphics cards, relieving the CPU (central processing units)
of much of the graphics processing load. GPUs allow products such as desktop PCs, portable
computers, and game consoles to process real-time 3D graphics that only a few years ago were only
available on high-end workstations.

INTRODUCTION:
There are various applications that require a 3D world to be simulated as realistically as
possible on a computer screen. These include 3D animations in games, movies and other real world
simulations. It takes a lot of computing power to represent a 3D world due to the great amount of
information that must be used to generate a realistic 3D world and the complex mathematical
operations that must be used to project this 3D world onto a computer screen. In this situation, the
processing time and bandwidth are at a premium due to large amounts of both computation and
data.
The functional purpose of a GPU then, is to provide a separate dedicated graphics resources,
including a graphics processor and memory, to relieve some of the burden off of the main system
resources, namely the Central Processing Unit, Main Memory, and the System Bus, which would
otherwise get saturated with graphical operations and I/O requests. The abstract goal of a GPU,
however, is to enable a representation of a 3D world as realistically as possible. So these gpus are
designed to provide additional computational power that is customized specifically to perform these
3D tasks.
The GPU is designed for a particular class of applications with the following characteristics.
Over the past few years, a growing community has identified other applications with
similar characteristics and successfully mapped these applications onto the GPU.
• Computational requirements are large. Real-time rendering requires billions of pixels per second,
and each pixel requires hundreds or more operations. GPUs must deliver an enormous amount of
compute performance to satisfy the demand of complex real-time applications.
.
• Parallelism is substantial. Fortunately, the graphics pipeline is well suited for parallelism.
Operations on vertices and fragments are well matched to finegrained closely coupled
programmable parallel compute units, which in turn are applicable to many other computational
domains.
• Throughput is more important than latency. GPU implementations of the graphics pipeline
prioritize throughput over latency. The human visual system operates on millisecond time scales,
while operations Within a modern processor take nanoseconds. This six-order-of-magnitude gap
means that the Latency of any individual operation is unimportant. As a consequence, the graphics
pipeline is quite deep, perhaps hundreds to thousands of cycles, With thousands of primitives in

flight at any given time. The pipeline is also feed-forward, removing The penalty of control
hazards, further allowing optimal throughput of primitives through the Pipeline. This emphasis on
throughput is characteristic of applications in other areas as well
.
Just as important in the development of the GPU as a general-purpose computing engine has
been the advancement of the programming model and programming tools. The challenge to GPU
vendors and researchers has been to strike the right balance between low-level access to the
hardware to enable performance and high-level programming languages and tools that allow
programmer flexibility and productivity, all in the face of rapidly advancing hardware. Because of
the primitive nature of the tools and techniques, the first generation of applications were notable for
simply working at all. As the field matured, the techniques became more sophisticated and the
comparisons with non-GPU work more rigorous.
Even though GPUs today have more computational horsepower, they are finetuned for the
type of computation that is required for computer graphics which is highly parallel, numerically
demanding with little to no data reuse. Although many types of computations demonstrate qualities
of high parallelism and numerical intensiveness, they also require significant data reuse. CPUs have
large caches with high bandwidths that facilitate the reuse of data that makes them very suitable for
general purpose computation whereas a GPU has much smaller caches with lower bandwidths since
they are geared for the type of computations that are required for graphics.
We conclude by looking to the future: what features can we expect in future systems, and
what are the most important problems that we must address as the field moves forward? One of the
most important Challenges for gpu computing is to connect with the mainstream fields of processor
architecture and programming Systems, as well as learn from the parallel computing experts of the
past.

WHAT’S A GPU????
A Graphics Processing Unit (GPU) is a microprocessor that has been designed specifically
for the processing of 3D graphics. The processor is built with integrated transform, lighting, triangle
setup/clipping, and rendering engines, capable of handling millions of math-intensive processes per
second. Gpus form the heart of modern graphics cards, relieving the CPU (central processing units)
of much of the graphics processing load. Gpus allow products such as desktop pcs, portable
computers, and game consoles to process real-time 3D graphics that only a few years ago were only
available on high-end workstations.
Used primarily for 3-D applications, a graphics processing unit is a single-chip processor
that creates lighting effects and transforms objects every time a 3D scene is redrawn. These are
mathematically-intensive tasks, which otherwise, would put quite a strain on the CPU. Lifting this
burden from the CPU frees up cycles that can be used for other jobs.
However, the GPU is not just for playing 3D-intense videogames or for those who create
graphics (sometimes referred to as graphics rendering or content-creation) but is a crucial
component that is critical to the PC's overall system speed. In order to fully appreciate the graphics
card's role it must first be understood.
Many synonyms exist for Graphics Processing Unit in which the popular one being the
graphics card .It’s also known as a video card, video accelerator, video adapter, video board,
graphics accelerator, or graphics adapter.

HISTORY AND STANDARDS
The first graphics cards, introduced in August of 1981 by IBM, were monochrome cards
designated as Monochrome Display Adapters (mdas). The displays that used these cards were
typically text-only, with green or white text on a black background. Color for IBM-compatible
computers appeared on the scene with the 4-color Hercules Graphics Card (HGC), followed by
the 8-color Color Graphics Adapter (CGA) and 16-color Enhanced Graphics Adapter (EGA).
During the same time, other computer manufacturers, such as Commodore, were introducing
computers with built-in graphics adapters that could handle a varying number of colors.
When IBM introduced the Video Graphics Array (VGA) in 1987, a new graphics standard
came into being. A VGA display could support up to 256 colors (out of a possible 262,144-color
palette) at resolutions up to 720x400. Perhaps the most interesting difference between VGA and the
preceding formats is that VGA was analog, whereas displays had been digital up to that point.
Going from digital to analog may seem like a step backward, but it actually provided the ability to
vary the signal for more possible combinations than the strict on/off nature of digital
.
Over the years, VGA gave way to Super Video Graphics Array (SVGA). SVGA cards
were based on VGA, but each card manufacturer added resolutions and increased color depth in
different ways. Eventually, the Video Electronics Standards Association (VESA) agreed on a
standard implementation of SVGA that provided up to 16.8 million colors and 1280x1024
resolution. Most graphics cards available today support Ultra Extended Graphics Array (UXGA).
UXGA can support a palette of up to 16.8 million colors and resolutions up to 1600x1200 pixels.
Even though any card you can buy today will offer higher colors and resolution than the
basic VGA specification, VGA mode is the de facto standard for graphics and is the minimum on
all cards. In addition to including VGA, a graphics card must be able to connect to your computer.
While there are still a number of graphics cards that plug into an Industry Standard Architecture
(ISA) or Peripheral Component Interconnect (PCI) slot, most current graphics cards use the
Accelerated Graphics Port (AGP).

COMPUTER GRAPHICS MILESTONES:
The term ―Computer Graphics‖ was devised by a Boeing graphics designer in 1960 to
describe the work he was doing for Boeing of designing a more efficient cockpit space. His
contributions are a computer generated orthographic view of the human body and some of the first
3D computer animation. There are many additional technologies and advancements that have
helped to propel the field of computer graphics to its current state. In this section we will recognize
many milestones from the first digital computer to the beginnings of the personal computer (PC).
The ramifications of these events far exceed their contributions to the field of computer graphics.
Most people would agree that the first digital computer is where the first computer graphics systems
started.
THE WHIRLWIND AND SAGE PROJECTS:
Fig 1: Whirlwind computer project Fig 2: SAGE Project
By almost any measure—scale, expense, technical complexity, or influence on future
developments—the single most important computer project of the postwar decade was MIT’s
Whirlwind and its offspring, the SAGE computerized air defense system . The design and
implementation of the Whirlwind computer project began in 1944 by Jay Forrester and Ken Olsen
of MIT. The project came out of the Navy’s Airplane Stability and Control Analyzer project
(ASCA). The idea was to come up with a design for a programmable flight simulation computer
which could be programmed to provide training for Navy pilots on any aircraft without having to
customize a new computer for every aircraft type.
Although Whirlwind was not the first digital computer, it was the first computer built
specifically for interactive, real‐time control which displayed real‐time text and graphics on a video
terminal . Because the memory at the time was not fast enough to allow Whirlwind to be a real‐time
control computer, a new type of memory was created by Jay Forrester called core memory. This
was the technique of using a matrix of wires with donut shaped ferrite ceramic magnets (called a

core) at each junction to produce random access memory. Although the focus of the Whirlwind
project started out as a general‐purpose flight simulator, it soon morphed into a design for a
general‐purpose, real‐time digital computer with the ability to do more than flight simulation
calculations. The Air Force saw a potential for this real‐time, general purpose computer, where
there was none before and took over funding of the Whirlwind project to support a project of their
own termed SAGE (Semi‐Automatic Ground Environment) .
By 1958 the Air Force project SAGE had its first operational directional center using the
Whirlwind computer as the control center. The SAGE project shown in Figure 2 was designed to
use radar information to track and identify aircraft in flight by converting it into computer generated
pictures. The SAGE project continued the advancements of the Whirlwind project and the
combination of Whirlwind and
SAGE presented several advancements that helped create the field of computer graphics.
Core memory allowed the replacement of slower, physically larger and smaller memory sizes that
relied on vacuum Tubes and thus helped propel real‐time computing. Although the CRT (cathode
ray tube) was being used as a display for televisions, oscilloscopes and some other computer
projects, it was the Whirlwind/SAGE projects that showed the CRT to be a feasible option for
display and interactive computing. The Whirlwind/SAGE projects also introduced the light pen as
an input device to allow the user to mark spots on the CRT for the computer to be able to store and
use the data for computations.
.
MIT’s TX‐0 and TX‐2:
The Whirlwind computer was built using 5,000 vacuum tubes. The SAGE computer system
using the newer Whirlwind II still used vacuum tubes, but with the advent of the transistor now also
incorporated over 13,000 transistors . As the transistor replaced the vacuum tubes in computers, it
allowed for greatly reduced temperature and space requirements. The first real‐time, programmable,
general purpose computer made solely from transistors was the MIT built TX‐0 (Transistorized
Experimental Computer Zero) . The TX‐0 was basically a transistorized Whirlwind. The Whirlwind
computer was so large that it filled an entire
Floor of a large building, but the TX‐0 fit in the relatively smaller area of a room and was
somewhat faster . The TX‐0 was the fastest computer of its era and was the predecessor of the
minicomputer and the direct ancestor of the DEC PDP‐1. The Lincoln laboratory was paid by the
Air Force to build the TX‐0 to determine if it was feasible to build a major computing system based
on only transistors instead of vacuum tubes. It was also used to test the viability of larger, complex
core memory first implemented in Whirlwind. The TX‐2 project was begun shortly after the

successful completion of TX‐0 in 1956. It was much larger than the TX‐0 and was built using
22,000 transistors and was key to the evolution of interactive computer graphics. There are many
contributions that the TX‐0/TX‐2 projects provided for the field of computer graphics. Besides the
obvious contribution of showing a much smaller, more powerful transistorized computer is
possible; the project made extensive use of the CRT and light pen as an interactive graphics
workstation and pushed the next level of computer graphics evolution. The TX‐0/TX‐2 would be
used by Ivan Sutherland to create the famed interactive graphics program Sketchpad . The project
allowed the Co‐creator, Ken Olsen, to take the knowledge he gained on the project and start a
company that would go on to become one of the most influential computer companies of the 1970s
and 1980s, Digital Equipment Corporation (DEC).
SUTHERLAND’S SKETCHPAD:
Not only did the TX‐2 have a 9 inch CRT display, but it also had a number of other input
and output devices such as the light pen, a magnetic tape storage device, an on‐line typewriter, a
bank of control switches, paper tape for program input and the first Xerox printer. All of these new
man‐machine interfaces made for an environment that was ripe for the right person to take
advantage of them and that person was Ivan Sutherland at MIT. In his 1963 Ph.D. Thesis using the
powerful TX‐2, Ivan Sutherland created Sketchpad which as time has revealed was the precursor of
the direct manipulation computer graphic interface of today . For his work beginning with
Sketchpad, he is widely recognized as the grandfather of interactive computer graphics . Sketchpad
was a program written for the TX‐2 that allowed a user to draw and manage points, line segments,
and arcs on a CRT monitor using a hand‐held Light pen. These drawings were very precise and
offered a scale of 2000:1 allowing for a relatively large drawing area . These drawings were not
merely pictures, but were more importantly computer data that was presented and manipulated by
the user graphically.
The user had the ability to create object relationships using the various primitives that were
allowed and could build up complex drawings by combining different elements and shapes.
Sketchpad was important for many reasons, but most importantly it freed the user from having to
program the computer with Instructions to perform. Instead, it allowed the user to interact with the
computer via the light pen and the CRT monitor thus setting the ground work for more advanced
man‐machine interaction. Sketchpad is the ancestor of computer aided design (CAD) and the
modern graphical user interface (GUI).

Digital Equipment Corporation and the Minicomputer:
Shortly after starting on the design of the TX‐2 in 1957, Ken Olsen, one of the original
designers and builders of the TX‐0 and the TX‐2 computers left MIT and became the founder of
the Digital Equipment Corporation (DEC). In Ken Olsen’s own words, ―I was building the
hardware, somebody else was designing the logic and they couldnʹt settle down. So after a year or
two of that I got impatient and left” . That would prove to be the best move of his life not to
mention a necessary step in the evolution of the computer graphics industry. Shortly after in 1960,
the DEC PDP‐1 (programmed data processor) became the commercial manifestation of the Air
Force funded, MIT designed and built TX‐0/TX‐2 .
Although only 49 PDP‐1s were ever sold, DEC went on to build some of the most
influential computers of their time with the first minicomputer the PDP‐8 and also the 16‐bit word
PDP‐11 being their most successful and important computer with over 600,000 units and over 20
years of production of the PDP‐11 alone . Not only was the PDP‐11 the impetus behind several
successful CPU architectures such as the VAX supermini and the Motorola 68000 microprocessor
family, it is also important for its general computing nature since it had over 23 different operating
systems written for it and 10 different programming languages. The DEC VAX would become the
workhorse of the CAD industry in the 1980s.
High‐end cad graphics systems:
As mentioned earlier in this chapter, computer graphics and most certainly Computer Aided
Design (CAD) can be traced back to Sutherland’s work with Sketchpad. The idea of having the
computer do all of the graphical calculations, being able to manipulate an existing drawing, not
having to manually draw, correct and redraw a drawing and all dependent drawings appealed to
industries that relied on graphical representations like automotive and aerospace. General Motors
took the idea of CAD and CAM (Computer Aided Manufacturing) and teamed up with IBM to
begin working on one of the first CAD/CAM systems, which was the DAC‐1 (Design Augmented
by Computer) . The DAC‐1 system expanded the Sketchpad idea by creating, rotating and viewing
3D models, but the system still utilized a time‐shared model of computing .
In 1967, the IDIIOM (Information Displays, Inc., Input‐Output Machine) became the first
stand‐alone CAD workstation. Building on the demand and success of previous CAD workstations,
IDI, a company that manufactured primarily custom CRT displays, began looking at combining
their displays with a small third‐party, integrated circuit computer to produce a relatively
inexpensive stand‐alone CAD workstation. The IDIIOM helped move the CAD and graphics
industry forward by taking the computing expensive, time‐sharing mainframe component out of the

CAD environment by implementing their system with a much less expensive minicomputer instead
of a 13 mainframe. They also designed and programmed their own CAD software called IDADS
(Information Displays Automatic Drafting System), which greatly reduced current drafting costs
and turn around time. The benefit to providing the user‐friendly CAD software application with
IDIIOM was that the user was no longer required to code their own software to utilize the machine.
The pc revolution:
The most important part of the computer is the central processing unit (CPU). Generically, a
CPU has circuit logic for a control unit that takes instructions from memory and then decodes and
executes them and also circuitry for arithmetic and logical operations mainly referred to as an
arithmetic logic unit (ALU). Early computers used transistors and/or vacuum tubes for this logic.
With the advent of integrated circuits (IC) the microprocessor was able to take all of this circuitry
and put it onto a single silicon IC, a single chip. This allowed for much more complex logic to be
put into small chips thus reducing size, cost and temperature. With the creation of the first
microprocessor in 1971, the Intel 4004, development of computers for personal use became
appealing and affordable.
It is hard to pinpoint when the first PC came into existence as the idea is highly ambiguous
at best. Thus to simplify, I define PC as being digital, including a microprocessor, being
user‐programmable, commercially manufactured, small enough to allow it to be moved by the
average person, inexpensive enough to be affordable by the average professional and simple enough
to be used without special training. The MITS (Micro Instrumentation Telemetry Systems) Altair
8800 featured on the cover of Popular Electronics magazine in 1975 is the computer that fits the
criterion . It doesn’t quite resemble what we think of today as a PC since it didn’t have a CRT or a
keyboard. The user programmed the computer by flipping toggle switches and read the output from
a panel of neon bulbs.
With the announcement of the Altair 8800 many other companies starting creating pcs most
notably Apple Computer Inc. Pcs were initially only being purchased by consumers in small
numbers, but in 1981 IBM came out with the IBM PC and instantly legitimized the PC market.
With IBM’s entry into the PC market, corporations finally started purchasing pcs for business use.
This mass adoption by corporations also had the effect of creating standards for the PC industry
which further propelled the market. With the advent of the PC and industry standards for hardware
as well as software, the computer graphics industry was poised to take great strides in development.

GPU ARCHITECTURE
The GPU has always been a processor with ample computational resources. The most
important recent trend, however, has been exposing that computation to the programmer. Over the
past few years, the GPU has evolved from a fixed-function special-purpose processor into a full-
fledged parallel programmable processor with additional fixed-function special-purpose
functionality. More than ever, the programmable aspects of the processor have taken centre stage.
We begin by chronicling this evolution, starting from the structure of the graphics pipeline and how
the GPU has become a general-purpose architecture, then taking a closer look at the architecture of
the modern GPU.
The Graphics Pipeline
The input to the GPU is a list of geometric primitives, typically triangles, in a 3-D world
coordinate system. Through many steps, those primitives are shaded and mapped onto the screen,
where they are assembled to create a final picture. It is instructive to first explain the specific steps
in the canonical pipeline before showing how the pipeline has become programmable.
 Vertex Operations: Vertex operations transform raw 3D geometry into the 2D
plane of your monitor. Vertex pipelines also eliminate unneeded geometry by
detecting parts of the scene that are hidden by other parts and simply discarding
those parts.
 Primitive Assembly: The vertices are assembled into triangles, the fundamental
hardware-supported primitive in today’s GPUs.
 Rasterization: Rasterization is the process of determining which screen-space pixel
locations are covered by each triangle. Each triangle generates a primitive called a
―fragment‖ at each screen-space pixel location that it covers. Because many triangles
may overlap at any pixel location, each pixel’s color value may be computed from
several fragments.

 Fragment Operations: Using color information from the vertices and possibly
fetching additional data from global memory in the form of textures (images that are
mapped onto surfaces), each fragment is shaded to determine its final color. Just as
in the vertex stage, each fragment can be computed in parallel. This stage is typically
the most computationally demanding stage in the graphics pipeline.
 Composition: Fragments are assembled into a final image with one color per pixel,
usually by keeping the closest fragment to the camera for each pixel location.
The Graphics logical pipeline(The programmable blocks are in blue)
Historically, the operations available at the vertex and fragment stages were configurable but not
programmable. For instance, one of the key computations at the vertex stage is computing the color
at each vertex as a function of the vertex properties and the lights in the scene. In the fixed-function
pipeline, the programmer could control the position and color of the vertex and the lights, but not
the lighting model that determined their interaction.
ARCHITECTURE OF A MODERN GPU
We noted that the GPU is built for different application demands than the CPU: large,
parallel computation requirements with an emphasis on throughput rather than latency.
Consequently, the architecture of the GPU has progressed in a different direction than that of the
CPU.

Basic Unified GPU architecture: The programmable shader stages execute on the array of
unified processors, and the logical graphics pipeline dataflow recirculates through the processors.
Consider a pipeline of tasks, such as we see in most graphics APIs (and many other applications),
that must process a large number of input elements. In such a pipeline, the output of each
successive task is fed into the input of the next task. The pipeline exposes the task parallelism of the
application, as data in multiple pipeline stages can be computed at the same time; within each stage,
computing more than one element at the same time is data parallelism. To execute such a pipeline, a
CPU would take a single element (or group of elements) and process the first stage in the pipeline,
then the next stage, and so on. The CPU divides the pipeline in time, applying all resources in the
processor to each stage in turn.
GPUs have historically taken a different approach. The GPU divides the resources of the
processor among the different stages, such that the pipeline is divided in space, not time. The part of
the processor working on one stage feeds its output directly into a different part that works on the
next stage.
This machine organization was highly successful in fixed-function GPUs for two reasons.
First, the hardware in any given stage could exploit data parallelism within that stage, processing
multiple elements at the same time. Because many task-parallel stages were running at any time, the
GPU could meet the large compute needs of the graphics pipeline. Secondly, each stage’s hardware
could be customized with special-purpose hardware for its given task, allowing substantially greater
compute and area efficiency over a general-purpose solution. For instance, the Rasterization stage,
which computes pixel coverage information for each input triangle, is more efficient when
implemented in special-purpose hardware. As programmable stages (such as the vertex and
fragment programs) replaced fixed-function stages, the special-purpose fixed function components
were simply replaced by programmable components, but the task-parallel organization did not
change.
The result was a lengthy, feed-forward GPU pipeline with many stages, each typically
accelerated by special purpose parallel hardware. In a CPU, any given operation may take on the
order of 20 cycles between entering and leaving the CPU pipeline. On a GPU, a graphics operation
may take thousands of cycles from start to finish. The latency of any given operation is long.
However, the task and data parallelism across and between stages delivers high throughput.

PERIPHERAL COMPONENT INTERCONNECT (PCI)
There are a lot of incredibly complex components in a computer. And all of these parts need
to communicate with each other in a fast and efficient manner. Essentially, a bus is the channel or
path between the components in a computer. During the early 1990s, Intel introduced a new bus
standard for consideration, the Peripheral Component Interconnect (PCI).It provides direct access to
system memory for connected devices, but uses a bridge to connect to the front side bus and
therefore to the CPU.
Fig 3: The illustration above shows how the various buses connect to the CPU.
PCI can connect up to five external components. Each of the five connectors for an external
component can be replaced with two fixed devices on the motherboard. The PCI bridge chip
regulates the speed of the PCI bus independently of the CPU's speed. This provides a higher degree
of reliability and ensures that PCI-hardware manufacturers know exactly what to design for.
PCI originally operated at 33 MHz using a 32-bit-wide path. Revisions to the standard include
increasing the speed from 33 MHz to 66 MHz and doubling the bit count to 64. Currently, PCI-X
provides for 64-bit transfers at a speed of 133 MHz for an amazing 1-GBps (gigabyte per second)
transfer rate!
PCI cards use 47 pins to connect (49 pins for a mastering card, which can control the PCI
bus without CPU intervention). The PCI bus is able to work with so few pins because of hardware
multiplexing, which means that the device sends more than one signal over a single pin. Also, PCI
supports devices that use either 5 volts or 3.3 volts. PCI slots are the best choice for network
interface cards (NIC), 2-D video cards, and other high-bandwidth devices. On some PCs, PCI has
completely superseded the old ISA expansion slots.
Although Intel proposed the PCI standard in 1991, it did not achieve popularity until the
arrival of Windows 95 (in 1995). This sudden interest in PCI was due to the fact that Windows 95

supported a feature called Plug and Play (PnP). PnP means that you can connect a device or insert
a card into your computer and it is automatically recognized and configured to work in your system.
Intel created the PnP standard and incorporated it into the design for PCI. But it wasn't until several
years later that a mainstream operating system, Windows 95, provided system-level support for
PnP. The introduction of PnP accelerated the demand for computers with PCI.
ACCELERATED GRAPHICS PORT (AGP)
The need for streaming video and real-time-rendered 3-D games requires an even faster
throughput than that provided by PCI. In 1996, Intel debuted the Accelerated Graphics Port
(AGP), a modification of the PCI bus designed specifically to facilitate the use of streaming video
and high-performance graphics.
AGP is a high-performance interconnect between the core-logic chipset and the graphics
controller for enhanced graphics performance for 3D applications. AGP relieves the graphics
bottleneck by adding a dedicated high-speed interface directly between the chipset and the graphics
controller as shown below.
Fig 4: dedicated high-speed interface directly between the chipset and the graphics controller
Segments of system memory can be dynamically reserved by the OS for use by the graphics
controller. This memory is termed AGP memory or non-local video memory. The net result is that
the graphics controller is required to keep fewer texture maps in local memory.
AGP has 32 lines for multiplexed address and data. There are an additional 8 lines for
sideband addressing. Local video memory can be expensive and it cannot be used for other
purposes by the OS when unneeded by the graphics of the running applications. The graphics
controller needs fast access to local video memory for screen refreshes and various pixel elements
including Z-buffers, double buffering, overlay planes, and textures.

For these reasons, programmers can always expect to have more texture memory available
via AGP system memory. Keeping textures out of the frame buffer allows larger screen resolution,
or permits Z-buffering for a given large screen size. As the need for more graphics intensive
applications continues to scale upward, the amount of textures stored in system memory will
increase. AGP delivers these textures from system memory to the graphics controller at speeds
sufficient to make system memory usable as a secondary texture store.
AGP MEMORY ALLOCATION
During AGP memory initialization, the OS allocates 4K byte pages of AGP memory in
main (physical) memory. These pages are usually discontiguous. However, the graphics controller
needs contiguous memory. A translation mechanism called the GART (Graphics Address
Remapping Table), makes discontiguous memory appear as contiguous memory by translating
virtual addresses into physical addresses in main memory through a remapping table.
A block of contiguous memory space, called the Aperture is allocated above the top of
memory. The graphics card accesses the Aperture as if it were main memory. The GART is then
able to remap these virtual addresses to physical addresses in main memory. These virtual addresses
are used to access main memory, the local frame buffer, and AGP memory.
Fig 5: Memory allocation in AGP
AGP TRANSFERS
AGP provides two modes for the graphics controller to directly access texture maps in
system memory: pipelining and sideband addressing. Using Pipe mode, AGP overlaps the
memory or bus access times for a request ("n") with the issuing of following requests

("n+1"..."n+2"... etc.). In the PCI bus, request "n+1" does not begin until the data transfer of request
"n" finishes.
With sideband addressing (SBA), AGP uses 8 extra "sideband" address lines which allow
the graphics controller to issue new addresses and requests simultaneously while data continues to
move from previous requests on the main 32 data/address lines. Using SBA mode improves
efficiency and reduces latencies.
AGP SPECIFICATIONS
The current PCI bus supports a data transfer rate up to 132 MB/s, while AGP (at 66MHz)
supports up to 533 MB/s! AGP attains this high transfer rate due to it's ability to transfer data on
both the rising and falling edges of the 66MHz clock
Mode
Clock
rate(approximate)
Transfer rate
1x 66 MHz 266
2x 133 MHz 533
4x 266 MHZ 1066
8x 533 MHZ 2133
The AGP slot typically provides performance which is 4 to 8 times faster than the pci slots
inside your computer.

HOW IS 3D ACCELERATION DONE??????
There are different steps involved in creating a complete 3D scene. It is done by different
parts of the GPU, each of which are assigned a particular job. During 3D rendering, there are
different types of data the travel across the bus. The two most common types are texture and
geometry data. The geometry data is the "infrastructure" that the rendered scene is built on. This is
made up of polygons (usually triangles) that are represented by vertices, the end-points that define
each polygon. Texture data provides much of the detail in a scene, and textures can be used to
simulate more complex geometry, add lighting, and give an object a simulated surface.
Many new graphics chips now have accelerated Transform and Lighting (T&L) unit,
which takes a 3D scene's geometry and transforms it into different coordinate spaces. It also
performs lighting calculations, again relieving the CPU from these math-intensive tasks.
Following the T&L unit on the chip is the triangle setup engine. It takes a scene's transformed
geometry and prepares it for the next stages of rendering by converting the scene into a form that
the pixel engine can then process. The pixelengine applies assigned texture values to each pixel.
This gives each pixel the correct color value so that it appears to have surface texture and does not
look like a flat, smooth object. After a pixel has been rendered it must be checked to see whether it
is visible by checking the depth value, or Z value.
A Z check unit performs this process by reading from the Z-buffer to see if there are any
other pixels rendered to the same location where the new pixel will be rendered. If another pixel
is at that location, it compares the Z value of the existing pixel to that of the new pixel. If the new
pixel is closer to the view camera, it gets written to the frame buffer. If it's not, it gets discarded.
After the complete scene is drawn into the frame buffer the RAMDAC converts this digital data
into analog that can be given to the monitor for display.

COMPONENTS OF GPU:
There are several components on a typical graphics card:
Fig3: AGP/PCI interface
Graphics Processor: The graphics processor is the brains of the card, and is typically one of three
configurations.
Graphics co-processor: A card with this type of processor can handle all of the graphics chores
without any assistance from the computer's CPU. Graphics coprocessors are typically found on
high-end video cards.
Graphics accelerator: In this configuration, the chip on the graphics card renders graphics based
on commands from the computer's CPU. This is the most common configuration used today.
Frame buffer: This chip simply controls the memory on the card and sends information to the
digital-to-analog converter (DAC) . It does no processing of the image data and is rarely used
anymore.
Memory: The type of RAM used on graphics cards varies widely, but the most popular types use
a dual-ported configuration. Dual-ported cards can write to one section of memory while it is
reading from another section, decreasing the time it takes to refresh an image.
Graphics BIOS: Graphics cards have a small ROM chip containing basic information that tells
the other components of the card how to function in relation to each other. The BIOS also
performs diagnostic tests on the card's memory and input/ output (I/O) to ensure that everything is
functioning correctly.

Digital-to-Analog Converter (DAC): AC on a graphics card is commonly known as a RAMDAC
because it takes the data it converts directly from the card's memory. RAMDAC speed greatly
affects the image you see on the monitor. This is because the refresh rate of the image depends on
how quickly the analog information gets to the monitor.
Display Connector: Graphics cards use standard connectors. Most cards use the 15-pin connector
that was introduced with Video Graphics Array (VGA).
Computer (Bus) Connector: This is usually Accelerated Graphics Port (AGP). This port enables
the video card to directly access system memory. Direct memory access helps to make the peak
bandwidth four times higher than the Peripheral Component Interconnect (PCI) bus adapter card
slots. This allows the central processor to do other tasks while the graphics chip on the video card
accesses system memory.
PERFORMANCE FACTORS OF GPU
There are many factors that affect the performance of a GPU. Some of the factors that are directly
visible to a user are given below.
Fill Rate:
It is defined as the number of pixels or texels (textured pixels) rendered per second by the
GPU on to the memory . It shows the true power of the GPU. Modern GPUs have fill rates as high
as 3.2 billion pixels. The fill rate of a GPU can be increased by increasing the clock given to it.
Memory Bandwidth:
It is the data transfer speed between the graphics chip and its local frame buffer. More
bandwidth usually gives better performance with the image to be rendered is of high quality and at
very high resolution.
Memory Management:
The performance of the GPU also depends on how efficiently the memory is managed,
because memory bandwidth may become the only bottle neck if not managed properly.
Hidden Surface removal:
A term to describe the reducing of overdraws when rendering a scene by not rendering
surfaces that are not visible. This helps a lot in increasing the performance of GPU, by preventing
overdraw so that the fill rate of the GPU can be utilized to the maximum.

TYPES OF GPUS…
There are mainly two types of GPUs, they are
1. Those that can handle all of the graphics processes without any assistance from the computer's
CPU. They are typically found on high-end workstations. These are mainly used for Digital Content
Creation like 3D animation as it supports a lot of 3D functions.
Some of them are……
o 1.Quadro series from NVIDIA.
o 2.Wildcat series from 3D Labs.
o 3.FireGL series from ATI.
2. The chip on the graphics card renders graphics based on commands from the computer's CPU.
This is the most common configuration used today. These are used for 3D gaming and such smaller
tasks. They are found on normal desktop PCs and are better known as 3D accelerators. These
support less functions and hence are cheaper.
Some of them are…….
o Geforce series from NVIDIA.
o Radeon series from ATI Technology ltd.
o Kyro series from STM Microelectronics
Today’s GPU can do what was hoped for and beyond. In the last year a giant leap have been
made in the GPU technology. The maximum amount of RAM that can be found on a graphics card
has jumped from 16MB to a whopping 128MB. The premier company in GPU manufacturing
ATI,who has held the position past couple of years has given way to nVidia , whose new ground
breaking technology is leaving ATI to follow.

GPU COMPUTING
Now that we have seen the hardware architecture of the GPU, we turn to its programming
model.
A. The GPU Programming Model
The programmable units of the GPU follow a single program multiple-data (SPMD)
programming model. For efficiency, the GPU processes many elements (vertices or fragments) in
parallel using the same program. Each element is independent from the other elements, and in the
base programming model, elements cannot communicate with each other. All GPU programs must
be structured in this way: many parallel elements each processed in parallel by a single program.
Each element can operate on 32-bit integer or floating point data with a reasonably complete
general-purpose instruction set. Elements can read data from a shared global memory (a ―gather‖
operation) and, with the newest GPUs, also write back to arbitrary locations in shared global
memory (―scatter‖).
This programming model is well suited to straight-line programs, as many elements can be
processed in lockstep running the exact same code. Code written in this manner is single
instruction, multiple data (SIMD). As shader programs have become more complex, programmers
prefer to allow different elements to take different paths through the same program, leading to the
more general SPMD model. How is this supported on the GPU?
One of the benefits of the GPU is its large fraction of resources devoted to computation.
Allowing a different execution path for each element requires a substantial amount of control
hardware. Instead, today’s GPUs support arbitrary control flow per thread but impose a penalty for
incoherent branching. GPU vendors have largely adopted this approach. Elements are grouped
together into blocks, and blocks are processed in parallel. If elements branch in different directions
within a block, the hardware computes both sides of the branch for all elements in the block. The
size of the block is known as the ―branch granularity‖ and has been decreasing with recent GPU
generations - today, it is on the order of 16 elements.
In writing GPU programs, then, branches are permitted but not free. Programmers who structure
their code such that blocks have coherent branches will make the best use of the hardware.
B. General-Purpose Computing on the GPU
Mapping general-purpose computation onto the GPU uses the graphics hardware in much
the same way as any standard graphics application. Because of this similarity, it is both easier and
more difficult to explain the process. On one hand, the actual operations are the same and are easy

to follow; on the other hand, the terminology is different between graphics and general-purpose use.
We begin by describing GPU programming using graphics terminology, then show how the same
steps are used in a general-purpose way to author GPGPU applications, and finally use the same
steps to show the more simple and direct way that today’s GPU computing applications are written.
1) Programming a GPU for Graphics: We begin with the same GPU pipeline, concentrating on
the programmable aspects of this pipeline.
1) The programmer specifies geometry that covers a region on the screen. The rasterizer
generates a fragment at each pixel location covered by that geometry.
2) Each fragment is shaded by the fragment program.
3) The fragment program computes the value of the fragment by a combination of math
operations and global memory reads from a global ―texture‖ memory.
4) The resulting image can then be used as texture on future passes through the graphics
pipeline.
2) Programming a GPU for General-Purpose Programs:
One of the historical difficulties in programming GPGPU applications has been that despite
their general-purpose tasks’ having nothing to do with graphics, the applications still had to be
programmed using graphics APIs. In addition, the program had to be structured in terms of the
graphics pipeline, with the programmable units only accessible as an intermediate step in that
pipeline, when the programmer would almost certainly prefer to access the programmable units
directly. The programming environments we describe in detail are solving this difficulty by
providing a more natural, direct, non-graphics interface to the hardware and, specifically, the
programmable units. Today, GPU computing applications are structured in the following way.
1) The programmer directly defines the computation domain of interest as a structured grid
of threads.
2) An SPMD general-purpose program computes the value of each thread.
3) The value for each thread is computed by a combination of math operations .
4) The resulting buffer in global memory can then be used as an input in future computation.
This programming model is a powerful one for several reasons. First, it allows the hardware
to fully exploit the application’s data parallelism by explicitly specifying that parallelism in the
program. Next, it strikes a careful balance between generality (a fully programmable routine at each
element) and restrictions to ensure good performance. Finally, its direct access to the programmable
units eliminates much of the complexity faced by previous GPGPU programmers in co-opting the
graphics interface for general-purpose programming. As a result, programs are more often

expressed in a familiar programming language (such as NVIDIA’s C-like syntax in their CUDA
programming environment) and are simpler and easier to build and debug (and are becoming more
so as the programming tools mature). The result is a programming model that allows its users to
take full advantage of the GPU’s powerful hardware but also permits an increasingly high-level
programming model that enables productive authoring of complex applications.
TECHNIQUES AND APPLICATIONS
We now survey some important computational primitives, algorithms, and applications for GPU
computing. We first highlight four data-parallel operations central to GPU computing: performing
scatter/gather memory operations, mapping a function onto many elements in parallel, reducing a
collection of elements to a single element or value, and computing prefix reductions of an array in
parallel. A.Computational Primitives
The data-parallel architecture of GPUs requires programming idioms long familiar to
parallel supercomputer users but often new to today’s programmers reared on sequential machines
or loosely coupled clusters. We briefly discuss four important idioms: scatter/gather, map, reduce,
and scan. We describe these computational primitives in the context of both ―old‖ (i.e., graphics-
based) and ―new‖ (direct compute) GPU computing to emphasize the simplicity and flexibility of
the direct-compute approach.
Scatter/gather: write to or read from a computed location in memory. Graphics-based GPU
computing allows efficient gather using the texture subsystem, storing data as images (textures) and
addressing data by computing corresponding image coordinates and performing a texture fetch.
However, texture limitations make this unwieldy: texture size restrictions require wrapping arrays
containing more than 4096 elements into multiple rows of a two-dimensional (2-D) texture, adding
extra addressing math, and a single texture fetch can only retrieve four 32-bit floating point values,
limiting per-element storage. Scatter in graphics-based GPU computing is difficult and requires
rebinding data for processing as vertices, either using vertex texture fetch or render-to-vertex-
buffer. By contrast, direct-compute layers allow unlimited reads and writes to arbitrary locations in
memory. NVIDIA’s
CUDA allows the user to access memory using standard C constructs (arrays, pointers, variables).
AMD’s CTM is nearly as flexible but uses 2-D addressing.
Map: apply an operation to every element in a collection. Typically expressed as a for loop
in a sequential program (e.g., a thread on a single CPU core), a parallel implementation can reduce
the time required by applying the operation to many elements in parallel. Graphics-based GPU
computing performs map as a fragment program to be invoked on a collection of pixels (one pixel

for each element). Each pixel’s fragment program fetches the element data from a texture at a
location corresponding to the pixel’s location in the rendered image, performs the operation, then
stores the result in the output pixel. Similarly, CTM and CUDA would typically launch a thread
program to perform the operation in many threads, with each thread loading an element, performing
the computation, and storing the result. Note that since loops are supported, each thread may also
loop over several elements.
Reduce: repeatedly apply a binary associative operation to reducing a collection of elements to a
single element or value. Examples include finding the sum (average, minimum, maximum,
variance, etc.) of a collection of values. A sequential implementation on a traditional CPU would
loop over an array, successively summing (for example) each element with a running sum of
elements seen so far. By contrast, a parallel reduce-sum implementation would repeatedly perform
sums in parallel on an ever-shrinking set of elements.1 Graphics-based GPU computing implements
reduce by rendering progressively smaller sets of pixels. In each rendering pass, a fragment
program reads multiple values from a texture (performing perhaps four or eight texture reads),
computes their sum, and writes that value to the output pixel in another texture (four or eight times
smaller), which is then bound as input to the same fragment shader and the process repeated until
the output consists of a single pixel that contains the result of the final reduction. CTM and CUDA
express this same process more directly, for example, by launching a set of threads each of which
reads two elements and writes their sum to a single element. Half the threads then repeat this
process, then half of the remaining threads, and so on until a single surviving thread writes the final
result to memory.
Scan: Sometimes known as parallel-prefix-sum, scan takes an array A of elements and returns an
array B of the same length in which each element B½i_ represents a reduction of the subarray A[1 .
. . i]. Scan is an extremely useful building block for data-parallel algorithms; Blelloch describes a
wide variety of potential applications of scan ranging from quicksort to sparse matrix operations.
Harris et al. demonstrate an efficient scan implementation using CUDA (Fig. 2); their results
illustrate the advantages of a direct-compute over graphics-based GPU computing. Their CUDA
implementation outperforms the CPU by a factor of up to 20 and OpenGL by a factor of up to
seven.

Scan performance on CPU, graphics-based GPU (using OpenGL), and direct-compute GPU (using CUDA).
Results obtained on aGeForce 8800 GTX GPU and Intel Core2-Duo Extreme 2.93 GHz CPU. (Figure adapted from
Harris et al.)
ALGORITHMS AND APPLICATIONS
Building largely on the above primitives, researchers have demonstrated many higher level
algorithms and applications that exploit the computational strengths of the GPU. We give only a
brief survey of GPU computing algorithms and their application domains here.
Sort: GPUs have come to excel
at sorting as the GPU computing community has rediscovered, adapted, and improved seminal
sorting algorithms, notably merge sort. This ―sorting network‖ algorithm is intrinsically parallel and
oblivious, meaning the same steps are executed regardless of input.
Differential equations: The earliest attempts to use GPUs for nongraphics computation focused on
solving large sets of differential equations. Particle tracing is a common GPU application for
ordinary differential equations, used heavily in scientific visualization (e.g., the scientific flow
exploration system by Kru¨ger et al.) and in visual effects for computer games. GPUs have been
heavily used to solve problems in partial differential equations (PDEs) such as the Navier–Stokes
equations for incompressible fluid flow. Particularly successful applications of GPU PDE solvers
include fluid dynamics (e.g., Bolz et al.) and level set equations for volume segmentation.
Linear algebra: Sparse and dense linear algebra routines are the core building blocks for a huge
class of numeric algorithms, including many PDE solvers mentioned above. Applications include
simulation of physical effects such as fluids, heat, and radiation, optical effects such as depth of

field, and so on. The use of direct-compute layers such as CUDA and CTM both simplifies and
improves the performance of linear algebra on the GPU. For example, NVIDIA provides CuBLAS,
a dense linear algebra package implemented in CUDA and following the popular BLAS
conventions. Sparse linear algebraic algorithms, which are more varied and complicated than dense
codes, are an open and active area of research; researchers expect sparse codes to realize benefits
similar to or greater than those of the new GPU computing layers.
Recurring Themes
Several recurring themes emerge throughout the algorithms and applications explored in GPU
computing to date. Examining these themes allows us to characterize what GPUs do well.
Successful GPU computing applications do the following.
Emphasize parallelism: GPUs are fundamentally parallel machines, and their efficient utilization
depends on a high degree of parallelism in the workload. For example, NVIDIA’s CUDA prefers to
run thousands of threads at one time to maximize opportunities to mask memory latency using
multithreading. Emphasizing parallelism requires choosing algorithms that divide the computational
domain into as many independent pieces as possible. To maximize the number of simultaneous
running threads, GPU programmers should also seek to minimize thread usage of shared resources
(such as local registers and CUDA shared memory) and should use synchronization between
threads sparingly.
Minimize SIMD divergence: As Section III discusses, GPUs provide an SPMD programming
model: multiple threads run the same program but access different data and thus may diverge in
their execution. At some granularity, however, GPUs perform SIMD execution on batches of
threads (such as CUDA ―warps‖). If threads within a batch diverge, the entire batch will execute
both code paths until the threads reconverge. High-performance GPU computing thus requires
structuring code to minimize divergence within batches.
Maximize arithmetic intensity: In today’s computing landscape, actual computation is relatively
cheap but bandwidth is precious. This is dramatically true for GPUs with their abundant floating-
point horsepower. To obtain maximum utilization of that power requires structuring the algorithm
to maximize the arithmetic intensity or number of numeric computations performed per memory
transaction. Coherent data accesses by individual threads help, since these can be coalesced into
fewer total memory transactions. Use of CUDA shared memory on NVIDIA GPUs also helps,

reducing overfetch (since threads can communicate) and enabling strategies for ―blocking‖ the
computation in this fast on-chip memory.
Exploit streaming bandwidth: Despite the importance of arithmetic intensity, it is worth noting
that GPUs do have very high peak bandwidth to their onboard memory, on the order of 10x the
CPU-memory bandwidths on typical PC platforms. This is why GPUs can outperform CPUs at
tasks such as sort, which have a low computation/bandwidth ratio. To achieve high performance on
such applications requires streaming memory access patterns in which threads read from and write
to large coherent blocks (maximizing bandwidth per transaction) located in separate regions of
memory (avoiding data hazards).
Experience has shown that when algorithms and applications can follow these design principles for
GPU computing- such as the PDE solvers, linear algebra packages, and database systems
referenced above, and the game physics and molecular dynamics applications examined in detail
next-they can achieve 10–100x speedups over even mature, optimized CPU codes.
TOP TEN PROBLEMS IN GPGPU
The killer applications: Perhaps the most important question facing the community is finding an
application that will drive the purchase of millions of GPUs. The number of GPUs sold today for
computation is minuscule compared to the overall GPU market of half a billion units per year; a
mass-market application that spurred millions of GPU sales, enabling a task that was not previously
possible, would mark a major milestone in GPU computing.
Programming models and tools: With the new programming systems in Section IV, the state of
the art over the past year has substantially improved. Much of the difficulty of early GPGPU
programming has dissipated with the new capabilities of these programming systems, though
support for debugging and profiling on the hardware is still primitive. One concern going forward,
however, is the proprietary nature of the tools. Standard languages, tools, and APIs that work across
GPUs from multiple vendors would advance the field, but it is as yet unclear whether those
solutions will come from academia, the GPU vendors, or third-party software companies, large or
small.
GPU in tomorrow’s computer?: The fate of coprocessors in commodity computers (such as
floating-point coprocessors) has been to move into the chipset or onto the microprocessor. The
GPU has resisted that trend with continued improvements in performance and functionality and by

becoming an increasingly important part of today’s computing environments unlike with CPUs, the
demand for continued GPU performance increases has been consistently large. However,
economics and potential performance are motivating the migration of powerful GPU functionality
onto the chipset or onto the processor die itself. While it is fairly clear that graphics capability is a
vital part of future computing systems, it is wholly unclear which part of a future computer will
provide that capability, or even if an increasingly important GPU with parallel computing
capabilities could absorb a CPU.
Relationship to other parallel hardware and software: GPUs are not the only innovative parallel
architecture in the field. The Cell Broadband Engine, multicore CPUs, stream processors, and
others are all exploiting parallelism in different ways. The future health of GPU computing would
benefit if programs written for GPUs run efficiently on other hardware and programs written for
other architectures can be run on GPUs. The landscape of parallel computing will continue to
feature many kinds of hardware, and it is important that GPUs be able to benefit from advances in
parallel computing that are targeted toward a broad range of hardware.
Managing rapid change: Practitioners of GPU computing know that the interface to the GPU
changes markedly from generation to generation. This is a very different model than CPUs, which
typically maintain API consistency over many years. As a consequence, code written for one
generation of GPUs is often no longer optimal or even useful in future generations. However, the
lack of backward compatibility is an important key in the ability of GPU vendors to innovate in
new GPU generations without bearing the burden of previous decisions. The introduction of the
new general-purpose programming environments from the vendors that we described in Section IV
may finally mark the beginning of the end of this churn. Historically, CPU programmers have
generally been able to write code that would continue to run faster on new hardware (though the
current focus on multiple cores may arrest this trend; like GPUs, CPU codes will likely need to be
written as parallel programs to continue performance increases). For GPU programmers, however,
the lack of backward compatibility and the lack of roadmaps going forward make writing
maintainable code for the long term a difficult task.
Performance evaluation and cliffs: The science of program optimization for CPUs is reasonably
well understood -profilers and optimizing compilers are effective in allowing programmers to make
the most of their hardware. Tools on GPUs are much more primitive -making code run fast on the
GPU remains something of a black art. One of the most difficult ordeals for the GPU programmer is
the performance cliff, where small changes to the code, or the use of one feature rather than

another, make large and surprising differences in performance. The challenge going forward is for
vendors and users to build tools that provide better visibility into the hardware and better feedback
to the programmer about performance characteristics.
Philosophy of faults and lack of precision: The hardware graphics pipeline features many
architectural decisions that favored performance over correctness. For output to a display, these
tradeoffs were quite sensible; the difference between perfectly ―correct‖ output and the actual
output is likely indistinguishable. The most notable tradeoff is the precision of 32-bit floating-point
values in the graphics pipeline. Though the precision has improved, it is still not IEEE compliant,
and features such as denorms are not supported. As this hardware is used for general-purpose
computation, noncompliance with standards becomes much more important, and dealing with
faults-such as exceptions from division by zero, which are not currently supported in GPUs-also
becomes an issue.
Broader toolbox for computation and data structures: On CPUs, any given application is likely
to have only a small fraction of its code written by its author. Most of the code comes from
libraries, and the application developer concentrates on high-level coding, relying on established
APIs such as STL or Boost or BLAS to provide lower level functionality. We term this a
―horizontal‖ model of software development, as the program developer generally only writes one
layer of a complex program. In contrast, program development for general-purpose computing on
today’s GPUs is largely ―vertical‖ -the GPU programmer writes nearly all the code that goes into
his program, from the lowest level to the highest. Libraries of fundamental data structures and
algorithms that would be applicable to a wide range of GPU computing applications (such as
NVIDIA’s FFT and dense matrix algebra libraries) are only just today being developed but are vital
for the growth of GPU computing in the future.

CONCLUSION
With the rising importance of GPU computing, GPU hardware and software are changing at
a remarkable pace. In the upcoming years, we expect to see several changes to allow more
flexibility and performance from future GPU computing systems:
• At Supercomputing 2006, both AMD and NVIDIA announced future support for double-precision
floating-point hardware by the end of 2007. The addition of double-precision support removes one
of the major obstacles for the adoption of the GPU in many scientific computing applications.
• Another upcoming trend is a higher bandwidth path between CPU and GPU. The PCI Express bus
between CPU and GPU is a bottleneck in many applications, so future support for PCI Express 2,
Hyper Transport, or other high-bandwidth connections is a welcome trend. Sony’s PlayStation 3
and Microsoft’s Xbox 360 both feature CPU– GPU connections with substantially greater
bandwidth than PCI Express, and this additional bandwidth has been welcomed by developers. We
expect the highest CPU–GPU bandwidth will be delivered by future systems, such as AMD’s
Fusion, that place both the CPU and GPU on the same die. Fusion is initially targeted at portable,
not high performance, systems, but the lessons learned from developing this hardware and its
heterogeneous APIs will surely be applicable to future single-chip systems built for performance.
One open question is the fate of the GPU’s dedicated high-bandwidth memory system in a
computer with a more tightly coupled CPU and GPU.
• Pharr notes that while individual stages of the graphics pipeline are programmable, the structure of
the pipeline as a whole is not, and proposes future architectures that support not just programmable
shading but also a programmable pipeline. Such flexibility would lead to not only a greater variety
of viable rendering approaches but also more flexible general-purpose processing.
• Systems such as NVIDIA’s 4-GPU Quadroplex are well suited for placing multiple coarse-grained
GPUs in a graphics system. On the GPU computing side, however, fine-grained cooperation
between GPUs is still an unsolved problem. Future API support such as Microsoft’s Windows
Display Driver Model 2.1 will help multiple GPUs to collaborate on complex tasks, just as clusters
of CPUs do today.

GPU Seminar Report Covers Graphics Processing

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (12)

Similar to GPU Seminar Report Covers Graphics Processing

Similar to GPU Seminar Report Covers Graphics Processing (20)

Recently uploaded

Recently uploaded (20)

GPU Seminar Report Covers Graphics Processing