PROJECT REPORT
(PROJECT SEMESTER TRAINING)
Object Oriented and Aspect Oriented Programming with Cuda
Submitted by
Ankita Dewan
Roll No. 101053004
Under the Guidance of
Dr. Ashutosh Mishra Dr. Balwinder Sodhi
Assistant Professor, Dept of CSE, Assistant Professor, Dept of CSE,
Thapar University, Patiala. IIT Ropar.
Department of Computer Science and Engineering
THAPAR UNIVERSITY, PATIALA
Jan-May 2014
DECLARATION
I hereby declare that the project work entitled “Object Oriented and Aspect Oriented Programming
with Cuda” is an authentic record of my own work carried out at IIT Ropar as requirements of
project semester term for the award of degree of B.E. (Computer Science & Engineering), Thapar
University, Patiala, under the guidance of Dr. Ashutosh Mishra and Dr. Balwinder Sodhi, during
5th Jan to 28th May, 2014.
Ankita Dewan
101053004
Date: 30th May, 2014
Certified that the above statement made by the student is correct to the best of our knowledge and
belief.
Dr. Ashutosh Mishra Dr. Balwinder Sodhi
Assistant Professor, Dept of CSE, Assistant Professor, Dept of CSE,
Thapar University, Patiala. IIT Ropar.
Acknowledgment
I take this opportunity to express my heartfelt gratitude to my mentor Dr. Balwinder Sodhi for his
constant support. His priceless suggestions, ideas and expertise helped me better the quality of my
project. He has been extremely supportive throughout the course of my internship for which I
express my deep and sincere gratitude.
I appreciate all the help and support given to me by my internship colleague Anusha Vangala
from Siddhartha Institute of Technology Vijayavada, Andhra Pradesh and all others from
Computer Science department who helped me avail the numerous facilities.
My acknowledgement would be incomplete without thanking my parents for their constant love
and support and being there by my side through thick and thin.
Abstract
One of the primary aims of computer science is simplification and facilitation. There is a constant
drive to introduce abstraction and/or virtualization so that the primitive building blocks of any
technology are preserved in a constructive and sophisticated manner. From there on, it becomes easier
to add or modify features of the technology.
Performance is an indispensable requirement. In the context of processing and computations,
parallel processing proves to be faster. Technologies like NVIDIA CUDA enable the user to send
C/C++/Fortran code (depending on the technology) straight to GPU with no assembly language
required.
So far, papers and applications, mostly from academia and institutes like CERN, have experimented
with this technology and described how the performance of certain algorithms improves when they are
implemented on CUDA. GPUs have been targeted at games, but here again CUDA has not found
commercial use. Many programmers are of the opinion that CUDA is not “elegant”: writing a
"hello world" program in CUDA can be a day of struggle just to get things working. And for
someone who has little knowledge of these techniques, or does not want to get into their details,
simplification, facilitation and convenience must come into the picture.
Our project aims to simplify the manner in which CUDA is presently used, by combining it with other
techniques that can complement it without taking away its very essence.
Institute Profile
Indian Institute of Technology Ropar, established in 2008, is one of the eight new IITs set up by
the Ministry of Human Resource Development (MHRD), Government of India, to expand the
reach and enhance the quality of technical education in the country.
The institute is committed to providing state-of-the-art technical education in a variety of fields
and also to facilitating the transmission of knowledge in keeping with the latest developments.
At present, the institute offers Bachelor of Technology (B.Tech.) programs in Computer Science
and Engineering, Electrical Engineering, and Mechanical Engineering.
The institute is keen to establish a Central Research Facility. The PhD program was started so that the
research environment is further augmented, expanded, and made even more vibrant.
My internship under the Department of Computer Science and Engineering helped me appreciate
the value of hands-on training and design. I got to work under excellent facilities.
Nomenclature
GPU……………………………………….……..…………...................... Graphics processing unit
GPGPU………………………………………………….. General purpose graphics processing unit
CUDA……………….............................................................Compute Unified Device Architecture
JCuda………………………………………………………….………………………....Java CUDA
AOP……………………..……………………………………............Aspect-oriented programming
AJC……………………………………………………………………….………..AspectJ Compiler
SPMD…………………………………………………………………Single program, multiple data
ISA………………. …………………………………………………….Instruction Set Architecture
Table of Contents
Chapter 1 Introduction
1.1. Motivation………………………………………………………………………..1
1.2. Problem Statement……………………………………………………………….1
1.3. Work Plan…………………………….................................................................2
Chapter 2 Background
2.1 GPU and CUDA…………………………………………………………………3
2.2 JCuda…………………………………………………………………………….5
2.3 Aspect Oriented Programming and AspectJ……………………………………..6
Chapter 3 Body of Work
3.1 Design……………………………………………………………………………7
3.2 Implementation…………………………………………………………………..10
3.3 Procedure………………………………………………………………………..11
Chapter 4 Related Works
4.1 Alternate Technologies……………………………………………………………16
4.2 Past Projects……………………………………………….…………………….17
Chapter 5 Observation and Findings…………………………………..…………………………18
Chapter 6 Limitations …………………………………………………..………………………..19
Chapter 7 Future Work………………………………………………….……………………….20
Chapter 8 Conclusion……………………………………………………………………………..22
References……………………………………………………………………………..23
Table of figures
Fig 1 CPU is composed of only a few cores that can handle fewer threads at a time.
GPU is composed of many cores that handle thousands of threads simultaneously…….3
Fig 2 CUDA stages…………………………………………………………………………….5
Fig 3 Activity Diagram for computing heterogeneous programs………………………………7
Fig 4 Activity Diagram for CUDA program…………………………………………………...8
Fig 5 Entity Relationship Diagram for CPU, CUDA, JCuda and GPU……………………….9
Fig 6 CUDA Sample Screenshot………………………………………………………….…...11
Fig 7 JCuda Sample Screenshot_1…………………………………………………………….12
Fig 8 JCuda Sample Screenshot_2…………………………………………………………….13
Fig 9 AspectJ Sample Screenshot……………………………………………………………..14
Fig 10 JCuda and Aspectj Sample Screenshot………………………………………………….15
Chapter 1
Introduction
1.1 Motivation
The likelihood of shifting from traditional CPUs to parallel hybrid platforms, such as multi-core
CPUs accelerated with heterogeneous GPU co-processing systems, is as high as it was when the
hardware field switched over to multi-threading and multi-core CPUs.
Although this is largely a matter of hardware functionality, it does impact the software entities and thus
the programmers. There is a need to modify existing programs such that they can be properly
parallelized to reap the benefits of advanced processing architectures.
Nvidia invented CUDA (Compute Unified Device Architecture) as a parallel computing platform
and programming model to increase computing performance by harnessing the power of the
graphics processing unit (GPU). So far so good.
The next desirable move:-
Use technologies which, when combined with CUDA, can make it easier to use and cater to the needs
of a larger domain of programmers/users. In technical terms, the idea is to abstract out the details of
GPU computations.
1.2 Problem Statement
CUDA alone leads to tangled source code.
Our source code is a combination of:-
 Code for the core kernel computation for device
 Code for kernel management by the host; additionally it contains the code for data transfers
between memory spaces, and various optimizations.
In our project, we work on a programming system based on the principles of Object-Oriented and
Aspect-Oriented Programming. The motive is to un-clutter the code to improve programmability.
1.3 Work Plan
CUDA source code is written entirely in C. That is, the host code as well as the device code are written
in C.
Our approach:-
The Object-Oriented language, Java bindings (JCuda) in this case, is used to handle the host
computations.
The Aspect-Oriented language, AspectJ in this case, is used to encapsulate all other support
functions, such as parallelization granularity and memory access optimization.
The kernel code remains in C as the device code.
In the last stage, aspect compiler (ajc) is used to combine the core Object Oriented program with
aspects to generate parallelized programs.
Chapter 2
Background
2.1 GPU and CUDA
CPU and GPU are designed differently. While a CPU has latency-oriented cores, a GPU has
throughput-oriented cores. The CPU is essentially “the master” throughout the operation domain. It
has powerful ALUs with reduced operation latency and large caches that convert long-latency
memory accesses into short-latency cache accesses. It has a sophisticated control system. The GPU, on
the other hand, has small caches which boost memory throughput. It has energy-efficient ALUs
that are heavily pipelined for high throughput and are greater in number. It has a simpler control
system.
Fig 1
This field is essentially a part of parallel computing but is heterogeneous in nature as it deals with
serial parts and parallel parts.
CUDA achieves parallelism through SPMD. A thread is a virtualized processor. It follows the
instruction cycle (Fetch, Decode, and Execute). For faster execution of arrays of parallel threads, a
CUDA kernel is used such that all threads in a grid run the same kernel code.
Each thread has indexes to decide what data to work on and to make control decisions.
i = blockIdx.x * blockDim.x + threadIdx.x
A thread array is divided into multiple blocks which have access to shared memory.
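As a concrete illustration (a hedged sketch, not taken from the project sources; the kernel and parameter names are hypothetical), a vector-addition kernel computes its global index exactly as above and uses it to decide which element to work on:

// Illustrative CUDA kernel: each thread computes one element of C = A + B.
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)            // guard: the last block may contain surplus threads
        C[i] = A[i] + B[i];
}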
The two types of memories in CPU-GPU architecture are global memory and shared memory. The
device can read/write shared as well as global memory. The host can transfer data to/from global
memory. Also, the contents of global memory are visible to all the threads of the grid. Any thread can
read and write to any location of the global memory. Shared memory is separate for each block of
the grid. Any thread of a block can read and write to the shared memory of that block. A thread in
one block cannot access shared memory of another block. Shared memory is faster to access than
global memory.
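To make the distinction concrete, the following hedged sketch (all identifiers are illustrative) stages data from global memory into a per-block __shared__ buffer, synchronizes the block, and then lets each thread read an element that a neighbouring thread of the same block loaded:

#define BLOCK_SIZE 256

// Illustrative kernel: global memory -> shared memory -> neighbouring read.
__global__ void shiftKernel(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK_SIZE];    // visible only to this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];        // each thread stages one element
    __syncthreads();                      // wait until the whole block has loaded

    if (i < n && threadIdx.x > 0)
        out[i] = tile[threadIdx.x - 1];   // read a value loaded by a neighbour
}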
A typical CUDA program includes the following steps (a complete hedged sketch combining them appears after the function declarations below):-
1. Device memory allocation for input and output entities.
Using cudaMalloc(), which allocates an object in the device global memory and requires two
parameters:
- Address of a pointer to the allocated object
- Size of allocated object in bytes.
2. Copying from host memory to device memory
Using cudaMemcpy() that requires four parameters:-
- Pointer to destination
- Pointer to source
- Number of bytes copied
- Type/Direction of transfer (cudaMemcpyHostToDevice)
3. Launching the Kernel code from Host.
KernelName<<<dimGrid, dimBlock>>>(m, n, k, d_A, d_B, d_C);
4. Copying back output entity from the device memory to host memory
Using cudaMemcpy() with direction of transfer as cudaMemcpyDeviceToHost
5. Free device memory.
- Using cudaFree() that frees object from device global memory.
CUDA Function Declarations:-
• __global__ defines a kernel function that must return void and is callable from the host
• __device__ defines a device function that can have any return type; callable from the device itself.
• __host__ defines a host function with any return type; callable from the host itself.
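The hedged sketch below (the file name, sizes and variable names are assumptions, not the project's actual code) strings the five steps and the __global__ qualifier together around the vector-addition kernel sketched in the previous section:

// vecAdd.cu -- illustrative host/device flow following the steps above.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void vecAddKernel(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main(void)
{
    const int n = 1024;
    size_t size = n * sizeof(float);
    float h_A[1024], h_B[1024], h_C[1024];
    for (int i = 0; i < n; ++i) { h_A[i] = i; h_B[i] = 2.0f * i; }

    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);                      // 1. device allocation
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);   // 2. host -> device
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    int block = 256;
    int grid = (n + block - 1) / block;
    vecAddKernel<<<grid, block>>>(d_A, d_B, d_C, n);      // 3. kernel launch

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);   // 4. device -> host

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);          // 5. free device memory
    printf("C[10] = %f\n", h_C[10]);
    return 0;
}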
A few on-going CUDA projects are CudaRasterization, Monte Carlo with CUDA, MD5 Hash
Crack in CUDA, Cloth Simulation, Image Tracking, etc.
Fig 2
2.2 JCuda
Java is one of the most commercially used programming languages and is preferred by
programmers of all origins. It is class-based and object-oriented. It provides programmers and
developers the option to "write once, run anywhere" (WORA).
Thus, it became a favorable option to bind Java to a library which acts as an application
programming interface (API) and also provides basic code to use that library for CUDA. JCuda
provides all essential Java bindings for the CUDA runtime and driver API. It acts as the base for
all other libraries like JCublas, JCufft etc.
It lets the host interact with a CUDA device. It offers methods that cover the basic steps of
CUDA in a sequential manner. These methods include device management, event management,
memory allocation on the device and copying memory between the device and the host system.
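A hedged Java sketch of this host-side sequence with JCuda's runtime bindings is shown below; the class name, array sizes and the placeholder for the kernel launch are assumptions:

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.runtime.JCuda;
import jcuda.runtime.cudaMemcpyKind;

public class JCudaHostSketch
{
    public static void main(String[] args)
    {
        int n = 1024;
        float hostInput[] = new float[n];
        float hostOutput[] = new float[n];
        for (int i = 0; i < n; i++) hostInput[i] = i;

        // Allocate device memory and copy the input to the device.
        Pointer deviceData = new Pointer();
        JCuda.cudaMalloc(deviceData, n * Sizeof.FLOAT);
        JCuda.cudaMemcpy(deviceData, Pointer.to(hostInput),
            n * Sizeof.FLOAT, cudaMemcpyKind.cudaMemcpyHostToDevice);

        // ... a kernel would be launched here, e.g. via the driver API ...

        // Copy the result back and free the device memory.
        JCuda.cudaMemcpy(Pointer.to(hostOutput), deviceData,
            n * Sizeof.FLOAT, cudaMemcpyKind.cudaMemcpyDeviceToHost);
        JCuda.cudaFree(deviceData);

        System.out.println("hostOutput[10] = " + hostOutput[10]);
    }
}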
2.3 Aspect Oriented Programming and AspectJ
Aspect-oriented programming aims to modularize crosscutting concerns¹ just as object-oriented
programming modularizes common concerns. It aims to deal with two issues: scattered code
and tangled code.
The power of OOP diminishes beyond encapsulation, abstraction, polymorphism, etc. Here AOP
addresses the problems by using more manageable modules, called aspects. Also, AOP
does not replace previous programming paradigms; rather, it complements the object-oriented
paradigm instead of replacing it.
AspectJ is an implementation of AOP for Java. It adds the following concepts to Java (a short illustrative aspect is sketched at the end of this section).
 Join Point: A well−defined point in the program flow.
 Point cut: Construct to select certain join points and values at those points.
- call: identifies any call to the methods defined by an object.
- cflow: identifies join points based on whether they occur in the dynamic context of another pointcut.
- execution: when a particular method body executes.
- target: when the target object is of some parameter type.
- this: when the currently executing object (i.e. this) is of some parameter type.
- within: when the executing code belongs to the class.
 Advice: Defines code that is executed when a point cut is reached; dynamic parts of AspectJ.
- Before: Runs when a join point is reached and before the computation proceeds, i.e. it runs
when computation reaches the method call and before the actual method starts running.
- After: Runs after the computation 'under the join point' finishes, i.e. after the method body has
run, and just before control is returned to the caller.
- Around: Runs when the join point is reached, and has explicit control over whether the
computation under the join point is allowed to run at all.
 Introduction: Modifies a program's static structure, namely, the members of its classes and the
relationship between classes.
 Aspect: AspectJ's unit of modularity for crosscutting concerns; defined in terms of point cuts,
advice and introduction.
¹ Logging, authorization, synchronization, error handling and transaction management exemplify crosscutting
concerns because such strategies necessarily affect more than one part of the system. Logging, for instance, crosscuts
all logged classes and methods.
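As mentioned above, a small illustrative aspect is sketched here; the class VectorAdd and its method launchKernel are hypothetical names assumed to exist in the host program, not part of the project sources. The aspect traces every call to that method using a named pointcut with before and after advice:

// Tracing.aj -- illustrative code-style aspect (hypothetical target names).
public aspect Tracing
{
    // Pointcut: any call to VectorAdd.launchKernel(..), whatever the arguments.
    pointcut kernelLaunch() : call(* VectorAdd.launchKernel(..));

    // Runs just before the selected join points.
    before() : kernelLaunch()
    {
        System.out.println("About to launch the kernel: " + thisJoinPoint);
    }

    // Runs after the call has returned normally to the caller.
    after() returning : kernelLaunch()
    {
        System.out.println("Kernel launch finished: " + thisJoinPoint);
    }
}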
Chapter 3
Body of work
3.1 Design
Fig 3
Fig 4
Fig 5
3.2 Implementation
We took a bottom-up approach. The focus started with CUDA and then shifted to JCuda and
AspectJ separately, followed by interlinking JCuda and AspectJ.
CUDA
To begin with, a CUDA (v3.1 and beyond) implementation requires a CUDA-enabled Nvidia GPU
card. Our personal machines do not have one, and the CPU alone cannot perform the computations.
The earlier versions did support device emulation, but that compilation path can be quite error-prone,
apart from the poor performance. Hence, we relied upon our mentor's machine, which has a
GeForce GT 620 card with the following features:
o Global memory - 1022 Mbytes
o 1 Multiprocessor (MP)
o 48 CUDA Cores/ MP
o GPU Clock rate - 1620 MHz
o Total amount of shared memory per block - 49152 bytes
o Maximum number of threads per multiprocessor – 1536
o Maximum number of threads per block – 1024
Operating Environment: Ubuntu 12.10 32-bit installed as guest OS on VMware Player 5.0.2 for
local computations and also to SSH to the remote machine (Ubuntu 12.04 64-bit OS) with the GPU
card.
3.3 Procedure
3.3.1 CUDA
Fig 6
CUDA toolkit (v5.5) is downloaded and installed using terminal commands.
The PATH and LD_LIBRARY_PATH environment variables are set for CUDA development.
A CUDA program needs to have the .cu file (with the host and device code) and 3 configuration
files (findcudalib.mk, NsightEclipse.xml and MakeFile) placed in the same folder/directory. The
MakeFile must contain the concerned .cu file name.
Upon using the ‘make’ command, a .o object file and an executable file are created; the executable
contains the compiled code that can be loaded and executed directly by the GPU.
3.3.2 JCuda
Fig 7
JCuda (v0.5.5) libraries have been compiled for CUDA 5.5. We used the binaries for Linux 64-bit,
which contain the JAR files and shared objects (SOs) of all libraries. jcuda-0.5.5.jar is mostly used for
compiling and running the JCuda applications.
 For a minimum JCuda program “jcuda.java” without CUDA kernel code
Compilation: Creates the "jcuda.class" file.
Execution: - Prints the information about the pointer created in the program.
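Such a minimal program might look like the hedged sketch below (the class name is illustrative); it only allocates a few bytes of device memory, prints the resulting pointer and frees it, matching the behaviour described above:

import jcuda.Pointer;
import jcuda.runtime.JCuda;

public class JCudaMinimalSketch
{
    public static void main(String[] args)
    {
        // Allocate 4 bytes on the device, print the pointer, and free it.
        Pointer pointer = new Pointer();
        JCuda.cudaMalloc(pointer, 4);
        System.out.println("Pointer: " + pointer);
        JCuda.cudaFree(pointer);
    }
}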
Fig 8
 For a full fledged JCuda program “Add.java” with separate CUDA kernel code “AddK.cu”
(Manually)
Compilation: - This kernel code is written exactly in the same way as it is done for CUDA, and it
has to be identified and accessed by specifying its name in the source code. It is compiled by the
NVCC compiler to create either a PTX² file or a CUBIN³ file that can be loaded and executed using
the Driver API.
Loading and execution: -
The PTX/CUBIN file has to be loaded, and a pointer to the kernel function has to be obtained.
² A human-readable (but hardly human-understandable) file containing a specific form of "assembler" source code.
³ A "CUDA binary" that contains the compiled code which can directly be loaded and executed by a specific GPU. CUBIN files
are specific to the Compute Capability of the GPU.
Thus, the latest samples prefer the use of PTX files, since they are compiled at runtime for the GPU of the target machine.
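A hedged sketch of these loading and launching steps with the JCuda driver API follows; the PTX file name "AddK.ptx", the kernel name "add", its parameter list and all sizes are assumptions made only for illustration:

import static jcuda.driver.JCudaDriver.*;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.*;

public class AddLaunchSketch
{
    public static void main(String[] args)
    {
        // Initialize the driver API and create a context on device 0.
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        // Load the PTX produced by nvcc and obtain a handle to the kernel.
        CUmodule module = new CUmodule();
        cuModuleLoad(module, "AddK.ptx");
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "add");

        // Allocate device memory for a small input and copy it over.
        int n = 256;
        float hostA[] = new float[n];
        for (int i = 0; i < n; i++) hostA[i] = i;
        CUdeviceptr deviceA = new CUdeviceptr();
        cuMemAlloc(deviceA, n * Sizeof.FLOAT);
        cuMemcpyHtoD(deviceA, Pointer.to(hostA), n * Sizeof.FLOAT);

        // Set up the kernel parameters and launch one block of n threads.
        Pointer kernelParams = Pointer.to(
            Pointer.to(new int[]{n}),
            Pointer.to(deviceA));
        cuLaunchKernel(function,
            1, 1, 1,      // grid dimensions
            n, 1, 1,      // block dimensions
            0, null,      // shared memory size and stream
            kernelParams, null);
        cuCtxSynchronize();

        cuMemFree(deviceA);
    }
}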
3.3.3 AspectJ
Fig 9
Compilation: - The .java and .aj files are listed in a .lst file and the -argfile option is used with ajc.
Execution: - To run the program, the aspectjrt.jar is included in the classpath and java command is
used.
3.3.4 JCuda and AspectJ
Fig 10
To assess the feasibility of AspectJ being compatible with JCuda, the JCuda utility classes JAR
archive was also downloaded. The archive jcudaUtils-0.0.4.jar contains the "KernelLauncher"
class which simplifies the setup and launching of kernels using the JCuda Driver API. It creates
PTX files from inlined source code that is given as a String or from existing CUDA source files.
PTX- or CUBIN files can be loaded and the kernels can be called more conveniently due to
automatic setup of the kernel arguments.
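A hedged sketch of how KernelLauncher can be used is shown below; the inlined kernel source, its name "increment" and the buffer sizes are illustrative assumptions:

import static jcuda.driver.JCudaDriver.*;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUdeviceptr;
import jcuda.utils.KernelLauncher;

public class KernelLauncherSketch
{
    public static void main(String[] args)
    {
        // Inlined CUDA source given as a String; compiled to PTX at runtime.
        String sourceCode =
            "extern \"C\" __global__ void increment(float *data)" + "\n" +
            "{" + "\n" +
            "    int i = blockIdx.x * blockDim.x + threadIdx.x;" + "\n" +
            "    data[i] += 1.0f;" + "\n" +
            "}";
        KernelLauncher launcher = KernelLauncher.compile(sourceCode, "increment");

        // Allocate and fill some device data using the driver API.
        int n = 256;
        float hostData[] = new float[n];
        CUdeviceptr deviceData = new CUdeviceptr();
        cuMemAlloc(deviceData, n * Sizeof.FLOAT);
        cuMemcpyHtoD(deviceData, Pointer.to(hostData), n * Sizeof.FLOAT);

        // Configure one block of n threads and call the kernel directly.
        launcher.setGridSize(1, 1);
        launcher.setBlockSize(n, 1, 1);
        launcher.call(deviceData);

        cuMemcpyDtoH(Pointer.to(hostData), deviceData, n * Sizeof.FLOAT);
        cuMemFree(deviceData);
        System.out.println("hostData[0] = " + hostData[0]);
    }
}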
Compilation: - Again the .java, .cu and .aj files are listed in a .lst file and the -argfile option is used with
the ajc command. The source and target are specified along with the classpath of jcuda.jar as well as
aspectjrt.jar.
Execution: - To run the program, the aspectjrt.jar and jcuda.jar are included in the classpath and
java command is used.
Chapter 4
Related Work
4.1 Alternate technologies
4.1.1 Open Computing Language (OpenCL)
- Another framework for writing programs that execute across heterogeneous platforms
consisting of CPUs, GPUs and other processors
- Consists of a language for writing kernels and APIs to define and control the platforms; a
very primitive tool.
- CUDA is limited to Nvidia hardware and is directly tied to the execution platform, whereas
OpenCL is portable.
- CUDA excels over OpenCL because it outperforms OpenCL when code is natively ported to each.
- CUDA has more mature tools like debugger, profiler, CUBLAS and CUFFT.
4.1.2 Aparapi
- An AMD product.
- Converts Java bytecode to OpenCL at runtime and executes either on the GPU or in Java
thread pool.
4.1.3 Rootbeer
- GPU compiler used for CUDA; an alternative for nvcc
4.1.4 Java Annotations
- Introduced in JDK 1.5; Organized data about the code, embedded within the code itself.
- Options: -
@Before – Run before the method execution
@After – Run after the method returned a result
@AfterReturning – Run after the method returned a result; intercepts the returned result as
well.
@AfterThrowing – Run after the method throws an exception
@Around – Run around the method execution; combines all three advices above.
- Simpler to use than AspectJ, as they do not need load-time weaving or a separate compiler;
AspectJ needs ajc.
- AspectJ supports all pointcuts. It is a more flexible approach and there is little runtime
overhead. With annotations one can only use the method-execution pointcut and there is
more runtime overhead (a short annotation-style sketch follows below).
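For comparison with the code-style aspect sketched in Section 2.3, the same tracing concern written in the annotation style might look like the following hedged sketch (class and method names are hypothetical):

import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Before;
import org.aspectj.lang.annotation.AfterReturning;

// Annotation-style counterpart of the earlier Tracing aspect.
@Aspect
public class TracingAnnotated
{
    @Before("execution(* VectorAdd.launchKernel(..))")
    public void beforeLaunch(JoinPoint jp)
    {
        System.out.println("About to launch the kernel: " + jp);
    }

    @AfterReturning("execution(* VectorAdd.launchKernel(..))")
    public void afterLaunch(JoinPoint jp)
    {
        System.out.println("Kernel launch finished: " + jp);
    }
}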
4.2 Past Projects
Project Sumatra
- An OpenJDK-backed project
- Primary goal: To enable Java applications to take advantage of graphics processing units
(GPUs) and accelerated processing units (APUs), whether they are discrete devices or
integrated with a CPU, to improve performance.
- Approach: Software developers annotate their code to indicate which parts are suited to the
parallel nature of GPUs. When a Java application is run on a system with an OpenCL-
compatible GPU installed, the HotSpot JIT (just-in-time) compiler translates the annotated
bits of code to OpenCL for processing on the GPU rather than the CPU.
- Technical Challenges Solved:
Java allows developers to write once and deploy everywhere, hence its widespread
adoption, but one area where it can fall flat is performance. Generally, Java applications
cannot perform as well as native applications written for a specific OS.
- Remaining Technical Challenges
 mitigate the complexities of present-day GPU backend and layered standards
 build compromise data schemes for both the JVM and GPU hardware
 support flatter data structures (Complex values, vector, 2D arrays)
 support mix of primitives and JVM-managed pointers
 reduce data copying and inter-phase latency between ISA and loop kernels
 apply existing technology on MapReduce (to JVM execution of GPU code)
 interpret the thread-based Java concurrency model
Chapter 5
Observation and findings
The two sets of code written as a part of JCuda and AspectJ perform as well as the original host
code written entirely as a part of JCuda does. Also, there is not much overhead. The interweaving
of code is possible. Thus, the program gets simplified for generic purposes and for anyone who
wants to bypass the device preparation steps.
The device code, however, continues to be a separate entity written in C. At least with the aspect-
oriented paradigm it could not be modified into a more easily accessible or more readily usable
form. Thus, the kernel computation continues to depend on the CUDA/JCuda host segment.
Chapter 6
Limitations
 Our project depends on the availability of and access to a CUDA-enabled GeForce, Tesla or
Quadro GPU, either on the local machine or on some remote machine. Otherwise, implementation or
demonstration of any sort is not possible.
 If availability and access are assured, hardware configurations and compatibilities are quite
specific. The compute capability and the version of the CUDA driver API⁴ play a crucial
role. Further, the driver API is backward but not forward compatible; hence, mixing and matching
versions can fail to execute. Environment variables need to be accurate for every
tool/technique.
 Obtaining output is not straightforward. The in-kernel printf() works like printf() of
traditional C. It is executed like other device-side functions, i.e. in a per-thread manner.
Hence, in a multi-threaded kernel, printf() will be executed by every thread, using the
specified thread data.
The problem arises from the fact that the final formatting of the printf() output has to take
place on the host. The format string must be understood by the host system's compiler
and C library. Although efforts have been made so that the format specifiers supported by
CUDA's printf() form a universal subset of those of the most common host compilers, the exact
behavior is always host-OS-dependent.
⁴ All applications, plug-ins, and libraries on a system must use the same version of the CUDA driver API,
since only one version of the CUDA device driver can be installed on a system. All plug-ins and libraries
used by an application must use the same version of the runtime, and the same version of any libraries
that use the runtime (such as CUFFT, CUBLAS).
Chapter 7
Future Work
Parallel programming models are the need of the hour, but they tend to have a somewhat unpredictable
shelf-life. Because the hardware platforms underneath them change so rapidly with the trends, it
becomes tough to speculate on the precise future of CUDA as it looks today. Nevertheless, much
research is being done in this field.
It is an era where almost every technology and every idea has found or is finding its way to Cloud
Computing. With Internet access availability in the bigger picture, anything that has to do with
data storage, manipulation and computation can eventually become a part of "dynamic web".
So once the checklist covers abstraction, simplification, optimization, etc., scalability and
availability are the features that might bring CUDA more commercial success.
A cloud-based machine with GPU is as good as a local or remote machine with GPU. Hadoop, a
widely-used MapReduce framework, has already been combined with AMD Aparapi. On similar
lines, the scope of on-going/future CUDA projects can be to have an easy-to-use API which allows
easy implementation of MapReduce algorithms that make use of the GPU. Abstraction can again
be a part of this combination as the API can serve dual purpose of hiding the complexity of GPU
programming and leveraging the numerous benefits of Cloud.
Thus, beyond single-GPU development, efforts in this direction can be extended to the domain of
GPU clusters. For instance the project GPMR has taken up this idea in its body of work.
Synopsis:
MapReduce is the toolset deployed for large-dataset processing. As with a regular MapReduce
model, the data-parallel processing is handled by GPUs.
The existing GPU-MapReduce (GPMR library) work targets solo GPUs. Unlike CPUs, GPUs
cannot source or sink network or I/O traffic.
Scope and Possible Implementation:
- Specific extensions for the GPU, including batching Maps and Reduces via Chunking to
maintain GPU utilization
- Adding accumulation to the Map sub stage
- Adding a Partial Reduction sub stage
- Assembling the MapReduce pipeline to achieve a high overlap of communication and
computation.
Areas of concern:
- Programming multi-GPU clusters lacks powerful toolsets and APIs.
- GPU is often treated as a slave device in most GPU-computing applications.
- GPMR is stand-alone and does not sit atop Hadoop or another MapReduce package. It does
not handle fault tolerance. It does not provide a distributed file system (Hadoop Distributed
File System to be precise).
Chapter 8
Conclusion
The primary aim of the project, which was to assess the feasibility of breaking down existing
code into two entities and still get accurate results, is served well. The primitive idea of achieving
parallelism with CUDA has now matured into a more sophisticated one. Paradigms like Object
Oriented Programming and Aspect Oriented Programming have graciously complemented CUDA
without diminishing the power of this technology.
Just as JCuda has brought commercial success to CUDA, products like PyCUDA have done the
same by offering flavors of other paradigms, such as a multi-paradigm approach encompassing object-
oriented, imperative, procedural and reflective programming. FORTRAN CUDA, CUDA.NET,
KappaCUDA: examples abound.
The list of programming paradigms, compiling/weaving tools, cloud computing techniques and
other existing techniques is exhaustive. Further, CUDA is not the only technology in the parallel
computing race. Thus, to conform to software quality metrics and to be certified as 'fit for
purpose', any technique, in its full-fledged form, will have to undergo experimentation. Every
permutation and combination will contribute to this field.
References
Websites: -
1. Official NVIDIA CUDA Home Page
http://www.nvidia.in/object/cuda_home_new.html
2. Official Eclipse AspectJ Home Page
https://www.eclipse.org/aspectj/doc/next/devguide/ajc-ref.html
3. Official JCuda Home Page
http://www.jcuda.org/tutorial/TutorialIndex.html
Journals/Research Papers: -
1. Aspect-Oriented Programming Beyond Dependency Injection By Shigeru Chiba and Rei
Ishikawa, Dept. of Mathematical and Computing Sciences, Tokyo Institute of Technology
(2008)
2. JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA By
Yonghong Yan, Max Grossman, and Vivek Sarkar, Department of Computer Science, Rice
University (2009)
3. MITHRA: Multiple data Independent Tasks on a Heterogeneous Resource Architecture By
Reza Farivar, Abhishek Verma, Ellick M. Chan, Roy H. Campbell, Department of
Computer Science, University of Illinois at Urbana-Champaign 201 N Goodwin Ave,
Urbana, IL 61801-2302.
4. Tangling and scattering By Gregor Kiczales, John Lamping, Anurag Mendhekar, Chris
Maeda, Cristina Lopes, Jean-Marc Loingtier and John Irwin, Xerox Palo Alto Research
Center, 3333 Coyote Hill Road, Palo Alto, CA 94304, USA.
Mais conteúdo relacionado

Destaque

arvaipeter_teljes_végleges
arvaipeter_teljes_véglegesarvaipeter_teljes_végleges
arvaipeter_teljes_végleges
Péter Árvai
 
Slides for oer panel
Slides for oer panelSlides for oer panel
Slides for oer panel
markmatsalla
 
IRJET-Lymphoma Neoplasm Computable scrutiny of Multi images on Gaussian Disse...
IRJET-Lymphoma Neoplasm Computable scrutiny of Multi images on Gaussian Disse...IRJET-Lymphoma Neoplasm Computable scrutiny of Multi images on Gaussian Disse...
IRJET-Lymphoma Neoplasm Computable scrutiny of Multi images on Gaussian Disse...
IRJET Journal
 

Destaque (10)

arvaipeter_teljes_végleges
arvaipeter_teljes_véglegesarvaipeter_teljes_végleges
arvaipeter_teljes_végleges
 
Slides for oer panel
Slides for oer panelSlides for oer panel
Slides for oer panel
 
Plantilla Informe-Tecnico LA- Lizet Rivera Contreras
Plantilla Informe-Tecnico LA- Lizet Rivera ContrerasPlantilla Informe-Tecnico LA- Lizet Rivera Contreras
Plantilla Informe-Tecnico LA- Lizet Rivera Contreras
 
IRJET-Lymphoma Neoplasm Computable scrutiny of Multi images on Gaussian Disse...
IRJET-Lymphoma Neoplasm Computable scrutiny of Multi images on Gaussian Disse...IRJET-Lymphoma Neoplasm Computable scrutiny of Multi images on Gaussian Disse...
IRJET-Lymphoma Neoplasm Computable scrutiny of Multi images on Gaussian Disse...
 
محاضرة تكييف
محاضرة تكييفمحاضرة تكييف
محاضرة تكييف
 
Great souvenir llc_2017
Great souvenir llc_2017Great souvenir llc_2017
Great souvenir llc_2017
 
C.V
C.VC.V
C.V
 
13.02.2017
13.02.201713.02.2017
13.02.2017
 
THERMODYNAMIC ANALYSIS OF YEAR ROUND AIR CONDITIONING SYSTEM FOR VARIABLE WET...
THERMODYNAMIC ANALYSIS OF YEAR ROUND AIR CONDITIONING SYSTEM FOR VARIABLE WET...THERMODYNAMIC ANALYSIS OF YEAR ROUND AIR CONDITIONING SYSTEM FOR VARIABLE WET...
THERMODYNAMIC ANALYSIS OF YEAR ROUND AIR CONDITIONING SYSTEM FOR VARIABLE WET...
 
Powerfull point ala wenni
Powerfull point ala wenniPowerfull point ala wenni
Powerfull point ala wenni
 

Semelhante a IIT ropar_CUDA_Report_Ankita Dewan

Pycon2014 GPU computing
Pycon2014 GPU computingPycon2014 GPU computing
Pycon2014 GPU computing
Ashwin Ashok
 
Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045
Editor IJARCET
 
Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045
Editor IJARCET
 
181114051_Intern Report (11).pdf
181114051_Intern Report (11).pdf181114051_Intern Report (11).pdf
181114051_Intern Report (11).pdf
ToshikJoshi
 

Semelhante a IIT ropar_CUDA_Report_Ankita Dewan (20)

A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONSA SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS
 
Cuda lab manual
Cuda lab manualCuda lab manual
Cuda lab manual
 
Pycon2014 GPU computing
Pycon2014 GPU computingPycon2014 GPU computing
Pycon2014 GPU computing
 
Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045
 
Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045
 
Graphics processing unit ppt
Graphics processing unit pptGraphics processing unit ppt
Graphics processing unit ppt
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
 
Graphics Processing Unit: An Introduction
Graphics Processing Unit: An IntroductionGraphics Processing Unit: An Introduction
Graphics Processing Unit: An Introduction
 
CUDA
CUDACUDA
CUDA
 
GPU Computing: An Introduction
GPU Computing: An IntroductionGPU Computing: An Introduction
GPU Computing: An Introduction
 
Auto conversion of serial C code to CUDA code
Auto conversion of serial C code to CUDA codeAuto conversion of serial C code to CUDA code
Auto conversion of serial C code to CUDA code
 
CUDA Sessions You Won't Want to Miss at GTC 2019
CUDA Sessions You Won't Want to Miss at GTC 2019CUDA Sessions You Won't Want to Miss at GTC 2019
CUDA Sessions You Won't Want to Miss at GTC 2019
 
openCL Paper
openCL PaperopenCL Paper
openCL Paper
 
COMPARING PROGRAMMER PRODUCTIVITY IN OPENACC AND CUDA: AN EMPIRICAL INVESTIGA...
COMPARING PROGRAMMER PRODUCTIVITY IN OPENACC AND CUDA: AN EMPIRICAL INVESTIGA...COMPARING PROGRAMMER PRODUCTIVITY IN OPENACC AND CUDA: AN EMPIRICAL INVESTIGA...
COMPARING PROGRAMMER PRODUCTIVITY IN OPENACC AND CUDA: AN EMPIRICAL INVESTIGA...
 
OpenACC Monthly Highlights: January 2021
OpenACC Monthly Highlights: January 2021OpenACC Monthly Highlights: January 2021
OpenACC Monthly Highlights: January 2021
 
COMPARING PROGRAMMER PRODUCTIVITY IN OPENACC AND CUDA: AN EMPIRICAL INVESTIGA...
COMPARING PROGRAMMER PRODUCTIVITY IN OPENACC AND CUDA: AN EMPIRICAL INVESTIGA...COMPARING PROGRAMMER PRODUCTIVITY IN OPENACC AND CUDA: AN EMPIRICAL INVESTIGA...
COMPARING PROGRAMMER PRODUCTIVITY IN OPENACC AND CUDA: AN EMPIRICAL INVESTIGA...
 
COMPARING PROGRAMMER PRODUCTIVITY IN OPENACC AND CUDA: AN EMPIRICAL INVESTIGA...
COMPARING PROGRAMMER PRODUCTIVITY IN OPENACC AND CUDA: AN EMPIRICAL INVESTIGA...COMPARING PROGRAMMER PRODUCTIVITY IN OPENACC AND CUDA: AN EMPIRICAL INVESTIGA...
COMPARING PROGRAMMER PRODUCTIVITY IN OPENACC AND CUDA: AN EMPIRICAL INVESTIGA...
 
OpenACC Monthly Highlights: November 2020
OpenACC Monthly Highlights: November 2020OpenACC Monthly Highlights: November 2020
OpenACC Monthly Highlights: November 2020
 
Amd fusion apus
Amd fusion apusAmd fusion apus
Amd fusion apus
 
181114051_Intern Report (11).pdf
181114051_Intern Report (11).pdf181114051_Intern Report (11).pdf
181114051_Intern Report (11).pdf
 

IIT ropar_CUDA_Report_Ankita Dewan

  • 1. PROJECT REPORT (PROJECT SEMESTER TRAINING) Object Oriented and Aspect Oriented Programming with Cuda Submitted by Ankita Dewan Roll No. 101053004 Under the Guidance of Dr. Ashutosh Mishra Dr. Balwinder Sodhi Assistant Professor, Dept of CSE, Assistant Professor, Dept of CSE, Thapar University, Patiala. IIT Ropar. Department of Computer Science and Engineering THAPAR UNIVERSITY, PATIALA Jan-May 2014
  • 2. DECLARATION I hereby declare that the project work entitled “Object Oriented and Aspect Oriented Programming with Cuda” is an authentic record of my own work carried out at IIT Ropar as requirements of project semester term for the award of degree of B.E. (Computer Science & Engineering), Thapar University, Patiala, under the guidance of Dr. Ashutosh Mishra and Dr. Balwinder Sodhi, during 5th Jan to 28th May, 2014. Ankita Dewan 101053004 Date: 30th May, 2014 Certified that the above statement made by the student is correct to the best of our knowledge and belief. Dr. Ashutosh Mishra Dr. Balwinder Sodhi Assistant Professor, Dept of CSE, Assistant Professor, Dept of CSE, Thapar University, Patiala. IIT Ropar.
  • 3. Acknowledgment I take this opportunity to express my heartfelt gratitude to my mentor Dr. Balwinder Sodhi for his constant support. His priceless suggestions, ideas and expertise helped me better the quality of my project. He has been extremely supportive throughout the course of my internship for which I express my deep and sincere gratitude. I appreciate all the help and support given to me by my internship colleague Anusha Vangala from Siddhartha Institute of Technology Vijayavada, Andhra Pradesh and all others from Computer Science department who helped me avail the numerous facilities. My acknowledgement would be incomplete without thanking my parents for their constant love and support and being there by my side through thick and thin.
  • 4. Abstract One of the primary aims of computer science is simplification and facilitation. There is a constant drive to introduce abstraction and/or virtualization so that the primitive building blocks of any technology are preserved in a constructive and sophisticated manner. Here on, it becomes easier to add/modify features to the technology. Performance is an indispensable requirement. In the context of processing and computations, parallel processing proves to be faster. Technologies like NVIDIA CUDA enable the user to send C/C++/Fortran code (depending on the technology) straight to GPU with no assembly language required. So far, papers and applications mostly in academia/institutes like CERN have experimented and used this technology to describe how performance of certain algorithms improves by implementing them on CUDA. GPUs have been targeted for games but here again CUDA has not found its use on a commercial basis. Many programmers are of the opinion CUDA is not “elegant”. Writing a "hello world" program in CUDA can be a day of struggle just to get things working. And for someone who has lesser knowledge of these techniques or wants to not get into the details of it, simplification, facilitation and convenience must come into the picture. Our project aims to simplify the manner in which CUDA is presently available with other techniques that can compliment it without taking away the very essence of it.
  • 5. Institute Profile Indian Institute of Technology Ropar, established in 2008, is one of the eight new IITs set up by the Ministry of Human Resource Development (MHRD), Government of India, to expand the reach and enhance the quality of technical education in the country. The institute is committed to providing state-of-the-art technical education in a variety of fields and also for facilitating transmission of knowledge in keeping with latest developments. At present, the institute offers Bachelor of Technology (B. Tech.) program in Computer Science and Engineering, Electrical Engineering, and Mechanical Engineering. The institute is keen to establish Central Research Facility. PhD program was started so that the research environment is further augmented, expanded, and made even more vibrant. My internship under the Department of Computer Science and Engineering helped me appreciate the value of hands-on training and design. I got to work under excellent facilities.
  • 6. Nomenclature GPU……………………………………….……..…………...................... Graphics processing unit GPGPU………………………………………………….. General purpose graphics processing unit CUDA……………….............................................................Compute Unified Device Architecture JCuda………………………………………………………….………………………....Java CUDA AOP……………………..……………………………………............Aspect-oriented programming AJC……………………………………………………………………….………..AspectJ Compiler SPMD…………………………………………………………………Single program, multiple data ISA………………. …………………………………………………….Instruction Set Architecture
  • 7. Table of Contents Chapter 1 Introduction 1.1. Motivation………………………………………………………………………..1 1.2. Problem Statement……………………………………………………………….1 1.3. Work Plan…………………………….................................................................2 Chapter 2 Background 2.1 GPU and CUDA…………………………………………………………………3 2.2 JCuda…………………………………………………………………………….5 2.3 Aspect Oriented Programming and AspectJ……………………………………..6 Chapter 3 Body of Work 3.1 Design……………………………………………………………………………7 3.2 Implementation…………………………………………………………………..10 3.3 Procedure………………………………………………………………………..11 Chapter 4 Related Works 4.1 Alternate Technologies……………………………………………………………16 4.2 Past Projects……………………………………………….…………………….17 Chapter 5 Observation and Findings…………………………………..…………………………18 Chapter 6 Limitations …………………………………………………..………………………..19 Chapter 7 Future Work………………………………………………….……………………….20 Chapter 8 Conclusion……………………………………………………………………………..22 References……………………………………………………………………………..23
  • 8. Table of figures Fig 1 CPU is composed of only a few cores that can handle fewer threads at a time. GPU is composed of many cores that handle thousands of threads simultaneously…….3 Fig 2 CUDA stages…………………………………………………………………………….5 Fig 3 Activity Diagram for computing heterogeneous programs………………………………7 Fig 4 Activity Diagram for CUDA program…………………………………………………...8 Fig 5 Entity Relationship Diagram for CPU, CUDA, JCuda and GPU……………………….9 Fig 6 CUDA Sample Screenshot………………………………………………………….…...11 Fig 7 JCuda Sample Screenshot_1…………………………………………………………….12 Fig 8 JCuda Sample Screenshot_2…………………………………………………………….13 Fig 9 AspectJ Sample Screenshot……………………………………………………………..14 Fig 10 JCuda and Aspectj Sample Screenshot………………………………………………….15
  • 9. [1] Chapter 1 Introduction 1.1 Motivation The likelihood of shifting from traditional CPUs to parallel hybrid platforms, such as Multi-core CPUs accelerated with heterogeneous GPU co-processing systems, is as much as it was when the hardware field switched over to Multi-threading and Multi-core CPUs. Although it is much about the hardware functionality, it does impact the software entities and thus the programmers. There is a need to modify existing programs such that they can be properly parallelized to reap benefits of advanced processing architectures. Nvidia invented CUDA (Compute Unified Device Architecture) as a parallel computing platform and programming model to increase computing performance by harnessing the power of the graphics processing unit (GPU). So far so good. The next desirable move:- Use technologies which when combined with CUDA can make it easier to use and cater the needs of a larger domain of programmers/users. In technical terms the idea is to abstract out the details of GPU computations. 1.2 Problem Statement CUDA alone leads to tangled source code. Our source code is a combination of:-  Code for the core kernel computation for device  Code for kernel management by the host; additionally it contains the code for data transfers between memory spaces, and various optimizations. In our project, we work on a programming system based on the principles of Object-Oriented and Aspect-Oriented Programming. The motive is to un-clutter the code to improve programmability.
  • 10. [2] 1.3 Work Plan CUDA source code is written entirely in C. That is, the host as well as device code are meted out in C. Our approach:- The Object-Oriented language, Java bindings (JCuda) in this case, is used to handle the host computations. The Aspect-Oriented language, AspectJ in this case, is used to encapsulate all other support functions, such as parallelization granularity and memory access optimization. The kernel code remains in C as the device code. In the last stage, aspect compiler (ajc) is used to combine the core Object Oriented program with aspects to generate parallelized programs.
  • 11. [3] Chapter 2 Background 2.1 GPU and CUDA CPU and GPU are designed differently. While CPU has Latency Oriented Cores a GPU has Throughput Oriented Cores. CPU is essentially “the master” throughout the operation domain. It has powerful ALUs but with reduced operation latency and large caches that convert long latency memory accesses to short latency cache accesses. It has a sophisticated control system. GPU, on the other hand, has small caches which boost memory throughput. It has energy efficient ALUs that are heavily pipelined for high throughput and are more in number. It has simpler control system. Fig 1 This field is essentially a part of parallel computing but is heterogeneous in nature as it deals with serial parts and parallel parts. CUDA achieves parallelism by SPMD. A thread is a virtualized processor. It follows the instruction cycle (Fetch, Decode, and Execute). For faster execution of arrays of parallel threads, CUDA kernel is used such that all threads in a grid run the same kernel code. Each thread has indexes to decide what data to work on and to make control decisions. i = blockIdx.x * blockDim.x + threadIdx.x A thread array is divided into multiple blocks which have access to shared memory.
  • 12. [4] The two types of memories in CPU-GPU architecture are global memory and shared memory. The device can read/write shared as well as global memory. The host can transfer data to/from global memory. Also, the contents of global memory are visible to all the threads of grid. Any thread can read and write to any location of the global memory. Shared memory is separate for each block of the grid. Any thread of a block can read and write to the shared memory of that block. A thread in one block cannot access shared memory of another block. Shared memory is faster to access than global memory. A typical CUDA program includes steps as: - 1. Device memory allocation for input and output entities. Using cudaMalloc() that allocates object in the device global memory and requires 2 parameters - Address of a pointer to the allocated object - Size of allocated object in bytes. 2. Copying from host memory to device memory Using cudaMemcpy() that requires four parameters:- - Pointer to destination - Pointer to source - Number of bytes copied - Type/Direction of transfer (cudaMemcpyHostToDevice) 3. Launching the Kernel code from Host. KernelName<<<dimGrid, dimBlock>>>(m, n, k, d_A, d_B, d_C); 4. Copying back output entity from the device memory to host memory Using cudaMemcpy() with direction of transfer as cudaMemcpyDeviceToHost 5. Free device memory. - Using cudaFree() that frees object from device global memory. CUDA Function Declarations:- • __global__ defines a kernel function that must return void and is callable from host • __device__ defines kernel function that need not be void; callable from device itself. • __host__ defines host function with any return type; callable from host itself.
  • 13. [5] A few on-going CUDA projects are CudaRasterization, Monte Carlo with CUDA, MD 5 Hash Crack in CUDA, Cloth Simulation, and Image Tracking etc. Fig 2 2.2 JCuda Java is the one of the most commercially used programming language and if preferred by programmers of all origins. It is class-based and object-oriented. It provides programmers and developers the option to "write once, run anywhere" (WORA). Thus, it became a favorable option to bind Java to a library which acts as an application programming interface (API) and also provides basic code to use that library for CUDA. JCuda provides all essential Java bindings for the CUDA runtime and driver API. It acts as the base for all other libraries like JCublas, JCufft etc. It lets the host interact with a CUDA device. It provides methods which provide the basic steps of CUDA in a sequential manner. These methods include device management, event management, memory allocation on the device and copying memory between the device and the host system.
  • 14. [6] 2.3 Aspect Oriented Programming and AspectJ Aspect oriented programming aims to modularize 1 crosscutting concerns as object−oriented programming does across with common concerns. It aims to deal with two issues: - scattered code and tangled code. The power of OOP diminishes beyond encapsulation, abstraction, polymorphism etc. Here on AOP addresses the problems by using more manageable modules – aspects. Also, unlike OOP, AOP does not replace previous programming paradigms. Rather it is complementary to the object- oriented paradigm and not a replacement. AspectJ is an implementation of AOP for Java. It adds to Java the following concepts.  Join Point: A well−defined point in the program flow.  Point cut: Construct to select certain join points and values at those points. - call: identifies any call to the methods defined by object. - cflow: identifies join points based if they occur in the dynamic context of another pointcut. - execution: when a particular method body executes. - target: when the target object is of some parameter type. - this: when the object currently executing (i.e. this) is of some parameter type . - within: when the executing code belongs to the class.  Advice: Defines code that is executed when a point cut is reached; dynamic parts of AspectJ. - Before: Runs when a join point is reached and before the computation proceeds, i.e. it runs when computation reaches the method call and before the actual method starts running. - After: Runs after the computation 'under the join point' finishes, i.e. after the method body has run, and just before control is returned to the caller. - Around: Runs when the join point is reached, and has explicit control over whether the computation under the join point is allowed to run at all.  Introduction: Modifies a program's static structure, namely, the members of its classes and the relationship between classes.  Aspect: AspectJ's unit of modularity for crosscutting concerns; defined in terms of point cuts, advice and introduction. 1 Logging, authorization, synchronization, error handling and transaction management exemplify crosscutting concerns because such strategies necessarily affect more than one part of the system. Logging, for instance, crosscuts all logged classes and methods.
  • 15. [7] Chapter 3 Body of work 3.1 Design Fig 3
  • 18. [10] 3.2 Implementation We took a bottom-up approach. The focus started with CUDA and then it got shifted to JCuda and AspectJ separately followed by interlinking JCuda and AspectJ. CUDA To begin with, CUDA (v3.1 and beyond) implementation requires a CUDA-enabled Nvidia GPU card. Our personal machines do not have one and CPU alone cannot perform the computations. The earlier versions did support device emulation but the compilation can be quite error prone apart from the poor performance factor. Hence, we relied upon our mentor’s machine that has GeForce GT 620 driver with following features: o Global memory - 1022 Mbytes o 1 Multiprocessor (MP) o 48 CUDA Cores/ MP o GPU Clock rate - 1620 MHz o Total amount of shared memory per block - 49152 bytes o Maximum number of threads per multiprocessor – 1536 o Maximum number of threads per block – 1024 Operating Environment: Ubuntu 12.10 32-bit installed as guest OS on VMware Player 5.0.2 for local computations and also to SSH to the remote machine (Ubuntu 12.04 64-bit OS) with the GPU card.
  • 19. [11] 3.3 Procedure 3.3.1 CUDA Fig 6 CUDA toolkit (v5.5) is downloaded and installed using terminal commands. The PATH and LD_LIBRARY_PATH environment variables are set for CUDA development. A CUDA program needs to have the .cu file (with the host and device code) and 3 configuration files (findcudalib.mk, NsightEclipse.xml and MakeFile) placed in the same folder/directory. The MakeFile must contain the concerned .cu file name. Upon using ‘make’ command, a .o object file and an executable file are created. Now the A "CUDA binary" and contains the compiled code that can directly be loaded and executed by a specific GPU.
  • 20. [12] 3.3.2 JCuda Fig 7 JCuda (v0.5.5) libraries have been compiled for CUDA 5.5. We used Binaries for Linux 64bit. It contains JAR files and SOs of all libraries. jcuda-0.5.5.jar is mostly used for compilation and running the JCuda applications.  For a minimum JCuda program “jcuda.java” without CUDA kernel code Compilation: Creates the "jcuda.class" file. Execution: - Prints the information about pointer created in the program.
  • 21. [13] Fig 8  For a full fledged JCuda program “Add.java” with separate CUDA kernel code “AddK.cu” (Manually) Compilation: - This kernel code is written exactly in the same way as it is done for CUDA and it has to be identified and accessed by specifying its name in the source code. It is compiled by the NVCC compiler to create either a 2 PTX file or 3 CUBIN file that can be loaded and executed using the Driver API. Loading and execution: - The PTX/CUBIN file has to be loaded, and a pointer to the kernel function has to be obtained 2 A human-readable (but hardly human-understandable) file containing a specific form of "assembler" source code. 3 A "CUDA binary" and contains the compiled code that can directly be loaded and executed by a specific GPU. They are specific for the Compute Capability of the GPU. Thus, latest samples prefer the use of PTX files, since they are compiled at runtime for the GPU of the target machine.
  • 22. [14] 3.3.3 AspectJ Fig 9 Compilation: - The .java and .aj files are listed in .lst file and –arglist option is used with ajc Execution: - To run the program, the aspectjrt.jar is included in the classpath and java command is used.
  • 23. [15] 3.3.4 JCuda and AspectJ Fig 10 To speculate the feasibility of AspectJ being compatible with JCuda, JCuda Utility classes JAR archive were also downloaded. The archive jcudaUtils-0.0.4.jar contains the "KernelLauncher" class which simplifies the setup and launching of kernels using the JCuda Driver API. It creates PTX files from inlined source code that is given as a String or from existing CUDA source files. PTX- or CUBIN files can be loaded and the kernels can be called more conveniently due to automatic setup of the kernel arguments. Compilation: - Again the .java, .cu and .aj files are listed in .lst file and –argfile option is used with ajc command. The source and target are specified along with the classpath of jcuda.jar as well as aspectjrt.jar Execution: - To run the program, the aspectjrt.jar and jcuda.jar are included in the classpath and java command is used.
  • 24. [16] Chapter 4 Related Work 4.1 Alternate technologies 4.1.1 Open Computing Language (OpenCL) - Another framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs and other processors - Consists of a language for writing kernels and APIs to define and control the platforms; A very primitive tool. - CUDA which is limited to Nvidia hardware and is directly connected to the execution platform but OpenCL is portable. - CUDA excels over OpenCL because it outperforms OpenCL when natively ported. - CUDA has more mature tools like debugger, profiler, CUBLAS and CUFFT. 4.1.2 Aparapi - An AMD product. - Converts Java bytecode to OpenCL at runtime and executes either on the GPU or in Java thread pool. 4.1.3 Rootbeer - GPU compiler used for CUDA; an alternative for nvcc 4.1.4 Java Annotations - Introduced in JDK 1.5; Organized data about the code, embedded within the code itself. - Options: - @Before – Run before the method execution @After – Run after the method returned a result @AfterReturning – Run after the method returned a result intercept the returned result as well. @AfterThrowing – Run after the method throws an exception @Around – Run around the method execution, combine all three advices above.
  • 25. [17] - Simpler to use than AspectJ as they do not need load-time weaving or separate complier. AspectJ needs ajc. - AspectJ supports all pointcuts. It is a more flexible approach and there is little runtime overhead. With annotations one can only use method-execution pointcut and there is more runtime overhead. 4.2 Past Projects Project Sumatra - An OpenJDK-backed project - Primary goal: To enable Java applications to take advantage of graphics processing units (GPUs) AND accelerated processing units (APUs)--whether they are discrete devices or integrated with a CPU--to improve performance. - Approach: Software developers annotate their code to indicate which is suited to the parallel nature of GPUs. When Java application is run on a system with an OpenCL- compatible GPU installed, the HotSpot JIT (just-in-time) compiler translates the annotated bits of code to OpenCL for processing on the GPU rather than the CPU. - Technical Challenges Solved: Java allows developers to write once and deploy everywhere and hence its widespread nature, but one area where it can fall flat is performance. Generally, Java applications cannot perform as well as native applications written for a specific OS. - Remaining Technical Challenges  mitigate the complexities of present-day GPU backend and layered standards  build compromise data schemes for both the JVM and GPU hardware  support flatter data structures (Complex values, vector, 2D arrays)  support mix of primitives and JVM-managed pointers  reduce data copying and inter-phase latency between ISA and loop kernels  apply existing technology on MapReduce (to JVM execution of GPU code)  interpret the thread-based Java concurrency model
Chapter 5 Observations and Findings

The two sets of code written with JCuda and AspectJ perform as well as the original host code written entirely in JCuda, and the weaving adds little overhead. Interweaving the code is therefore possible: the program becomes simpler for generic purposes and for anyone who wants to bypass the device-preparation steps, since those steps can be factored out into an aspect (a sketch of this idea follows below). The device code, however, continues to be a separate entity written in C. At least with the aspect-oriented paradigm it could not be modified into a more easily accessible or more readily usable form, so the kernel computation continues to depend on the CUDA/JCuda host segment.
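The sketch below shows one way the device-preparation steps could be pulled out of the main program. It uses the annotation style for brevity, whereas the project itself used code-style .aj aspects compiled with ajc; the KernelRunner.launch() method being advised is a hypothetical name.

    import jcuda.driver.CUcontext;
    import jcuda.driver.CUdevice;
    import jcuda.driver.JCudaDriver;
    import org.aspectj.lang.annotation.Aspect;
    import org.aspectj.lang.annotation.Before;

    @Aspect
    public class DevicePreparationAspect {

        private boolean initialized = false;

        // Prepare the CUDA device before any kernel-launching method runs,
        // so the calling code never has to contain these steps itself.
        @Before("execution(* KernelRunner.launch(..))")
        public void prepareDevice() {
            if (initialized) {
                return;
            }
            JCudaDriver.setExceptionsEnabled(true);  // fail fast on driver errors
            JCudaDriver.cuInit(0);                   // initialize the driver API
            CUdevice device = new CUdevice();
            JCudaDriver.cuDeviceGet(device, 0);      // select the first CUDA device
            CUcontext context = new CUcontext();
            JCudaDriver.cuCtxCreate(context, 0, device);
            initialized = true;
        }
    }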
Chapter 6 Limitations

- Our project depends on the availability of, and access to, a CUDA-enabled GeForce, Tesla or Quadro GPU, either on the local machine or on some remote machine. Otherwise, implementation or demonstration of any sort is not possible.
- Even when availability and access are assured, the hardware configurations and compatibilities are quite specific. The compute capability and the version of the CUDA driver API play a crucial role. The driver API is backward compatible but not forward compatible, so mixing and matching versions will fail to execute; in particular:
  - all applications, plug-ins and libraries on a system must use the same version of the CUDA driver API, since only one version of the CUDA device driver can be installed on a system;
  - all plug-ins and libraries used by an application must use the same version of the runtime;
  - all plug-ins and libraries used by an application must use the same version of any libraries that use the runtime (such as CUFFT and CUBLAS).
  Environment variables also need to be accurate for every tool/technique.
- Obtaining output is not straightforward. The in-kernel printf() works like the printf() of traditional C and is executed like other device-side functions, i.e. on a per-thread basis. In a multi-threaded kernel, printf() is therefore executed by every thread, using that thread's data. The problem arises from the fact that the final formatting of the printf() output has to take place on the host: the format string must be understood by the host system's compiler and C library. Although efforts have been made so that the format specifiers supported by CUDA's printf() form a common subset of the most widespread host compilers, the exact behaviour is always host-OS-dependent. The per-thread behaviour is illustrated by the sketch below.
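A small, illustrative sketch of the per-thread behaviour, again using inlined kernel source with the KernelLauncher; the class name PrintfDemo and the assumption that compile() accepts extra nvcc arguments such as -arch=sm_20 are ours, not from the project code.

    import jcuda.driver.JCudaDriver;
    import jcuda.utils.KernelLauncher;

    public class PrintfDemo {
        public static void main(String[] args) {
            // Each of the 8 threads below executes the same printf() with its own
            // threadIdx value; the output is buffered on the device and formatted
            // on the host, which is why the format string must be acceptable to the
            // host compiler and C library.
            String src =
                "extern \"C\" __global__ void hello()" +
                "{ printf(\"hello from thread %d\\n\", threadIdx.x); }";
            KernelLauncher launcher =
                KernelLauncher.compile(src, "hello", "-arch=sm_20");
            launcher.setGridSize(1, 1);
            launcher.setBlockSize(8, 1, 1);
            launcher.call();

            // Synchronizing flushes the device-side printf buffer to the host.
            JCudaDriver.cuCtxSynchronize();
        }
    }

On a working setup this prints eight lines, one per thread, but the ordering of the lines is not guaranteed.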
Chapter 7 Future Work

Parallel programming models are the need of the hour, but they tend to have a somewhat unpredictable shelf life. Because the hardware platforms underneath them change so rapidly with the trends, it is hard to speculate on the precise future of CUDA as it looks today. Nevertheless, much research is being done in this field.

This is an era in which almost every technology and every idea has found, or is finding, its way to cloud computing. With Internet access in the bigger picture, anything that has to do with data storage, manipulation and computation can eventually become part of the "dynamic web". So once the checklist covers abstraction, simplification, optimization and so on, scalability and availability are the features that might bring CUDA more commercial success: a cloud-based machine with a GPU is as good as a local or remote machine with a GPU.

Hadoop, a widely used MapReduce framework, has already been combined with AMD's Aparapi. Along similar lines, the scope of ongoing/future CUDA projects can be an easy-to-use API which allows easy implementation of MapReduce algorithms that make use of the GPU (a purely hypothetical sketch of such an API is given at the end of this chapter). Abstraction can again be part of this combination, as the API can serve the dual purpose of hiding the complexity of GPU programming and leveraging the numerous benefits of the cloud. Thus, beyond single-GPU development, efforts in this direction can be extended to the domain of GPU clusters. The project GPMR, for instance, has taken up this idea in its body of work.

Synopsis: MapReduce is the toolset deployed for large-dataset processing. As with a regular MapReduce model, the data-parallel processing is handled by GPUs. The existing GPU-MapReduce (GPMR library) work targets solo GPUs. Unlike CPUs, GPUs cannot source or sink network or I/O streams.
Scope and possible implementation:
- Specific extensions for the GPU, including batching of Maps and Reduces via chunking to maintain GPU utilization.
- Adding accumulation to the Map sub-stage.
- Adding a partial-reduction sub-stage.
- Assembling the MapReduce pipeline so as to achieve a high overlap of communication and computation.

Areas of concern:
- Programming multi-GPU clusters lacks powerful toolsets and APIs.
- The GPU is treated as a slave device in most GPU-computing applications.
- GPMR is stand-alone and does not sit atop Hadoop or another MapReduce package; it does not handle fault tolerance and does not provide a distributed file system (the Hadoop Distributed File System, to be precise).
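Purely as an illustration of what such an easy-to-use API might look like, the following is a hypothetical Java interface; none of these types exist in GPMR, Hadoop or JCuda, and the CUDA source strings stand in for the GPU-side map and reduce stages.

    import java.util.List;
    import java.util.Map;

    // Hypothetical facade for a GPU-backed MapReduce job. The map and reduce
    // stages are supplied as CUDA source strings that a framework would
    // compile and launch, e.g. through a KernelLauncher-style utility.
    public interface GpuMapReduceJob<K, V> {

        // CUDA source for the map kernel, applied to each input chunk on the GPU.
        GpuMapReduceJob<K, V> withMapKernel(String cudaSource);

        // CUDA source for the reduce kernel, applied per key after shuffling on the host.
        GpuMapReduceJob<K, V> withReduceKernel(String cudaSource);

        // Size of the chunks batched onto the GPU to keep it fully utilized.
        GpuMapReduceJob<K, V> withChunkSize(int elementsPerChunk);

        // Runs the job over the input and returns the reduced key/value pairs.
        Map<K, V> run(List<V> input);
    }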
Chapter 8 Conclusion

The primary aim of the project, which was to assess the feasibility of breaking existing code down into two entities while still obtaining accurate results, has been served well. The primitive idea of achieving parallelism with CUDA has now matured into a more sophisticated one: paradigms like object-oriented programming and aspect-oriented programming have graciously complemented CUDA without diminishing the power of this technology.

Just as JCuda has brought commercial success to CUDA, products like PyCUDA have done the same by adding the flavour of other paradigms, in that case a multi-paradigm approach encompassing object-oriented, imperative, procedural and reflective programming. FORTRAN CUDA, CUDA.NET and KappaCUDA are further examples; examples abound. The list of programming paradigms, compiling/weaving tools, cloud-computing techniques and other existing techniques is extensive. Further, CUDA is not the only technology in the parallel-computing race. Thus, to conform to software quality metrics and to be certified as 'fit for purpose', any technique, in its full-fledged form, will have to undergo experimentation, and every permutation and combination will contribute to this field.
References

Websites:
1. Official NVIDIA CUDA home page: http://www.nvidia.in/object/cuda_home_new.html
2. Official Eclipse AspectJ documentation, ajc reference: https://www.eclipse.org/aspectj/doc/next/devguide/ajc-ref.html
3. Official JCuda tutorial: http://www.jcuda.org/tutorial/TutorialIndex.html

Journals/Research papers:
1. "Aspect-Oriented Programming Beyond Dependency Injection", Shigeru Chiba and Rei Ishikawa, Dept. of Mathematical and Computing Sciences, Tokyo Institute of Technology (2008).
2. "JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA", Yonghong Yan, Max Grossman and Vivek Sarkar, Department of Computer Science, Rice University (2009).
3. "MITHRA: Multiple data Independent Tasks on a Heterogeneous Resource Architecture", Reza Farivar, Abhishek Verma, Ellick M. Chan and Roy H. Campbell, Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N Goodwin Ave, Urbana, IL 61801-2302.
4. "Tangling and scattering", Gregor Kiczales, John Lamping, Anurag Mendhekar, Chris Maeda, Cristina Lopes, Jean-Marc Loingtier and John Irwin, Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA 94304, USA.