The International Journal of Engineering And Science (IJES)
||Volume|| 2 ||Issue|| 01||Pages|| 250-253 ||2013||
ISSN: 2319 – 1813 ISBN: 2319 – 1805

  Advanced Trends of Heterogeneous Computing with CPU-GPU
               Integration: Comparative Study
                                         Ishan Rajani1, G Nanda Gopal2
          1. PG student at Department of Computer Engineering, Noble Group of Institution, Gujarat, India
          2. Asst. Prof. at Department of Computer Engineering, Noble Group of Institution, Gujarat, India



------------------------------------------------------------Abstract----------------------------------------------------
Over the last decades, parallel-distributed computing has become more popular than traditional centralized computing. In distributed computing, performance improvement is achieved by distributing workloads across the participating nodes. One of the most important factors for improving the performance of this type of system is reducing the average and the standard deviation of the job response time. Runtime insertion of new tasks of various sizes on different nodes is one of the main causes of load imbalance. Among the several recent concepts in parallel-distributed processing, CPU-GPU utilization is the focus here: how the idle portion of the CPU can be utilized for GPU processing and vice versa. This paper also introduces a heterogeneous computing workflow integration focused on the CPU and GPU. The proposed system exploits coarse-grained warp-level parallelism. We also review the architectures and frameworks with which developers are advancing the field of heterogeneous computing.

Index Terms: Heterogeneous Computing, Coarse-Grained Warp-Level Parallelism, Standard Deviation of Job Response Time
---------------------------------------------------------------------------------------------------------------------------------------
Date of Submission: 28th December, 2012                                     Date of Publication: 20th January, 2013
---------------------------------------------------------------------------------------------------------------------------------------

                 I.     Introduction
The goal of proper work utilization in emerging multi-core processing techniques is one of the most important aspects of massively parallel processing (MPP) systems. Emerging parallel-distributed technology has made its greatest progress in processing power, data storage capacity, circuit integration scale, and so on over the last several years, but it still does not satisfy the increasing demands of advanced multi-core processors. Since we are concerned with the participation of idle CPU resources in the activities of the GPU, GPU computing has emerged in recent years as a viable execution platform for throughput-oriented applications and codes at multiple granularities, using several architectures developed by different GMA manufacturers.

In general, GPUs began their era as independent units for code execution, but after that initial phase of the graphics unit, several developments have been made toward CPU-GPU integration. Various manufacturers have introduced thread-level parallelism in their graphics accelerators. To provide a powerful computing paradigm, General-Purpose computing on Graphics Processing Units (GPGPU) has been enhanced to manage hundreds of processing cores [6][13]. It helps to enhance the performance of operations mostly by focusing on the single-instruction, multiple-thread (SIMT) model, whose concept is very similar to the single-instruction, multiple-data (SIMD) architecture of Flynn's classification. For this, NVIDIA has described an architecture named the unified graphics computing architecture (UGCA); on the basis of that concept, and of the research done in recent decades, we propose an architecture for it together with its entire workflow.

                 II.     Related Work
With the advent of general-purpose graphics units, GMA manufacturers such as NVIDIA, Intel, ARM, and AMD have developed specific architectures for this integration. For example, AMD/ATI developed Brook+ for general-purpose GPU programming, and NVIDIA has developed the Compute Unified Device Architecture (CUDA), which provides a rich programming environment for general-purpose graphics devices.

The GPGPU programming paradigm of CUDA increases performance, but it still does not satisfy extreme-level scheduling of the tasks allocated to the GPU [12]. In such a situation, a better approach is to exploit a technique that utilizes idle GPU resources, which reduces the complete runtime of the processing. To upgrade resource utilization, other vendors have also recently released integrated solutions for such systems; some of the most popular are Intel's Sandy Bridge, AMD's Fusion APUs, and ARM's Mali.

This paper proposes a heterogeneous environment for CPU-GPU integration. CPU-GPU chip integration offers several advantages over traditional systems, such as reduced communication cost and proper memory resource utilization on both sides, because instead of relying only on separate memory spaces it also makes use of a shared-memory mechanism, and it exploits warp-level parallelism for faster and more efficient execution [5][7]. Warps can be defined as groups of threads; most warp structures group 32 threads into one warp [7]. Research on warp mechanisms for threading has also shown efficient control-flow execution on GPUs via dynamic warp formation and large-warp microarchitectures, in which a common program counter is used to execute all the threads of a single warp together [5]. The scheduler selects a warp that is ready to execute and issues the next instruction to the active threads of that warp.

                 III.     Design
As we have discussed, CUDA is used to establish a virtual kernel between the distinct CPU and GPU resources, as shown in the given architecture; these kernels use preloaded data, after decomposition, from memory repositories such as buffer memory or temporary cache memory. The key design problem is caused by the fact that the computation results of the CPU side are stored in the main memory, which is different from the device memory. To overcome this problem, our architecture preserves two different memory addresses in a pointer variable at a time. The design of the proposed UGCA architecture is shown in Figure 1.

UGCA mainly contains two phases in its workflow. The first phase includes the task decomposition module and the mapping of all decomposed tasks; it also manages the distribution ratio within the kernel configuration information. The second phase is designed to translate the Parallel Thread Execution (PTX) code into LLVM code. Here, PTX is the pseudo-assembly language used in CUDA. As shown in Figure 1, this second procedure translates the PTX code into the LLVM intermediate representation. On the GPU device, the runtime system passes the PTX code through the CUDA device driver, which means that the GPU executes the kernel in the original manner using PTX-JIT compilation.

Figure 1. Unified Graphics Computing Architecture

UGCA uses the provided PTX translator to convert PTX instructions into the LLVM instruction register, which is used for a kernel context. The proposed system has the advantage that GPU initialization, memory checking, and PTX-to-LLVM compilation are performed in parallel.
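To illustrate the dual-address bookkeeping described in the Design section, the sketch below pairs a host address and a device address for the same logical buffer. This is our own minimal illustration, not part of the UGCA specification; the names dual_buffer, dual_alloc, and dual_sync_to_device are hypothetical.

#include <cuda_runtime.h>
#include <stdlib.h>

/* One logical buffer, tracked by two addresses at a time:
   one in main (CPU) memory and one in GPU device memory. */
typedef struct {
    void  *host_ptr;    /* address in main memory        */
    void  *device_ptr;  /* address in the device memory  */
    size_t bytes;       /* size of the buffer in bytes   */
} dual_buffer;

static dual_buffer dual_alloc(size_t bytes)
{
    dual_buffer b;
    b.bytes    = bytes;
    b.host_ptr = malloc(bytes);            /* CPU-side storage */
    cudaMalloc(&b.device_ptr, bytes);      /* GPU-side storage */
    return b;
}

/* Copy CPU-side results into the device memory so that the GPU
   kernel sees the data produced on the host. */
static void dual_sync_to_device(dual_buffer *b)
{
    cudaMemcpy(b->device_ptr, b->host_ptr, b->bytes,
               cudaMemcpyHostToDevice);
}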
                 IV.     Workflow
In the first phase of the UGCA architecture, the nvcc compiler translates code written in CUDA into PTX; the graphics driver then invokes the CUDA compiler, which translates the PTX into LLVM code that can be run on a processing core. The most important functionality of the workload distribution module is that it generates two additional execution configurations, one for the CPU and one for the GPU. To perform the subsequent operations, the module delivers the generated execution configurations to the CPU and GPU loaders.

As shown in Figure 1, the top module holds all the configuration information of both the CPU and GPU kernels, together with the basic PTX assembly code that will later be converted into LLVM code as the intermediate representation. The input of the workload distribution module is the kernel configuration information, and its output specifies two different portions of the kernel space; the dimension of the grid is used by the workload distribution module. The proposed work distribution can split the kernel at the granularity of a thread block: it determines the number of thread blocks to be detached from the grid, considering the dimension of the grid and the workload distribution ratio.
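The sketch below shows one way such a split could be computed at thread-block granularity from a grid dimension and a distribution ratio. It is an assumption-laden illustration rather than the UGCA implementation; split_config, split_grid, and gpu_ratio are hypothetical names.

#include <cuda_runtime.h>

/* Result of splitting a kernel grid between the GPU and CPU loaders. */
typedef struct {
    dim3 gpu_grid;   /* thread blocks kept for the GPU kernel    */
    dim3 cpu_grid;   /* thread blocks detached for CPU execution */
} split_config;

/* Split the x-dimension of the grid according to the workload
   distribution ratio (fraction of thread blocks given to the GPU). */
static split_config split_grid(dim3 grid, float gpu_ratio)
{
    unsigned int gpu_blocks = (unsigned int)(grid.x * gpu_ratio);
    if (gpu_blocks > grid.x)
        gpu_blocks = grid.x;

    split_config cfg;
    cfg.gpu_grid = dim3(gpu_blocks, grid.y, grid.z);
    cfg.cpu_grid = dim3(grid.x - gpu_blocks, grid.y, grid.z);
    return cfg;
}

/* Example: a one-dimensional grid of 64 blocks with gpu_ratio = 0.75
   yields 48 blocks for the GPU portion and 16 for the CPU portion. */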

Since we are focusing on coarse warp-level parallelism, the number of threads sometimes does not match the preferred block size; in that case bounds tests should be inserted into the kernel [5], otherwise threads may access memory out of range. For declaring block boundaries, some predefined identifiers are available in the CUDA library.

__global__ void kernel(int n, int *x)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
    {
        // core programming
    }
}

The position of a thread within the grid can be determined from built-in variables of the CUDA library such as threadIdx and blockIdx. In the code above, threadIdx.x indicates the index associated with each thread within a block. The block size is normally chosen as a multiple of 32 threads, such as 192 or 256. Block sizes can likewise be declared in two, three, or more dimensions; the convention shown above is very useful for two-dimensional matrices [9]. Here blockIdx.x refers to the label associated with a block in the grid; it is similar to the thread index except that it refers to the number associated with the block. To elaborate the concept of blockDim.x, suppose we want to load an array of 10 values into a kernel using two blocks, each containing 5 threads. The problem with this kind of operation is that the thread index only runs from 0 to 4 within a single block. To overcome this, we can use a third built-in variable provided by CUDA, blockDim.x, which holds the size of the block; in our case blockDim.x = 5.
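To make the two-blocks-of-five example concrete, a minimal host-side sketch is shown below. It assumes the kernel defined above; the array contents are illustrative and error checking is omitted for brevity.

#include <cuda_runtime.h>

__global__ void kernel(int n, int *x);    /* the kernel shown above */

int main(void)
{
    const int n = 10;
    int h_x[10];                          /* host array of 10 values  */
    for (int i = 0; i < n; ++i)
        h_x[i] = i;

    int *d_x = 0;                         /* device copy of the array */
    cudaMalloc((void **)&d_x, n * sizeof(int));
    cudaMemcpy(d_x, h_x, n * sizeof(int), cudaMemcpyHostToDevice);

    /* Two blocks of five threads each: gridDim.x = 2, blockDim.x = 5,
       so idx = threadIdx.x + blockIdx.x * blockDim.x covers 0..9. */
    kernel<<<2, 5>>>(n, d_x);
    cudaDeviceSynchronize();

    cudaMemcpy(h_x, d_x, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    return 0;
}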

A program written for a CUDA application should allocate memory spaces in the device memory of the graphics hardware; these memory addresses are then used for the input and output data. In the UGCA model, data can be copied between the host memory and the dedicated memory on the device. For this, the host system should preserve pointer addresses pointing to the corresponding locations in the device memory. As noted in Section III, the key design problem is that the computation results of the CPU side are stored in the main memory, which is separate from the device memory.

                 V.     Comparison
As discussed for the architecture above, CUDA provides massive parallelism for GPU as well as non-GPU programming [1]. Let us now compare it with another parallel programming environment in terms of performance. The CUDA environment is available only on NVIDIA GPU devices, whereas the OpenCL standard may be implemented to run on any vendor's hardware, including traditional CPUs and GPUs [2][3]. One of the most significant advantages of OpenCL compared to CUDA is therefore its portability; where performance is the subject of comparison, however, CUDA generally achieves higher speedups through its thread-based warp-level parallelism [4]. At the same time, the multi-vendor compatibility of OpenCL introduces enough differences between the two that realizing a robust CUDA-to-OpenCL source-to-source translation poses significant challenges [2].
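To make the portability contrast concrete, the kernel from Section IV can be expressed as an OpenCL kernel as sketched below. This is only an illustrative counterpart; the host-side setup that OpenCL requires (platform, context, command queue, and program build) is omitted.

/* OpenCL C version of the CUDA kernel shown earlier. The global
   work-item id plays the role of threadIdx.x + blockIdx.x * blockDim.x. */
__kernel void kernel_body(const int n, __global int *x)
{
    int idx = get_global_id(0);
    if (idx < n)
    {
        // core programming
    }
}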
Figure 2 shows a comparison of CUDA and OpenCL for the execution of various algorithms, such as the fast Fourier transform, double-precision matrix multiplication [9], molecular dynamics, and simple 3D graphics programs.

Figure 2. GFLOPS required to execute different algorithms on CUDA and OpenCL

The performance of these environments in terms of delivered GFLOPS is analyzed from the results. The NVIDIA Tesla C2050 GPU computing processor performs around 515 GFLOPS in double-precision calculations, while the AMD FireStream 9270 peaks at 240 GFLOPS [14][19]. In single precision, the NVIDIA Tesla C2050 reaches around 1.03 TFLOPS and the AMD FireStream 9270 peaks at 1.2 TFLOPS.

                 VI.     Conclusion
This paper has introduced an architecture centered on coarse-grained warp-level parallelism, a concept that is widely adopted and that is also set to be carried forward into more advanced forms of warp-level parallelism. The proposed UGCA framework can be demonstrated, for example, on an NVIDIA GeForce 9400 GT device, and following the success of such graphics devices the architecture will be the focus of further enhancements. We believe that cooperative heterogeneous computing can be utilized in heterogeneous multi-core processors, which are expected to include even more GPU cores as well as CPU cores. As future work, we will first develop a dynamic control scheme for deciding the workload distribution ratio and the warp scheduling, and we also plan to design a more effective thread-block distribution technique that considers data access patterns and thread divergence.
References
[1]   Ziming Zhong, V. Rychkov, A. Lastovetsky, "Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications", Cluster Computing (CLUSTER), 2012 IEEE International Conference on, pp. 191-199, 24-28 Sept. 2012.
[2]   P. Sathre, M. Gardner, Wu-chun Feng, "Lost in Translation: Challenges in Automating CUDA-to-OpenCL Translation", Parallel Processing Workshops (ICPPW), 2012 41st International Conference on, pp. 89-96, 10-13 Sept. 2012.
[3]   T. R. W. Scogland, B. Rountree, Wu-chun Feng, B. R. de Supinski, "Heterogeneous Task Scheduling for Accelerated OpenMP", Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pp. 144-155, 21-25 May 2012.
[4]   T. Okuyama, F. Ino, K. Hagihara, "A Task Parallel Algorithm for Computing the Costs of All-Pairs Shortest Paths on the CUDA-Compatible GPU", Parallel and Distributed Processing with Applications (ISPA '08), International Symposium on, pp. 284-291, 2008.
[5]   M. Goli, M. T. Garba, H. Gonzalez-Velez, "Streaming Dynamic Coarse-Grained CPU/GPU Workloads with Heterogeneous Pipelines in FastFlow", High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on, pp. 445-452, 25-27 June 2012.
[6]   Manish Arora, "The Architecture and Evolution of CPU-GPU Systems for General Purpose Computing", University of California, San Diego.
[7]   Veynu Narasiman, Michael Shebanow, Chang Joo Lee, "Improving GPU Performance via Large Warps and Two-Level Warp Scheduling", The University of Texas at Austin.
[8]   Rafiqul Zaman Khan, Md Firoj Ali, "A comparative study on parallel programming tools in parallel distributed computing system: MPI and PVM", Aligarh Muslim University.
[9]   Ali Pinar, Cevdet Aykanat, "Sparse Matrix Decomposition with Optimal Load Balancing", TR06533 Bilkent, Ankara, Turkey.
[10]  NVIDIA, NVIDIA CUDA SDKs.
[11]  N. Bell, M. Garland, Cusp: "Generic parallel algorithms for sparse matrix and graph computations", 2010.
[12]  J. Nickolls, W. Dally, "The GPU computing era", IEEE Micro, 30(2):56-69, April 2010.
[13]  S. Huang, S. Xiao, W. Feng, "On the Energy Efficiency of Graphics Processing Units for Scientific Computing".
[14]  Andrew J. Page, "Adaptive Scheduling in Heterogeneous Distributed Computing Systems", National University of Ireland, Maynooth.
[15]  Getting Started with CUDA, http://blogs.nvidia.com/
[16]  NVIDIA GeForce Graphics Card, http://www.nvidia.com/object/geforce_family.html
[17]  NVIDIA Tesla GPGPU system, http://www.rzg.mpg.de/computing
[18]  The Supercomputing Blog, http://supercomputingblog.com/cuda/cuda-tutorial-1-getting-started/
[19]  Articles on advanced computing, http://www.slcentral.com/articles/

