Volume Graphics (2008), pp. 1–8
H.-C. Hege, D. Laidlaw (Editors)




Memory Efficient GPU-Based Ray Casting for Unstructured Volume Rendering

A. Maximo1, S. Ribeiro1, C. Bentes2, A. Oliveira1 and R. Farias1

1 COPPE - Federal University of Rio de Janeiro, Brazil
2 DESC - State University of Rio de Janeiro, Brazil




Abstract

Volume ray casting algorithms benefit greatly from the recent increase in GPU capabilities and power. In this paper, we present a novel memory efficient ray casting algorithm for unstructured grids, implemented entirely on the GPU using a recent off-the-shelf nVidia graphics card. Our approach builds upon a recent CPU ray casting algorithm, called VF-Ray, that considerably reduces the memory footprint while keeping good performance. In addition to implementing VF-Ray in graphics hardware, we also propose a restructuring of its data structures. As a result, our algorithm is much faster than the original software version, while needing only one half of its previous memory usage. Compared to other hardware-based ray casting algorithms, our approach uses between three and ten times less memory. These results made it possible for our GPU implementation to handle larger datasets than previous approaches.

Categories and Subject Descriptors (according to ACM CCS): I.3.3 [Computer Graphics]: Display algorithms; I.3.7 [Three-Dimensional Graphics and Realism]: Raytracing



1. Introduction

Direct volume rendering is an important technique for visualizing 3-D data volumes. It provides high quality images of the interior of the volume data, which are valuable in many different areas, such as medical image analysis, geological data visualization, weather simulation, and fluid dynamics.

Recently, the increasing capability and performance of commercial graphics hardware, such as ATI Radeon and nVidia GeForce, has brought hardware implementations of volume rendering to the attention of researchers, as they allow renderers to achieve interactive frame rates. Several hardware-based algorithms have been proposed in the literature [WKME03b, WKME03a, EC05, MMFE06]. Some approaches are based on the cell projection algorithm [WKME03a, MMFE06] and others on the ray casting algorithm [WKME03b, EC05].

In the ray casting approach for unstructured grids, rays are cast from the viewpoint through every pixel of the image and, by line/plane intersection, it is possible to determine all cells intersected throughout the ray traversal. Every pair of intersections is used to compute a contribution to the pixel color and opacity. In the cell projection approach, however, each polyhedral cell of the volumetric data is projected onto the screen, requiring the cells to first be sorted in visibility order. This is done to correctly determine the contribution of each cell to the color and opacity values of the final image pixels.

The great advantages of ray casting methods are that the computation for each pixel is independent of the others, and that the traversal of a ray through the mesh is guided by the connectivity of the cells, avoiding the need to sort the cells. The disadvantage is that cell connectivity has to be explicitly computed and kept in memory.

When these two approaches are compared in terms of performance, the cell projection approach performs better for smaller datasets [EC05, MMFE06]. For larger datasets, however, ray casting usually performs better than cell projection because it does not suffer from the high cost of sorting the data before the rendering process.
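To make the per-pixel accumulation concrete, the pair-of-intersections contributions described above are typically composited front-to-back into the final pixel color and opacity. The sketch below is our own illustration of that standard scheme, not code from the paper; the struct and function names are hypothetical:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One per-cell contribution along a ray, as produced from a pair of
// consecutive face intersections (colour and opacity already
// evaluated). Illustrative sketch only.
struct Rgba { float r, g, b, a; };

// Accumulate contributions front-to-back into the final pixel,
// stopping early once the pixel is (nearly) opaque.
Rgba compositeFrontToBack(const std::vector<Rgba>& contributions) {
    Rgba pixel{0.f, 0.f, 0.f, 0.f};
    for (const Rgba& c : contributions) {
        float t = 1.f - pixel.a;      // remaining transparency
        pixel.r += t * c.a * c.r;
        pixel.g += t * c.a * c.g;
        pixel.b += t * c.a * c.b;
        pixel.a += t * c.a;
        if (pixel.a > 0.99f) break;   // early ray termination
    }
    return pixel;
}
```

Because each pixel composites independently, this loop parallelizes trivially across pixels, which is the advantage of ray casting noted above.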
submitted to Volume Graphics (2008)

Although ray casting has been more efficient for large datasets, its recent implementations on the GPU present high memory requirements [WKME03b, EC05]. These high requirements make the GPU implementations unsuitable for handling large datasets, i.e. datasets made of several million tetrahedra. The graphics hardware memory is usually smaller than the CPU memory, imposing strong limitations on the storage of cell connectivity.

In this paper, we propose a novel hardware-based ray casting algorithm for tetrahedral meshes that tackles this problem. Our idea is to take advantage of a recent memory efficient software implementation of ray casting, called Visible Faces Driven Ray Casting [RMB∗07] – VF-Ray. This algorithm uses a reduced and non-redundant data structure that provides consistent and significant gains in memory usage, and explores ray coherence in order to keep in memory only the information of the most recent traversals. As VF-Ray used only 1/3 to 1/6 of the memory used by previous approaches, it can deal with larger datasets. However, as VF-Ray was implemented on the CPU, interactive performance has never been achieved.

In order to achieve interactive performance with VF-Ray, our hardware implementation uses General Purpose computation on GPU (GPGPU) [Har04] and the Compute Unified Device Architecture – CUDA [NVI07] from nVidia. In addition to the use of graphics hardware, we propose a restructuring of the original data structures of VF-Ray. As a result, our algorithm is much faster than the original software version, while using significantly less memory.

Our results show that, when our algorithm is compared to other recent hardware-based ray casting algorithms, it uses from 77% to 90% less memory. In this way, our algorithm could render datasets for which the others failed due to the lack of memory in the graphics card.

The remainder of this paper is organized as follows. In the next section we discuss related work. Section 3 briefly describes the VF-Ray algorithm. Section 4 describes our hardware ray casting algorithm and the improvements in its data structures. In section 5, we present the results of our most important experiments. Finally, in section 6, we present our conclusions and proposals for future work.

2. Related Work

Many volume rendering algorithms have been proposed throughout the years. Volume ray casting is the most popular one, and several implementations have been developed. Garrity [Gar90] proposed the use of cell connectivity to compute the ray traversal inside the volume. This approach was further improved by Bunyk et al. [BKS97], where all cells are broken into faces in a preprocessing step. Bunyk's algorithm determines the visible faces which, when projected on the screen, define the entry point of each pixel's ray into the volumetric data. This approach, however, has to keep some large auxiliary data structures in memory. In the work by Pina et al. [APCBRF07], two new data structures for the ray casting algorithm were proposed, achieving significant gains in memory usage. Unfortunately, their algorithms are slower than Bunyk's. The efforts in CPU-based approaches continued with the recent work of Ribeiro et al. [RMB∗07], introducing the VF-Ray algorithm. In their work, instead of keeping the information of all faces in memory like Bunyk, VF-Ray keeps only a limited set of faces, in order to bring down the memory footprint. Nonetheless, VF-Ray stores, in its cell and face data structures, some indices for faces, which we avoid by indexing the faces in a different way, as described in section 4.1. Also, they keep a list of cells for each vertex to properly determine the next cell when a ray hits a vertex or an edge. This data structure was replaced in our implementation by a simpler method. All these approaches, however, are time-consuming and could not achieve real-time performance.

In order to achieve interactive frame rates, graphics processing units (GPUs) have been used [WMFC02, WKME03a, EC05, BPaLDCS06, MMFE06]. In the last years, the major increase in the performance and programmability of GPUs has led to a significant increase in ray casting performance. Weiler et al. [WKME03a] implemented a hardware-based ray casting algorithm based on the work of Garrity. This algorithm was further extended by Espinha and Celes [EC05]. The original algorithm uses the pre-integration technique to evaluate the final color, while the Espinha and Celes approach uses the partial pre-integration technique [MA04]. Bernardon et al. [BPaLDCS06] propose a hardware-based algorithm based on Bunyk's implementation and depth peeling to render non-convex unstructured grids. On the other hand, Weiler et al. [WMKE04] propose the use of tetrahedral strips, in order to deal with the problem of storing the whole dataset in GPU memory.

In the scope of projection algorithms, there are some software implementations [MHC90, FMS00] that provide flexibility and easy parallelization. The GPU implementations of projection algorithms, however, are more prominent. One of the first hardware implementations of cell projection was proposed by Wylie et al. [WMFC02], and was based on the Projected Tetrahedra (PT) algorithm by Shirley and Tuchman [ST90]. The PT algorithm decomposes the tetrahedral cells into triangles, according to a projection class, and transmits them to the triangle rendering hardware. Wylie et al. extended the PT algorithm to implement all tetrahedral primitives directly on the GPU. Further, Weiler et al. [WKME03b] developed a combined ray casting and cell projection approach, where the projection is done in a view-independent way. More recently, Marroquim et al. [MMFE06] proposed another improvement of the PT algorithm that was implemented almost entirely on the GPU and could keep the whole model in texture memory, avoiding the bus transfer overhead. Their proposal also used partial pre-integration in order to improve the quality of the image. The work of Callahan et al. [CICS05] introduced the Hardware-Assisted Visibility Sorting (HAVS) algorithm, which simplifies the CPU-based processing and shifts much of the sorting burden to the GPU. More recently, Callahan et al. [CBPS06] used the HAVS algorithm to provide a client-server system that incrementally streams parts of the mesh from a server to a client.
Figure 1: Ray coherence for one visible face.

3. Visible Faces Driven Ray Casting

The Visible Faces Driven Ray Casting algorithm, called VF-Ray [RMB∗07], deals with unstructured grids composed of tetrahedral or hexahedral cells. Each tetrahedral cell is composed of four faces and each hexahedral cell of six faces. The algorithm is based on the following premise: the storage of the information about the faces of each cell is the key factor for memory consumption and execution time. This information is stored in a face data structure that includes the geometry and the face parameters, the constants of the plane equation defined by the face, which is the most memory-consuming data structure in ray casting.

The basic idea behind VF-Ray is to explore ray coherence, improving caching performance by keeping in memory only the face data of the traversals of a set of nearby rays. The nearby rays, which will be cast through neighboring pixels, are under the projection of a given visible face, as shown in Figure 1. This set of pixels is called the visible set.

The algorithm first creates a list of vertices, a list of cells and, for each vertex, a list of all cells incident to the vertex, called the Use_set of the vertex. The rendering process starts by determining the visible faces from the external faces of the volume. An external face is a face that belongs to only one cell and is not shared by any other cell [BKS97], and a visible face is an external face whose normal makes an angle greater than 90° with the viewing direction. The VF-Ray algorithm processes one visible face at a time, projecting it onto the screen and determining the visible set. For each pixel in this set, the ray traversal is done and each pair of intersections is computed. The internal faces are determined by the intersections and their data are stored in a temporary buffer, called computedFaces. Whenever a ray cast from the current visible set hits an internal face already stored in the buffer, the VF-Ray algorithm just reads the face data from the computedFaces buffer, without having to recompute the face parameters. The face data are used to evaluate the traversed distance d and the entry and exit scalar values, s0 and s1 respectively. Finally, the cell contribution to the pixel color and opacity is computed using d, s0, s1 and an illumination integral.

The VF-Ray algorithm explores ray coherence, using the visible face information to guide the creation and destruction of face data in memory. The faces intersected by neighboring rays tend to be the same (as shown in Figure 1) and their data are kept in the computedFaces buffer while the visible set is being computed. When all pixels in a visible set are processed, the VF-Ray algorithm clears the computedFaces buffer.

3.1. Handling Degeneracies

Degenerate situations may occur in ray casting methods when a ray hits an edge or a vertex. In the Bunyk approach, the algorithm would check for the next intersection only on the cells neighboring the current cell faces and, in this way, would not be able to continue the ray traversal, generating incorrect pixel colors. VF-Ray can handle these two situations by using the Use_set data structure. When a ray hits a vertex v, VF-Ray can find the next intersection by scanning the cells belonging to the Use_set of v. When the ray hits an edge v0v1, VF-Ray can find the next intersection by scanning the Use_sets of v0 and v1. In this way, VF-Ray guarantees that the image will be correctly generated.

4. Memory Efficient GPU-Based Ray Casting

Our hardware-based ray casting algorithm relies on the GPGPU implementation concept to parallelize the original VF-Ray algorithm. The idea behind this concept is to treat the graphics hardware as a coprocessor capable of performing high arithmetic intensity tasks, regardless of graphics-specific computations. The architecture used to implement our GPGPU ray casting algorithm was CUDA [NVI07], due to its simplicity and the possibility of exploiting all GPU resources. In this architecture, the implementation is done through kernels that run in multiple threads. The threads are grouped in blocks, meaning that threads inside the same block share the resources of one GPU multiprocessor. Each multiprocessor has a Single Instruction Multiple Data (SIMD) architecture, running the same kernel instructions while operating on different data.

We divide our algorithm in three steps and, as a result, in three different kernels (see Figure 2). In the first kernel, the external faces are read and the visible faces are determined. The second kernel computes the projection of each visible face, splitting it into pixels in screen space. Finally, the third kernel evaluates the ray casting algorithm over each visible face pixel previously computed.

The first kernel is important to reduce the computation performed by the next two kernels. In the first kernel, the computation is bounded by the number of external faces, while the cost of the second and third kernels is based on the number of visible faces, which is approximately half the number of external faces.
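To make the face-reuse idea of Section 3 concrete, the following is a hypothetical CPU-side sketch of processing one visible set: it counts how many faces actually have to be created, since faces already in the computedFaces buffer are simply reused by neighboring rays. Names and structure are ours, not the paper's code:

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical face record; in VF-Ray this would hold the plane
// parameters used to evaluate d, s0 and s1.
struct FaceData { float planeParams[4] = {}; };

// Process the rays of one visible set. faceIdsPerRay[i] lists the
// faces ray i intersects, in traversal order. Returns how many
// faces actually had to be created (plane systems solved); faces
// already in computedFaces are reused.
int processVisibleSet(const std::vector<std::vector<int>>& faceIdsPerRay) {
    std::map<int, FaceData> computedFaces;  // cleared per visible set
    int created = 0;
    for (const auto& ray : faceIdsPerRay) {
        for (int f : ray) {
            if (computedFaces.find(f) == computedFaces.end()) {
                ++created;                  // face creation
                computedFaces.emplace(f, FaceData{});
            }
            // ...read parameters, compute d, s0, s1, composite...
        }
    }
    return created;  // buffer is discarded when the set is done
}
```

Two neighboring rays intersecting faces {1, 2, 3} and {2, 3, 4} trigger only four face creations instead of six; it is this saving that the GPU kernels below preserve per thread.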




Figure 2: The three kernels of our algorithm.

4.1. Data structures

Before the three GPU kernels can be processed, we pre-compute the following data structures. From the list of vertices (vertList) and the list of tetrahedra (tetList) of the volumetric dataset, we compute the tetrahedra connectivity (conTet), which stores, for each tetrahedron face, the id ti of the tetrahedron sharing that face. Figure 3 shows the data structures used by our algorithm, illustrating the first tetrahedron (tetrahedron0). The vertList contains the x, y and z coordinates and the scalar value s of each vertex. The tetList contains the ids vi of the vertices composing each tetrahedron.

Figure 3: The basic data structures for our GPU-based ray casting algorithm.

To avoid building other data structures, we use the order of the vertices inside the tetList to determine each tetrahedron face. The vertices of face fi are vi, v(i+1)mod4 and v(i+2)mod4. For example, face f2 of some tetrahedron ti is composed of the vertices v2, v3 and v0 of ti, as can be seen in Figure 3. In addition to these data structures, our algorithm uses the external faces list (extFaces), pre-computed in the same way as Bunyk.

The original VF-Ray algorithm stores three vertex indices inside its face data structure. In our algorithm, since we target the limited memory of the GPU, we store only two indices for each face created. These indices are the tetrahedron ti and the face fi, which are sufficient to identify the vertices of each face, as explained above.

4.2. Find Visible Faces kernel

The first kernel reads the external faces from texture memory and computes the parameters, i.e. plane equation coefficients, for each one. This computation is called face creation and targets the solution of two 3×3 linear systems: one to interpolate the z coordinate, and another to interpolate the scalar value s. In this kernel, only the z coordinate interpolation is done for each external face; once the face is checked as visible, the s value interpolation is done and both interpolation parameters are stored in the list of visible faces (visFaces).

The visibility test for an external face is done by comparing the z coordinate of the fourth vertex of the tetrahedron, i.e. the vertex that does not belong to the face, with the z coordinate of its projection on the external face. The fourth vertex projection is done using the z coordinate interpolation parameters computed for the external face. The face is visible if the projected z is smaller than the z coordinate of the fourth vertex.

Only the visible face parameters are written to CUDA global memory, that is, the z and s interpolation parameters are stored in the visFaces list.

The first kernel performs the visible face test, i.e. back-face culling, and face creation in each thread, using the maximum number of threads available per block. The number of external faces is fixed for each volume dataset, and thus so is the number of threads across all blocks in this kernel. The computation time and memory footprint spent in the first kernel correspond to less than 5% of the total (time and memory). Nevertheless, the main role of this kernel is to reduce the number of threads that will be used by the next two kernels. From now on, they will run over the visFaces list instead of the extFaces list; the former is about half the size of the latter.

4.3. Project Visible Faces kernel

The second kernel reads the coordinates of the vertices of each visible face from the vertList in texture memory. The vertex ids used to access the vertList are read from the tetList in global memory. Note that texture components cannot be indexed directly, forcing us to store the tetList in global memory, instead of texture memory, in order to avoid unnecessary branches.
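The mod-4 indexing rule of Section 4.1, by which the pair (ti, fi) alone identifies a face's three vertices, can be written out as a small helper. This is a sketch; the function name and types are ours:

```cpp
#include <array>
#include <cassert>

// Given the four vertex ids of tetrahedron t (as stored in tetList),
// face f's vertices are v_f, v_(f+1)mod4 and v_(f+2)mod4, so the
// pair (t, f) identifies a face with no stored vertex indices.
std::array<int, 3> faceVertices(const std::array<int, 4>& tetVerts, int f) {
    return { tetVerts[f % 4], tetVerts[(f + 1) % 4], tetVerts[(f + 2) % 4] };
}
```

For a tetrahedron with vertex ids (v0, v1, v2, v3), face f2 yields (v2, v3, v0), matching the example in the text; this is what lets the face data structure hold two indices instead of three.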


The vertex coordinates are used to project the visible faces onto screen space. The bounding box of the projection is computed and stored in the projected visible faces list (projVisFaces), which is written to global memory to be used by the next kernel.

The second kernel uses one thread for each visible face computation. As in the first kernel, each thread uses few resources of a block. Therefore, the number of threads per block can be maximized, and the computation time and memory footprint are negligible.

4.4. Ray Casting kernel

The third kernel reads the visFaces list, computed in the first kernel, and the projVisFaces list, computed in the second kernel. The visible face pixels are determined using the bounding box of the projection and a point-inside-triangle test that employs only cross product operations. For each visible face pixel found, the ray casting is triggered using the visible face as the first entry face. All the pixels use the face parameters stored in the visFaces list. From this point, the algorithm traverses the mesh, finding each next face and computing the ray integration using the same integration method proposed by Bunyk.

Just like VF-Ray, we store the next faces in the 1D computedFaces buffer. However, our ray casting algorithm is designed to run in parallel, taking advantage of the ray coherence inside each thread's computation. In order to avoid dynamic allocation, the computedFaces buffer is allocated in CUDA local memory for each thread with a fixed size, and it is indexed by a hash table. The hash index is the z coordinate of the face centroid; refer to Figure 4 for more details. The first buffer position is associated with the minimum z coordinate (minZ) of the volume dataset, and the last position with the maximum z coordinate (maxZ). Figure 4 presents two neighboring rays processed by the same thread in sequence: ray 1 creates four internal faces, while ray 2 reads them from the computedFaces buffer.

Furthermore, the computedFaces buffer stores the tetrahedron id ti and the face id fi besides the face parameters. When a hit occurs, those indices are used to ensure that the face to be read is the hit face. If it is not, a collision has occurred; in this case, we create the new face, use it, and replace the old one in the buffer. The original VF-Ray algorithm spends more memory, storing four face indices for each tetrahedron to point into the computedFaces buffer.

The third kernel divides the CUDA grid in blocks (see Figure 5), associating one block to one visible face. Each block is, in turn, divided into threads that compute a small set of pixels. The threads with pixels outside the visible face, or with fewer pixels inside it, finish their computation faster than the other threads, but are not wasted: these threads are warped to other blocks by the graphics hardware driver, in order to continue the kernel computation. Unlike the first two kernels, the amount of computation performed in the ray traversal consumes much more resources of a block, making the computation of all visible face pixels by one thread unfeasible.

Figure 5: The CUDA grid scheme used in our implementation. Each block inside the grid computes one visible face, while each thread inside the block computes a set of pixels.

4.5. Handling Degeneracies

We treat the degenerate cases differently from the original VF-Ray. While they use the Use_set list to correctly find the next tetrahedron for the traversal, we perturb the ray to get rid of the degenerate case; the next iteration then continues with the same ray direction without this displacement. Our approach obtains approximate results, but saves the memory spent to keep the Use_set list.
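The hash-indexed computedFaces buffer described in Section 4.4 might look like the sketch below: a fixed-size per-thread array whose slot is derived from the z coordinate of the face centroid, with the stored (ti, fi) pair used to detect collisions. The slot count and uniform z subdivision are our assumptions, not values from the paper:

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Fixed slot count is an assumption for illustration.
constexpr int kSlots = 64;

struct CachedFace {
    int tet = -1, face = -1;   // identity of the cached face
    float params[4] = {};      // face parameters (placeholder)
};

struct FaceCache {
    float minZ = 0.f, maxZ = 1.f;          // z range of the dataset
    std::array<CachedFace, kSlots> slots;  // per-thread buffer

    // Map the first slot to minZ and the last slot to maxZ.
    int slotFor(float centroidZ) const {
        float t = (centroidZ - minZ) / (maxZ - minZ);
        int i = static_cast<int>(t * (kSlots - 1));
        return std::min(std::max(i, 0), kSlots - 1);
    }
    // True on a hit; on a miss or collision the caller recomputes
    // the face and overwrites the slot via store().
    bool lookup(int tet, int face, float centroidZ, CachedFace& out) const {
        const CachedFace& s = slots[slotFor(centroidZ)];
        if (s.tet == tet && s.face == face) { out = s; return true; }
        return false;
    }
    void store(const CachedFace& f, float centroidZ) {
        slots[slotFor(centroidZ)] = f;
    }
};
```

Because the buffer lives in per-thread local memory with a fixed size, no dynamic allocation is needed; a collision simply evicts the older face, trading a recomputation for bounded memory.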

Figure 4: The computedFaces buffer stores previous hits (red circles) to be read in future hits (green crosses). The centroid is used to hash the buffer.

5. Results

In this section we present the results on the performance and memory usage of our hardware-based ray casting algorithm.


The algorithm was written in CUDA, an extension of a subset of the C programming language. Our experiments were conducted on an Intel Core 2 Duo 6400 with 2.13 GHz per processor and 2 GB RAM. We used a nVidia GeForce 8800 Ultra graphics card with 768 MB.

   Dataset      # Verts    # Faces    # Boundary    # Cells
   Blunt          41 K      381 K         13 K       187 K
   Oxygen        109 K        1 M         27 K       513 K
   F16           1.1 M     12.9 M        309 K       6.3 M
   Fighter+      1.9 M     22.1 M        334 K      11.2 M

                     Table 1: Dataset sizes.

   The evaluation of our algorithm, which we call VF-Ray-GPU, was done against recent hardware-based algorithms. We also compare VF-Ray-GPU with the original version of VF-Ray, which runs solely on the CPU. The idea behind this last comparison is twofold: first, to measure how much the use of graphics hardware can improve the performance over a CPU implementation; and second, to assess the gain obtained by using different and more efficient data structures. Before evaluating our results, we give a brief description of the hardware-based algorithms used as baselines for our comparisons.


5.1. Baselines

The following algorithms were used for comparison:

   VICP: The View-Independent Cell Projection algorithm (VICP) was proposed by Weiler et al. [WKME03b]. This is a hybrid version of a cell projection algorithm. The algorithm projects each cell of the dataset, but the integrations inside the cells are performed exactly in the same way as ray casting does.

   HARC: The Hardware-Based Ray Casting algorithm (HARC) was proposed by Weiler et al. [WKME03a] and is based on Bunyk et al. [BKS97]. The idea is to store the adjacency information in texture and compute the contribution of each cell by accessing a 3D texture indexed by the pre-integration results. The intersections are computed using face normals, and the integration uses the pre-computed gradient of each cell.

   HARC-Partial: This algorithm [EC05] is an extension of the HARC algorithm with partial pre-integration. The algorithm also employs an alternative data structure, in order to reduce the memory usage.

Figure 6: Datasets rendered at 512² resolution: (a) Blunt Fin, (b) Oxygen Post, (c) Fighter, and (d) F16.

5.2. Workload

We used four different datasets in our experiments: Blunt Fin, Oxygen Post and Fighter from NASA's NAS website, and F16 from EADS. Screenshots of the datasets are shown in Figure 6. Table 1 shows the number of vertices, faces, boundary faces (i.e. external faces), and cells for each dataset. As we can observe in this table, we used two small datasets, Blunt and Oxygen, and two large datasets, Fighter and F16. The Fighter dataset used in our experiments is, in fact, a larger and more precise version of the original NASA dataset, which we named Fighter+. Our results are the mean values for a complete rotation of each dataset. For the small datasets, we used a viewport of 512² pixels, and for the large datasets, we used a higher resolution, 1024² pixels, in order to take into account more cells of these volumes.

5.3. Small Datasets

In this section we evaluate the performance and memory usage of VF-Ray-GPU against the other hardware-based implementations. We used only the small datasets in this evaluation, since the other algorithms could not handle the large datasets.

   Table 2 shows the time and memory usage results for the baseline rendering algorithms and VF-Ray-GPU, for Blunt and Oxygen. The two columns for each dataset summarize the total memory footprint (in kilobytes) and timing (in milliseconds).

   Table 3 shows some memory usage aspects of the algorithms: the number of bytes per tetrahedron needed to store the volume data (Bytes/Tet); the number of bytes per pixel, which depends on the final image (Bytes/Pixel); and the total number of megabytes spent on the pre-integration or partial pre-integration technique (Pre-Int.). These numbers indicate how the memory usage grows with respect to the dataset size, the final image precision, and the use of a pre-integration table lookup.
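As a rough consistency check (a sketch using the per-element costs of Table 3 and the rounded cell counts of Table 1, so all numbers are approximate), an algorithm's footprint can be estimated as cells × Bytes/Tet + pixels × Bytes/Pixel + the pre-integration table:

```python
# Per-element costs from Table 3 (bytes; 0 where the table shows a dash).
COSTS = {
    "HARC":         (160, 96, 16 * 1024 * 1024),
    "HARC-Partial":  (96, 96,  1 * 1024 * 1024),
    "VF-Ray-GPU":    (38,  0,  0),
}

def footprint_kb(algorithm, cells, pixels):
    """Estimated footprint in KB: volume data + per-pixel data + pre-integration."""
    per_tet, per_pixel, pre_int = COSTS[algorithm]
    return (cells * per_tet + pixels * per_pixel + pre_int) / 1024.0

# Blunt Fin: ~187 K cells (Table 1), rendered at 512 x 512.
blunt_harc = footprint_kb("HARC", 187_000, 512 * 512)
blunt_vfray = footprint_kb("VF-Ray-GPU", 187_000, 512 * 512)
```

With the rounded counts, both estimates land within a few percent of the footprints measured in Table 2 (72,267 KB for HARC and 7,029 KB for VF-Ray-GPU on Blunt).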


                                 Blunt Fin                  Oxygen Post
   Algorithm             Memory (KB)   Time (ms)     Memory (KB)   Time (ms)
   VICP                    118,524        190          249,928        546
   HARC                     72,267         18          123,245         33
   HARC-Partial             22,636         32           50,248         51
   VF-Ray-GPU                7,029        186           19,494        370

Table 2: Memory and timing results for VF-Ray-GPU and other hardware-based algorithms.
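The relative memory savings discussed next can be recomputed directly from Table 2 (a sketch; the values are copied from the table):

```python
# Memory footprints in KB, from Table 2.
MEM = {
    "Blunt":  {"HARC": 72_267,  "HARC-Partial": 22_636, "VF-Ray-GPU": 7_029},
    "Oxygen": {"HARC": 123_245, "HARC-Partial": 50_248, "VF-Ray-GPU": 19_494},
}

def ratio(dataset, other):
    """VF-Ray-GPU memory as a fraction of another algorithm's memory."""
    return MEM[dataset]["VF-Ray-GPU"] / MEM[dataset][other]

# Roughly a tenth of HARC's memory and a third of HARC-Partial's.
harc_ratios = [ratio(d, "HARC") for d in MEM]             # ~0.10, ~0.16
partial_ratios = [ratio(d, "HARC-Partial") for d in MEM]  # ~0.31, ~0.39
```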



   Algorithm        Bytes/Tet    Bytes/Pixel    Pre-Int.
   VICP                456            –            16
   HARC                160           96            16
   HARC-Partial         96           96             1
   VF-Ray-GPU           38            –             –

Table 3: Hardware-based algorithms memory usage.

   When we compare VF-Ray-GPU with the other ray casting algorithms, HARC and HARC with partial pre-integration (HARC-Partial), we observe that our algorithm uses only about 33% of the memory used by HARC with partial pre-integration, and about 12% of the memory used by the original HARC algorithm. However, VF-Ray-GPU is slower than both hardware ray casting implementations. This is due mainly to the ray traversal technique.

   The main difference between the two hardware ray casting algorithms and VF-Ray-GPU resides in the ray traversal and integration techniques. In HARC, the ray traversal is done by means of pre-computed normals per face and gradients per tetrahedron, while ray integration is performed by using the pre-integration (or partial pre-integration) method to evaluate each cell's contribution. The VF-Ray-GPU approach replaces the normals and gradients with computations over the face's parameters for the ray traversal, and uses the volume rendering integral, which is computed on-the-fly. The use of faces slows down VF-Ray-GPU, but, on the other hand, it reduces the memory required per tetrahedron by more than 50%, compared with the two HARC algorithms. This reduction in the memory footprint is fundamental when dealing with large volume datasets.

   When we compare VF-Ray-GPU with the hybrid algorithm, we observe that VICP uses 16 times more memory and is slower than VF-Ray-GPU for Oxygen. VICP, being a hybrid algorithm, uses all the data necessary for ray casting and also employs part of the cell projection calculation, which makes it the most memory-consuming algorithm.

5.4. Large Datasets

In this section we evaluate the performance and memory usage of VF-Ray-GPU when generating images with a resolution of 1024², for the large datasets. One of the most important results of our work is that, of all hardware-based ray casting algorithms, VF-Ray-GPU was the only one that could render F16 and Fighter+. The 768 MB of the GeForce 8800 Ultra was not enough for the other ray casting algorithms.

   In Table 4 we compare VF-Ray-GPU with VF-Ray. The table shows the memory footprint (in megabytes) and the total execution time (in milliseconds) for the two large datasets rendered by VF-Ray-GPU and the original VF-Ray algorithm (CPU). As we can observe in this table, in comparison with the original VF-Ray, our algorithm runs about 4 times faster and uses about 50% less memory. The performance gain is due to the parallel nature of our ray casting algorithm, while the data restructuring improves the memory usage.

                  Memory (MB)         Time (ms)
   Datasets      CPU      GPU        CPU      GPU
   Fighter+      876      426      16263     3081
   F16           499      239       2515      804

Table 4: Memory footprint and timing comparison between VF-Ray-CPU and VF-Ray-GPU algorithms to generate images with a resolution of 1024².

6. Conclusions

In this work, we have proposed a novel hardware-based ray casting algorithm, based on a memory-aware software implementation of ray casting, VF-Ray. The main contribution of our proposal is to provide ray casting in graphics hardware with low memory usage, in order to handle large datasets. Besides the low memory consumption, our algorithm also handles some degenerate cases that affect hardware implementations based on previous ray casting approaches, like Bunyk's approach.

   We compared our algorithm with other recent hardware-based volume rendering algorithms, based on the ray casting paradigm, and with a hybrid version mixing ray casting and cell projection. For the smaller datasets, our algorithm is slower than the ray casting approaches, but uses from 77% to 90% less memory. When compared to the hybrid approach, our algorithm is faster and uses about 95% less memory.

   For the larger datasets, our algorithm could render datasets for which the others failed, due to the amount of memory available in the graphics card. Our results show a linear increase


in the execution time with the increase in the dataset size, indicating that the algorithm is scalable to larger datasets. In addition, our algorithm was faster than the original VF-Ray, while using significantly less memory.

   As future work, we intend to use the thread synchronization scheme of CUDA to handle non-convex meshes. Another line of future work is to use the partial pre-integration technique, in order to improve rendering quality.

7. Acknowledgments

We would like to thank Claudio Silva for providing us the Fighter and F16 datasets (Fighter from Neely and Batina – NASA, and F16 from Udo Tremel – EADS-Military). We also acknowledge the grants of the first and second authors, provided by the Brazilian agencies CNPq and CAPES.

References

[APCBRF07] Pina A., Bentes C., Farias R.: Memory efficient and robust software implementation of the raycast algorithm. In WSCG'07: The 15th Int. Conf. in Central Europe on Computer Graphics, Visualization and Computer Vision (2007).

[BKS97] Bunyk P., Kaufman A. E., Silva C. T.: Simple, fast, and robust ray casting of irregular grids. In Dagstuhl '97, Scientific Visualization (Washington, DC, USA, 1997), IEEE Computer Society, pp. 30–36.

[BPaLDCS06] Bernardon F. F., Pagot C. A., Comba J. L. D., Silva C. T.: GPU-based Tiled Ray Casting using Depth Peeling. Journal of Graphics Tools 11, 3 (2006), 23–29.

[CBPS06] Callahan S. P., Bavoil L., Pascucci V., Silva C. T.: Progressive volume rendering of large unstructured grids. IEEE Transactions on Visualization and Computer Graphics 12, 5 (2006), 1307–1314.

[CICS05] Callahan S. P., Ikits M., Comba J. L., Silva C. T.: Hardware-assisted visibility sorting for unstructured volume rendering. IEEE Transactions on Visualization and Computer Graphics 11, 3 (2005), 285–295.

[EC05] Espinha R., Celes W.: High-quality hardware-based ray-casting volume rendering using partial pre-integration. In SIBGRAPI '05: Proceedings of the XVIII Brazilian Symposium on Computer Graphics and Image Processing (2005), IEEE Computer Society, p. 273.

[FMS00] Farias R., Mitchell J. S. B., Silva C. T.: ZSWEEP: an efficient and exact projection algorithm for unstructured volume rendering. In VVS'00: Proceedings of the 2000 IEEE Symposium on Volume visualization (New York, NY, USA, 2000), ACM Press, pp. 91–99.

[Gar90] Garrity M. P.: Raytracing irregular volume data. In VVS'90: Proceedings of the 1990 workshop on Volume visualization (New York, NY, USA, 1990), ACM Press, pp. 35–40.

[Har04] Harris M.: GPGPU – General-Purpose computation using Graphics Hardware, 2004. http://www.gpgpu.org/.

[MA04] Moreland K., Angel E.: A fast high accuracy volume renderer for unstructured data. In VVS '04: Proceedings of the 2004 IEEE Symposium on Volume visualization and graphics (Piscataway, NJ, USA, 2004), IEEE Press, pp. 13–22.

[MHC90] Max N., Hanrahan P., Crawfis R.: Area and volume coherence for efficient visualization of 3d scalar functions. In VVS'90: Proceedings of the 1990 workshop on Volume visualization (New York, NY, USA, 1990), ACM Press, pp. 27–33.

[MMFE06] Marroquim R., Maximo A., Farias R., Esperança C.: GPU-Based Cell Projection for Interactive Volume Rendering. In SIBGRAPI '06: Proceedings of the XIX Brazilian Symposium on Computer Graphics and Image Processing (Los Alamitos, CA, USA, 2006), IEEE Computer Society, pp. 147–154.

[NVI07] NVIDIA: CUDA Environment – Compute Unified Device Architecture, 2007. http://www.nvidia.com/object/cuda_home.html.

[RMB*07] Ribeiro S., Maximo A., Bentes C., Oliveira A., Farias R.: Memory-Aware and Efficient Ray-Casting Algorithm. In SIBGRAPI '07: Proceedings of the XX Brazilian Symposium on Computer Graphics and Image Processing (Los Alamitos, CA, USA, 2007), IEEE Computer Society, pp. 147–154.

[ST90] Shirley P., Tuchman A. A.: Polygonal approximation to direct scalar volume rendering. In Proceedings San Diego Workshop on Volume Visualization, Computer Graphics (1990), vol. 24(5), pp. 63–70.

[WKME03a] Weiler M., Kraus M., Merz M., Ertl T.: Hardware-Based Ray Casting for Tetrahedral Meshes. In Proceedings of the 14th IEEE conference on Visualization '03 (2003), pp. 333–340.

[WKME03b] Weiler M., Kraus M., Merz M., Ertl T.: Hardware-based view-independent cell projection. IEEE Transactions on Visualization and Computer Graphics 9, 2 (2003), 163–175.

[WMFC02] Wylie B., Moreland K., Fisk L. A., Crossno P.: Tetrahedral projection using vertex shaders. In VVS'02: Proceedings of the 2002 IEEE Symposium on Volume visualization and graphics (Piscataway, NJ, USA, 2002), IEEE Press, pp. 7–12.

[WMKE04] Weiler M., Mallon P. N., Kraus M., Ertl T.: Texture-encoded tetrahedral strips. In VV '04: Proceedings of the 2004 IEEE Symposium on Volume Visualization and Graphics (Washington, DC, USA, 2004), IEEE Computer Society, pp. 71–78.


An Investigation towards Effectiveness in Image Enhancement Process in MPSoC An Investigation towards Effectiveness in Image Enhancement Process in MPSoC
An Investigation towards Effectiveness in Image Enhancement Process in MPSoC
 
FPGA Implementation of High Speed AMBA Bus Architecture for Image Transmissio...
FPGA Implementation of High Speed AMBA Bus Architecture for Image Transmissio...FPGA Implementation of High Speed AMBA Bus Architecture for Image Transmissio...
FPGA Implementation of High Speed AMBA Bus Architecture for Image Transmissio...
 
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
 
Coarse grained hybrid reconfigurable architecture with noc router for variabl...
Coarse grained hybrid reconfigurable architecture with noc router for variabl...Coarse grained hybrid reconfigurable architecture with noc router for variabl...
Coarse grained hybrid reconfigurable architecture with noc router for variabl...
 
Coarse grained hybrid reconfigurable architecture
Coarse grained hybrid reconfigurable architectureCoarse grained hybrid reconfigurable architecture
Coarse grained hybrid reconfigurable architecture
 
Coarse grained hybrid reconfigurable architecture with no c router
Coarse grained hybrid reconfigurable architecture with no c routerCoarse grained hybrid reconfigurable architecture with no c router
Coarse grained hybrid reconfigurable architecture with no c router
 
A Parallel Algorithm Template for Updating Single-Source Shortest Paths in La...
A Parallel Algorithm Template for Updating Single-Source Shortest Paths in La...A Parallel Algorithm Template for Updating Single-Source Shortest Paths in La...
A Parallel Algorithm Template for Updating Single-Source Shortest Paths in La...
 
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction NetworkEDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
 
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction NetworkEDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
 

Último

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 

Último (20)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 

Cudaray

1. Introduction

Direct volume rendering is an important technique for visualizing 3-D data volumes. It provides high quality images of the interior of the volume data, which have valuable contributions in many different areas, such as medical image analysis, geological data visualization, weather simulations, or fluid dynamics interactions.

Recently, the increasing capability and performance of commercial graphics hardware, such as the ATI Radeon and nVidia GeForce, has brought hardware implementations of volume rendering to the attention of researchers, as they allow renderers to achieve interactive frame rates. Several hardware-based algorithms have been proposed in the literature [WKME03b, WKME03a, EC05, MMFE06]. Some approaches are based on the cell projection algorithm [WKME03a, MMFE06] and others on the ray casting algorithm [WKME03b, EC05].

In the ray casting approach for unstructured grids, rays are cast from the viewpoint through every pixel of the image, and by line/plane intersection it is possible to determine all cells intersected throughout the ray traversal. Every pair of intersections is used to compute a contribution to the pixel color and opacity. In the cell projection approach, however, each polyhedral cell of the volumetric data is projected onto the screen, requiring the cells to be first sorted in visibility order. This is done in order to correctly determine the contribution of each cell to the color and opacity values of the final image pixels.

The great advantages of ray casting methods are that the computation for each pixel is independent of the others, and that the traversal of a ray throughout the mesh is guided by the connectivity of the cells, avoiding the need to sort the cells. The disadvantage is that cell connectivity has to be explicitly computed and kept in memory.

When these two approaches are compared in terms of performance, the cell projection approach performs better for smaller datasets [EC05, MMFE06]. For larger datasets, however, ray casting usually performs better than cell projection, because it does not suffer from the high cost of sorting the data before the rendering process.

submitted to Volume Graphics (2008)
2 A. Maximo & S. Ribeiro & C. Bentes & A. Oliveira & R. Farias / Memory Efficient GPU Ray Casting

Although ray casting has been more efficient for large datasets, its recent implementations on the GPU present high memory requirements [WKME03b, EC05]. These high requirements make the GPU implementations unsuitable for handling large datasets, i.e. those made of several million tetrahedra. The graphics hardware memory is usually smaller than the CPU memory, imposing strong limitations on the storage of cell connectivity.

In this paper, we propose a novel hardware-based ray casting algorithm for tetrahedral meshes that tackles this problem. Our idea is to take advantage of a recent memory-efficient software implementation of ray casting, called Visible Faces Driven Ray Casting [RMB∗07] – VF-Ray. This algorithm uses a reduced and non-redundant data structure that provides consistent and significant gains in memory usage, and explores ray coherence in order to maintain in memory only information about the most recent traversals. As VF-Ray used only from 1/3 to 1/6 of the memory used by previous approaches, it can deal with larger datasets. However, as VF-Ray was implemented on the CPU, interactive performance has never been achieved.

In order to achieve interactive performance with VF-Ray, our hardware implementation uses General Purpose computation on GPU (GPGPU) [Har04] and the Compute Unified Device Architecture – CUDA [NVI07] from nVidia. In addition to the use of graphics hardware, we propose a restructuring of the original data structures of VF-Ray. As a result, our algorithm is much faster than the original software version, while using significantly less memory.

Our results show that, when our algorithm is compared to other recent hardware-based ray casting algorithms, it uses from 77% to 90% less memory. In this way, our algorithm could render datasets for which the others failed due to the lack of memory on the graphics card.

The remainder of this paper is organized as follows. In the next section we discuss related work. Section 3 briefly describes the VF-Ray algorithm. Section 4 describes our hardware ray casting algorithm and the improvements in its data structures. In section 5, we present the results of our most important experiments. Finally, in section 6, we present our conclusions and proposals for future work.

2. Related Work

Many volume rendering algorithms have been proposed throughout the years. Volume ray casting is the most popular one, and several implementations have been developed. Garrity [Gar90] proposes the use of cell connectivity to compute the ray traversal inside the volume. This approach was further improved by Bunyk et al. [BKS97], where all cells are broken into faces in a preprocessing step. Bunyk's algorithm determines the visible faces which, when projected on the screen, define the entry point of each pixel's ray cast through the volumetric data. This approach, however, has to keep some large auxiliary data structures in memory. In the work by Pina et al. [APCBRF07], two new data structures for the ray casting algorithm were proposed that achieved significant gains in memory usage. Unfortunately, their algorithms are slower than Bunyk's. The efforts in CPU-based approaches continue with the recent work of Ribeiro et al. [RMB∗07] introducing the VF-Ray algorithm. In their work, instead of keeping the information of all faces in memory like Bunyk, VF-Ray keeps only a limited set of faces, in order to bring down the memory footprint. Nonetheless, VF-Ray stores, in its cell and face data structures, some indices for faces, which we avoid by indexing the faces in a different way, as described in section 4.1. Also, they keep a list of cells for each vertex to properly determine the next cell when a ray hits a vertex or an edge. This data structure was replaced in our implementation by a simpler method, in order to avoid this problem. All these approaches, however, are time-consuming and could not achieve real-time performance.

In order to achieve interactive frame rates, graphics processing units (GPUs) have been used [WMFC02, WKME03a, EC05, BPaLDCS06, MMFE06]. In the last years, the major increase in the performance and programmability of GPUs has led to a significant increase in ray casting performance. Weiler et al. [WKME03a] implemented a hardware-based ray casting algorithm that is based on the work of Garrity. This algorithm was further extended by Espinha and Celes [EC05]. The original algorithm uses the pre-integration technique to evaluate the final color, while the Espinha and Celes approach uses the partial pre-integration technique [MA04]. Bernardon et al. [BPaLDCS06] propose a hardware-based algorithm based on Bunyk's implementation and depth peeling to render non-convex unstructured grids. On the other hand, Weiler et al. [WMKE04] propose the use of tetrahedral strips, in order to deal with the problem of storing the whole dataset in GPU memory.

In the scope of projection algorithms, there are some software implementations [MHC90, FMS00] that provide flexibility and easy parallelization. The GPU implementations of projection algorithms, however, are more notable. One of the first hardware implementations of cell projection was proposed by Wylie et al. [WMFC02], and was based on the Projected Tetrahedra (PT) algorithm by Shirley and Tuchman [ST90]. The PT algorithm decomposes the tetrahedral cells into triangles, according to a projection class, and transmits them to the triangle rendering hardware. Wylie et al. extended the PT algorithm to implement all tetrahedral primitives directly on the GPU. Further, Weiler et al. [WKME03b] developed a combined ray casting and cell projection approach, where the projection is done in a view-independent way. More recently, Marroquim et al. [MMFE06] proposed another improvement to the PT algorithm, which was implemented almost entirely on the GPU and could keep the whole model in texture memory, avoiding the bus transfer overhead. Their proposal also used partial pre-integration in order to improve the quality of the image. The work of Callahan et al. [CICS05] introduced the Hardware-Assisted Visibility Sorting (HAVS) algorithm, which simplifies the CPU-based processing and shifts much of the sorting burden to the GPU. More recently, Callahan et al. [CBPS06] used the HAVS algorithm to provide a client-server system that incrementally streams parts of the mesh from a server to a client.

3. Visible Faces Driven Ray Casting

The Visible Faces Driven Ray Casting algorithm, called VF-Ray [RMB∗07], deals with unstructured grids composed of tetrahedral or hexahedral cells. Each tetrahedral cell is composed of four faces and each hexahedral cell is composed of six faces. The algorithm is based on the following premise: the storage of the information about the faces of each cell is the key factor for memory consumption and execution time. This information is stored in a face data structure that includes the geometry and the face parameters, constants for the plane equation defined by the face, which is the most memory-consuming data structure in ray casting.

The basic idea behind VF-Ray is to explore ray coherence, improving caching performance by keeping in memory only the face data of the traversals of a set of nearby rays. The nearby rays, which will be cast through neighboring pixels, are under the projection of a given visible face, as shown in Figure 1. This set of pixels is called the visible set.

Figure 1: Ray coherence for one visible face.

The algorithm first creates a list of vertices, a list of cells and, for each vertex, a list of all cells incident to the vertex, called the Use_set of the vertex. The rendering process starts by determining the visible faces from the external faces of the volume. An external face is a face that belongs to only one cell and is not shared by any other cell [BKS97], and a visible face is an external face whose normal makes an angle greater than 90° with the viewing direction. The VF-Ray algorithm processes one visible face at a time, projecting it onto the screen and determining the visible set. For each pixel in this set, the ray traversal is done and each pair of intersections is computed. The internal faces are determined by the intersections, and the face data are stored in a temporary buffer called computedFaces. Whenever a ray cast from the current visible set hits an internal face already stored in the buffer, the VF-Ray algorithm just reads the face data from the computedFaces buffer, without having to recompute the face parameters. The face data are used to evaluate the traversed distance d and the entry and exit scalar values, s0 and s1 respectively. Finally, the cell contribution to the pixel color and opacity is computed using d, s0 and s1 and an illumination integral.

The VF-Ray algorithm explores ray coherence, using the visible face information to guide the creation and destruction of face data in memory. The faces intersected by neighboring rays tend to be the same (as shown in Figure 1), and their data are kept in the computedFaces buffer while the visible set is being computed. When all pixels in a visible set are processed, the VF-Ray algorithm clears the computedFaces buffer.

3.1. Handling Degeneracies

Degenerate situations may occur in ray casting methods when a ray hits an edge or a vertex. In the Bunyk approach, the algorithm checks for the next intersection only on the cells neighboring the current cell's faces and, in this way, would not be able to continue the ray traversal, generating incorrect pixel colors. VF-Ray can handle these two situations by using the Use_set data structure. When a ray hits a vertex v, VF-Ray can find the next intersection by scanning the cells belonging to the Use_set of v. When the ray hits an edge v0v1, VF-Ray can find the next intersection by scanning the Use_sets of v0 and v1. In this way, VF-Ray guarantees that the image will be correctly generated.

4. Memory Efficient GPU-Based Ray Casting

Our hardware-based ray casting algorithm relies on the GPGPU implementation concept to parallelize the original VF-Ray algorithm. The idea behind this concept is to treat the graphics hardware as a coprocessor capable of performing tasks of high arithmetic intensity, regardless of graphics-board-specific computations. The architecture used to implement our GPGPU ray casting algorithm was CUDA [NVI07], due to its simplicity and the possibility of exploiting all GPU resources. In this architecture, the implementation is done through kernels that run in multiple threads. The threads are grouped in blocks, meaning that threads inside the same block share the resources of one GPU multiprocessor. Each multiprocessor has a Single Instruction Multiple Data (SIMD) architecture, which runs the same kernel instructions but operates on different data.

We divide our algorithm in three steps and, as a result, in three different kernels (see Figure 2). In the first kernel, the external faces are read and the visible faces are determined. The second kernel computes the projection of each visible face, splitting it into pixels in screen space. Finally, the third kernel evaluates the ray casting algorithm over each visible face pixel previously computed.

Figure 2: The three kernels of our algorithm.

The first kernel is important to reduce the computation performed by the next two kernels.
performed by the next two kernels. In the first kernel, the computation is bounded by the number of external faces, while the cost of the second and third kernels is based on the number of visible faces, which is approximately half the number of external faces.

Figure 2: The three kernels of our algorithm.

4.1. Data structures

Before the three GPU kernels can be processed, we pre-compute the following data structures. From the list of vertices (vertList) and the list of tetrahedra (tetList) of the volumetric dataset, we compute the tetrahedra connectivity (conTet), which stores, for each tetrahedron face, the id ti of the tetrahedron which shares the face. Figure 3 shows the data structures used by our algorithm, illustrating the first tetrahedron (tetrahedron0). The vertList contains the x, y and z coordinates and the scalar value s of each vertex. The tetList contains the ids vi of the vertices which compose each tetrahedron.

Figure 3: The basic data structures for our GPU-based ray casting algorithm.

To avoid building other data structures, we use the vertex order inside the tetList to determine each tetrahedron face. The vertices of the face fi are vi, v(i+1)mod4 and v(i+2)mod4. For example, the face f2 of some tetrahedron ti is composed by the vertices v2, v3 and v0 of ti, as can be seen in Figure 3. In addition to these data structures, our algorithm uses the external faces list (extFaces), pre-computed in the same way as Bunyk et al. [BKS97].

The original VF-Ray algorithm stores three vertex indices inside its face data structure. In our algorithm, since we target to run within the GPU's limited memory, we store only two indices for each face created. These indices are the tetrahedron ti and the face fi, which are sufficient to identify the vertices of each face, as explained above.

4.2. Find Visible Faces kernel

The first kernel reads the external faces from texture memory and computes the parameters, i.e. plane equation coefficients, for each one. This computation is called face creation and targets to solve two 3x3 linear systems: one to interpolate the z coordinate, and another to interpolate the scalar value s. In this kernel, only the z coordinate interpolation is done for each external face; once this face is checked as visible, the s value interpolation is made and both interpolation parameters are stored in the list of visible faces (visFaces).

The visibility test for an external face is done by comparing the z coordinate of the fourth vertex of the tetrahedron, i.e. the vertex that does not belong to the face, with the z coordinate of its projection on the external face. The fourth vertex projection is done using the z coordinate interpolation parameters computed for the external face. The face is visible if the projected z is smaller than the z coordinate of the fourth vertex.

Only the visible face parameters are written to CUDA global memory, that is, the z and s interpolation parameters are stored in the visFaces list.

The first kernel employs the visible face test, i.e. back-face culling, and face creation for each thread, using the maximum number of threads available per block. The number of external faces is fixed for each volume dataset, and it determines the number of threads across all blocks in this kernel. The computation time and memory footprint spent in the first kernel correspond to less than 5% of the total (time and memory). Nevertheless, the main role of this kernel is to reduce the number of threads that will be used by the next two kernels. From now on, they will run over the visFaces list instead of the extFaces list; the visFaces list is about half the size of the extFaces list.

4.3. Project Visible Faces kernel

The second kernel reads the coordinates of the vertices of each visible face from the vertList in texture memory. The vertex ids to access the vertList are read from the tetList in global memory. Note that texture components cannot be indexed directly, forcing us to store the tetList in global memory, instead of texture memory, in order to avoid unnecessary branches.
The vertex coordinates are used to project the visible faces onto the screen space. The bounding box of the projection is computed and stored in the projected visible faces list (projVisFaces), which is written to global memory to be used by the next kernel.

The second kernel uses one thread for each visible face computation. As in the first kernel, the work done by each thread uses few resources of a block. Therefore, the number of threads per block can be maximized, and the computation time and memory footprint are unnoticeable.

4.4. Ray Casting kernel

The third kernel reads the visFaces list, computed in the first kernel, and the projVisFaces list, computed in the second kernel. The visible face pixels are determined using the bounding box of the projection and computing a point-inside-triangle test. This test employs only cross product operations. For each visible face pixel determined, the ray casting is triggered using the visible face as the first entry face. All the pixels will use the face parameters stored in the visFaces list. From this point, the algorithm travels the mesh, finding each next face and computing the ray integration using the same integration method proposed by Bunyk et al. [BKS97].

Just like VF-Ray, we store the next faces in the 1D computedFaces buffer. However, our ray casting algorithm is designed to run in parallel, taking advantage of the ray coherence inside each thread computation. In order to avoid dynamic allocation, the computedFaces buffer is allocated in CUDA local memory for each thread with a fixed size, and it is indexed by a hash table. The hash index is the z coordinate of the centroid of the face. Refer to Figure 4 for more details. The first buffer position is associated with the minimum z coordinate (minZ) of the volume dataset, while the last position is associated with the maximum z coordinate (maxZ). Figure 4 presents two neighboring rays processed by the same thread in sequence. Ray 1 creates four internal faces, while ray 2 reads them from the computedFaces buffer.

Figure 4: The computedFaces buffer stores previous hits (red circles) to be read in future hits (green crosses). The centroid is used to hash the buffer.

Furthermore, the computedFaces buffer stores the tetrahedron id ti and the face id fi besides the face parameters. When a hit occurs, those indices are used to ensure that the face to be read is the hit face. If it is not, a collision occurred. In this case, we create the new face, use it, and replace the old one in the buffer. The original VF-Ray algorithm spends more memory, storing four face indices for each tetrahedron to point to the computedFaces buffer.

The third kernel divides the CUDA grid in blocks (see Figure 5), associating one block to one visible face. Each block is, in turn, divided into threads that compute a small set of pixels. The threads with pixels outside the visible face, or with fewer pixels inside, finish their computation faster than the other threads, but are not wasted. These threads are warped to other blocks, by the graphics hardware driver, in order to continue the kernel computation. Unlike the first two kernels, the amount of computation performed in the ray traversal consumes much more resources of a block, making unfeasible the computation of all visible face pixels by one thread.

Figure 5: The CUDA grid scheme used in our implementation. Each block inside the grid computes one visible face, while each thread inside the block computes a set of pixels.

4.5. Handling Degeneracies

We treat the degenerate cases differently than the original VF-Ray. While it uses the Use_set list to correctly find the next tetrahedron for the traversal, we perturb the ray to get rid of the degeneracy; the next iteration then continues with the same ray direction, without this displacement. Our approach obtains approximated results, but saves the memory spent to keep the Use_set list.

5. Results

In this section we present the results of the performance and memory usage of our hardware-based ray casting algorithm.
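As a concrete recap of the ray casting kernel above, the centroid-hashed computedFaces buffer can be sketched as follows. This is a host-side C++ sketch with hypothetical sizes and field names, not the authors' CUDA code; it mirrors the collision rule of Section 4.4, where the stored (ti, fi) pair identifies the face and, on a mismatch, the newly created face replaces the old entry.

```cpp
// One cached face: identifying indices plus its interpolation parameters.
struct FaceData {
    int   ti = -1, fi = -1;   // tetrahedron and face ids; -1 marks an empty slot
    float params[6] = {};     // z and s interpolation coefficients (layout assumed)
};

const int BUFFER_SIZE = 64;   // hypothetical fixed per-thread capacity

// Hash: map the z coordinate of the face centroid into [0, BUFFER_SIZE),
// where minZ and maxZ bound the whole dataset.
static int hashIndex(float centroidZ, float minZ, float maxZ) {
    int i = static_cast<int>((centroidZ - minZ) / (maxZ - minZ) * (BUFFER_SIZE - 1));
    if (i < 0) i = 0;
    if (i >= BUFFER_SIZE) i = BUFFER_SIZE - 1;
    return i;
}

// Return the cached face, or create and cache it on a miss or collision.
// createFace stands in for the face-creation step described in Section 4.2.
template <class CreateFn>
static const FaceData& fetchFace(FaceData buffer[], int ti, int fi,
                                 float centroidZ, float minZ, float maxZ,
                                 CreateFn createFace) {
    FaceData& slot = buffer[hashIndex(centroidZ, minZ, maxZ)];
    if (slot.ti != ti || slot.fi != fi)   // empty slot or collision:
        slot = createFace(ti, fi);        // replace with the newly created face
    return slot;
}
```

A second lookup of the same (ti, fi) pair is a hit and reuses the stored parameters; a different face hashing to the same slot triggers one new face creation and overwrites the entry.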
The algorithm was written in the CUDA language, which is an extension of a subset of the C programming language. Our experiments were conducted on an Intel Pentium Core 2 Duo 6400 with 2.13 GHz per processor and 2 GB RAM. We used an nVidia GeForce 8800 Ultra graphics card with 768 MB.

The evaluation of our algorithm, which we call VF-Ray-GPU, was done against recent hardware-based algorithms. We also compare VF-Ray-GPU with the original version of VF-Ray that runs solely on the CPU. The idea behind this last comparison is twofold: first, to measure how much the use of graphics hardware can improve the performance over a CPU implementation; and second, to measure the gain we obtained by using different and more efficient structures. Before evaluating our results, we give a brief description of the hardware-based algorithms used as baselines for our comparisons.

5.1. Baselines

The following algorithms were used for comparison:

VICP: The View-Independent Cell Projection algorithm (VICP) was proposed by Weiler et al. [WKME03b]. This is a hybrid version of a cell projection algorithm. The algorithm projects each cell of the dataset, but the integrations inside the cells are performed exactly in the same way as ray casting does.

HARC: The Hardware-Based Ray Casting algorithm (HARC) was proposed by Weiler et al. [WKME03a] and is based on Bunyk et al. [BKS97]. The idea is to store the adjacency information in texture and compute the contribution of each cell by accessing a 3D texture indexed by the pre-integration results. The intersections are computed using face normals, and the integration uses the pre-computed gradient of each cell.

HARC-Partial: This algorithm [EC05] is an extension of the HARC algorithm with partial pre-integration. The algorithm also employs an alternative data structure, in order to reduce the memory usage.

5.2. Workload

We used four different datasets in our experiments: Blunt Fin, Oxygen Post and Fighter from NASA's NAS website, and F16 from EADS. Screenshots of the datasets are shown in Figure 6. Table 1 shows the number of vertices, faces, boundary faces, i.e. external faces, and cells for each dataset. As we can observe in this table, we used two small datasets, Blunt and Oxygen, and two large datasets, Fighter and F16. The Fighter dataset used in our experiments is, in fact, a larger and more precise version of the original NASA dataset, which we named Fighter+. Our results are the mean values for the complete rotation of the datasets. For the small datasets, we used a viewport of 512² pixels, and for the large datasets, we used a higher resolution, 1024² pixels, in order to take into account more cells of these volumes.

  Dataset     # Verts   # Faces   # Boundary   # Cells
  Blunt       41 K      381 K     13 K         187 K
  Oxygen      109 K     1 M       27 K         513 K
  F16         1.1 M     12.9 M    309 K        6.3 M
  Fighter+    1.9 M     22.1 M    334 K        11.2 M

Table 1: Datasets sizes.

Figure 6: Datasets in 512² dimension: (a) Blunt Fin, (b) Oxygen Post, (c) Fighter, and (d) F16.

5.3. Small Datasets

In this section we evaluate the performance and memory usage of VF-Ray-GPU against the other hardware-based implementations. We used in this evaluation only the small datasets, since the other algorithms could not handle the large ones.

Table 2 shows the time and memory usage results for the baseline rendering algorithms and VF-Ray-GPU, for Blunt and Oxygen. The two columns for each dataset summarize the total memory footprint (in kilobytes) and the timing (in milliseconds).

Table 3 shows some memory usage aspects of the algorithms: the number of bytes per tetrahedron needed to store the volume data (Bytes/Tet); the number of bytes per pixel, which depends on the final image (Bytes/Pixel); and the total number of megabytes spent on the pre-integration or partial pre-integration technique (Pre-Int.). These numbers indicate how the memory usage grows with respect to the dataset size, the final image precision, and the use of a pre-integration table lookup.
                 Blunt Fin               Oxygen Post
  Algorithm      Memory (KB)  Time (ms)  Memory (KB)  Time (ms)
  VICP           118,524      190        249,928      546
  HARC           72,267       18         123,245      33
  HARC-Partial   22,636       32         50,248       51
  VF-Ray-GPU     7,029        186        19,494       370

Table 2: Memory and timing results for VF-Ray-GPU and other hardware-based algorithms.

  Algorithm      Bytes/Tet   Bytes/Pixel   Pre-Int. (MB)
  VICP           456         –             16
  HARC           160         96            16
  HARC-Partial   96          96            1
  VF-Ray-GPU     38          –             –

Table 3: Hardware-based algorithms memory usage.

When we compare VF-Ray-GPU with the other ray casting algorithms, HARC and HARC with partial pre-integration (HARC-Partial), we can observe that our algorithm uses only about 33% of the memory used by HARC with partial pre-integration and about 12% of the memory used by the original HARC algorithm. However, VF-Ray-GPU is slower than both hardware ray casting implementations. This is due mainly to the ray traversal technique.

The main difference between the two hardware ray casting algorithms and VF-Ray-GPU resides in the ray traversal and integration techniques. In HARC, the ray traversal is done by means of pre-computed normals per face and gradients per tetrahedron, while the ray integration is performed by the use of the pre-integration (or partial pre-integration) method to evaluate each cell's contribution. In the VF-Ray-GPU approach, computations over the faces' parameters replace the use of normals and gradients for the ray traversal, and the volume rendering integral is computed on-the-fly. The use of faces slows down VF-Ray-GPU but, on the other hand, reduces by more than 50% the memory required per tetrahedron, compared with the two HARC algorithms. This reduction in the memory footprint is fundamental when dealing with large volume datasets.

When we compare VF-Ray-GPU with the hybrid algorithm, we observe that the VICP algorithm uses 16 times more memory and is slower than VF-Ray-GPU for Oxygen. VICP, being a hybrid algorithm, uses all the data necessary for ray casting and also employs part of the cell projection calculation, which makes it the most memory consuming algorithm.

5.4. Large Datasets

In this section we evaluate the performance and memory usage of VF-Ray-GPU to generate images with the dimension of 1024², for the large datasets. One of the most important results of our work is that, among all hardware-based ray casting algorithms, VF-Ray-GPU was the only one that could render F16 and Fighter+. The 768 MB of the GeForce 8800 Ultra was not enough for the other ray casting algorithms.

In Table 4 we compare VF-Ray-GPU with VF-Ray. The table shows the memory footprint (in megabytes) and the total execution time (in milliseconds) for the two large datasets rendered by VF-Ray-GPU and the original VF-Ray algorithm (CPU). As we can observe in this table, in comparison with the original VF-Ray, our algorithm runs about 4 times faster and uses about 50% less memory. The performance gain is due to the parallel nature of our ray casting algorithm, while the data restructuring improves the memory usage.

             Memory (MB)      Time (ms)
  Datasets   CPU     GPU      CPU     GPU
  Fighter+   876     426      16263   3081
  F16        499     239      2515    804

Table 4: Memory footprint and timing comparison between VF-Ray-CPU and VF-Ray-GPU algorithms to generate images with the dimension of 1024².

6. Conclusions

In this work, we have proposed a novel hardware-based ray casting algorithm, based on a memory-aware software implementation of ray casting, VF-Ray. The main contribution of our proposal is to provide ray casting in graphics hardware with low memory usage, in order to handle large datasets. Besides the low memory consumption, our algorithm also allows the handling of some degenerate cases produced by the hardware implementations that are based on previous ray casting approaches, like the Bunyk approach.

We compared our algorithm with other recent hardware-based volume rendering algorithms, based on the ray casting paradigm, and with a hybrid version mixing ray casting and cell projection. For smaller datasets, our algorithm is slower than the ray casting approaches, but uses from 77% to 90% less memory. When compared to the hybrid approach, our algorithm is faster and uses about 95% less memory.

For larger datasets, our algorithm could render datasets for which the others failed, due to the amount of memory available in the graphics card. Our results show a linear increase
in the execution time with the increase in the dataset size, indicating that the algorithm is scalable to larger datasets. In addition, our algorithm was faster than the original VF-Ray, while using significantly less memory.

As future work, we intend to use the thread synchronization scheme of CUDA to handle non-convex meshes. Another future work is to use the partial pre-integration technique, in order to improve the rendering quality.

7. Acknowledgments

We would like to thank Claudio Silva for providing us the Fighter and F16 datasets (Fighter from Neely and Batina – NASA, and F16 from Udo Tremel – EADS-Military). We also acknowledge the grants of the first and second authors, provided by the Brazilian agencies CNPq and CAPES.

References

[APCBRF07] Pina A., Bentes C., Farias R.: Memory efficient and robust software implementation of the raycast algorithm. In WSCG'07: The 15th Int. Conf. in Central Europe on Computer Graphics, Visualization and Computer Vision (2007).

[BKS97] Bunyk P., Kaufman A. E., Silva C. T.: Simple, fast, and robust ray casting of irregular grids. In Dagstuhl '97, Scientific Visualization (Washington, DC, USA, 1997), IEEE Computer Society, pp. 30–36.

[BPaLDCS06] Bernardon F. F., Pagot C. A., Comba J. L. D., Silva C. T.: GPU-based Tiled Ray Casting using Depth Peeling. Journal of Graphics Tools 11, 3 (2006), 23–29.

[CBPS06] Callahan S. P., Bavoil L., Pascucci V., Silva C. T.: Progressive volume rendering of large unstructured grids. IEEE Transactions on Visualization and Computer Graphics 12, 5 (2006), 1307–1314.

[CICS05] Callahan S. P., Ikits M., Comba J. L., Silva C. T.: Hardware-assisted visibility sorting for unstructured volume rendering. IEEE Transactions on Visualization and Computer Graphics 11, 3 (2005), 285–295.

[EC05] Espinha R., Celes W.: High-quality hardware-based ray-casting volume rendering using partial pre-integration. In SIBGRAPI '05: Proceedings of the XVIII Brazilian Symposium on Computer Graphics and Image Processing (2005), IEEE Computer Society, p. 273.

[FMS00] Farias R., Mitchell J. S. B., Silva C. T.: ZSWEEP: an efficient and exact projection algorithm for unstructured volume rendering. In VVS'00: Proceedings of the 2000 IEEE Symposium on Volume visualization (New York, NY, USA, 2000), ACM Press, pp. 91–99.

[Gar90] Garrity M. P.: Raytracing irregular volume data. In VVS'90: Proceedings of the 1990 workshop on Volume visualization (New York, NY, USA, 1990), ACM Press, pp. 35–40.

[Har04] Harris M.: GPGPU – General-Purpose computation using Graphics Hardware, 2004. http://www.gpgpu.org/.

[MA04] Moreland K., Angel E.: A fast high accuracy volume renderer for unstructured data. In VVS '04: Proceedings of the 2004 IEEE Symposium on Volume visualization and graphics (Piscataway, NJ, USA, 2004), IEEE Press, pp. 13–22.

[MHC90] Max N., Hanrahan P., Crawfis R.: Area and volume coherence for efficient visualization of 3d scalar functions. In VVS'90: Proceedings of the 1990 workshop on Volume visualization (New York, NY, USA, 1990), ACM Press, pp. 27–33.

[MMFE06] Marroquim R., Maximo A., Farias R., Esperança C.: GPU-Based Cell Projection for Interactive Volume Rendering. In SIBGRAPI '06: Proceedings of the XIX Brazilian Symposium on Computer Graphics and Image Processing (Los Alamitos, CA, USA, 2006), IEEE Computer Society, pp. 147–154.

[NVI07] nVidia: CUDA Environment – Compute Unified Device Architecture, 2007. http://www.nvidia.com/object/cuda_home.html.

[RMB*07] Ribeiro S., Maximo A., Bentes C., Oliveira A., Farias R.: Memory-Aware and Efficient Ray-Casting Algorithm. In SIBGRAPI '07: Proceedings of the XX Brazilian Symposium on Computer Graphics and Image Processing (Los Alamitos, CA, USA, 2007), IEEE Computer Society, pp. 147–154.

[ST90] Shirley P., Tuchman A. A.: Polygonal approximation to direct scalar volume rendering. In Proceedings of the San Diego Workshop on Volume Visualization, Computer Graphics (1990), vol. 24(5), pp. 63–70.

[WKME03a] Weiler M., Kraus M., Merz M., Ertl T.: Hardware-Based Ray Casting for Tetrahedral Meshes. In Proceedings of the 14th IEEE conference on Visualization '03 (2003), pp. 333–340.

[WKME03b] Weiler M., Kraus M., Merz M., Ertl T.: Hardware-based view-independent cell projection. IEEE Transactions on Visualization and Computer Graphics 9, 2 (2003), 163–175.

[WMFC02] Wylie B., Moreland K., Fisk L. A., Crossno P.: Tetrahedral projection using vertex shaders. In VVS'02: Proceedings of the 2002 IEEE Symposium on Volume visualization and graphics (Piscataway, NJ, USA, 2002), IEEE Press, pp. 7–12.

[WMKE04] Weiler M., Mallon P. N., Kraus M., Ertl T.: Texture-encoded tetrahedral strips. In VV '04: Proceedings of the 2004 IEEE Symposium on Volume Visualization and Graphics (Washington, DC, USA, 2004), IEEE Computer Society, pp. 71–78.