Game Development Reference
multiprocessor. If eight bits are used for the bounding values, and assuming there
are 256 cells in each direction at most, a set of bounding boxes requires 1 KB
of storage. (We could use 32 bits per bounding value, but that strains the on-chip
resources and lowers the occupancy of hardware threads.) Therefore, we can
calculate at most 16 sets of bounding boxes by the same number of small thread
groups in a block at the same time. The computation of the bounding volumes is
done by reading particle values from main memory with aligned memory access
and updating the values using synchronization in the thread group. This corre-
sponds to the first step in Figure 7.7. The next step is reduction of these sets of
values. These outputs are still placed in shared memory, and one set of bounding
boxes is calculated in the same kernel by assigning a thread to a bounding
box that reads all the bounding values from the smaller groups. In short, we have
256 slices, 256 threads run at the merge step, and thread i assigned to the i th slice
compares all the values of the slices from all the small groups. Then threads write
the values to the global memory at the last merge step.
The last merge step runs in another kernel. This step is almost the same as the
previous merge except that the values are read from off-chip memory this time.
Instead of using one thread to reduce a slice, a tree-shaped reduction is used, in
which n/2 threads are assigned to a slice (n is the number of bounding boxes) and
reduce two values to one per step; this has a performance advantage. This is the
third step in Figure 7.7. In this way, a set of bounding boxes is calculated from all the
particles.
When using CUDA for the computation, the number of threads physically running
is not equal to the width of a kernel; in most cases it is smaller than the kernel
width. Although we could make the block size equal to the number of physical
threads, increasing the block size makes the computation much more efficient
because it lets the hardware switch to another work group when one is stalled
(like multithreading on the CPU).
Now that we have the bounding boxes for all the slices, the number of voxels
in a slice is calculated. This computation is tricky when using shaders, but with
synchronization among the threads in a block it has become easier.
The prefix sum of the array is calculated in parallel to get the indices of the first
voxels. For this parallel prefix sum, the method presented in [Harris et al. 07] is used.
Figure 7.8 shows how much the sliced grid improves the memory efficiency
in a test simulation. It compares the memory consumption of the uniform grid,
octree, and sliced grid in a DEM simulation. We can see that the sliced grid
greatly reduces memory consumption compared to the uniform grid, with efficiency
close to that of the octree. Moreover, the cost of accessing the voxel data is at least
as cheap as with the uniform grid, and can be much better, as will be shown later.