particles whose indices a satisfy i·n/m ≤ a < (i+1)·n/m. Then the bounding box of the jth slice in the ith job is

B_ij,min = min_{a ∈ P_ij} { b_a },
B_ij,max = max_{a ∈ P_ij} { b_a },

where P_ij = { a | s_a = j, i·n/m ≤ a < (i+1)·n/m } and s_a denotes the slice to which particle a belongs. One job is processed by
a block of threads on CUDA. Since the bounding values are frequently read and
updated, they can be stored quickly on chip memory if available. On CUDA,
shared memory is used for their storage.
However, updates of the bounding values must be serialized to avoid write conflicts among threads: whenever a thread updates a bounding volume, the value has to be locked. This kills performance when a large number of threads are running at the same time. To make the computation more efficient, each job is split into smaller jobs, and the threads in a block are likewise divided into smaller thread groups. The computation becomes much more efficient because each of these smaller groups calculates its own bounding-volume data, so only a few threads need to be synchronized at a time. Figure 7.7 illustrates a three-step computation of the bounding volumes.
We will look more closely at the implementation of this computation on a current GPU. Reducing the size of the data is a good idea in most cases, because the latency of a memory access is much higher than that of an arithmetic instruction. Also, the on-chip resources available to a computation are limited: to maximize the efficiency of the GPU, register and shared-memory usage should be kept to a minimum. The size of the shared memory on an NVIDIA G80 is 16 KB per multiprocessor; keeping, say, a three-component min and max for each of 32 slices takes only 32 × 6 × 4 = 768 bytes, comfortably within this budget.
Figure 7.7. Computation of bounding volumes (particle positions are reduced in three steps across Block0, Block1, and Block2).