Game Development Reference
Our method was implemented using C++ and NVIDIA CUDA 1.1 on a PC
equipped with an Intel Core 2 Q6600 CPU, a GeForce 8800 GT GPU, and Tesla
S870 GPUs. The program runs five CPU threads, one per GPU: one GPU is used
for rendering and the other four for computing the particle-based simulation.
Each CPU thread manages one GPU and launches the kernels for that GPU.
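The one-thread-per-GPU arrangement described above can be sketched in plain C++. This is a minimal illustration, not the implementation used here: the names `Slice`, `partitionParticles`, `simulateSlice`, and `stepAllDevices` are hypothetical, the even split of particles is an assumption, and the per-GPU kernel work is replaced by a CPU loop so that the sketch compiles without the CUDA toolkit. In a real worker, the thread would first bind to its device (for example with the CUDA runtime call `cudaSetDevice`) and then launch kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical range of particles owned by one compute GPU.
struct Slice {
    std::size_t begin;
    std::size_t end;  // one past the last particle in the range
};

// Split numParticles into numDevices contiguous, nearly equal slices
// (an assumed distribution, chosen only for illustration).
std::vector<Slice> partitionParticles(std::size_t numParticles, int numDevices) {
    assert(numDevices > 0);
    std::vector<Slice> slices;
    const std::size_t base = numParticles / numDevices;
    const std::size_t extra = numParticles % numDevices;
    std::size_t begin = 0;
    for (int d = 0; d < numDevices; ++d) {
        // The first `extra` devices take one additional particle each.
        const std::size_t count = base + (static_cast<std::size_t>(d) < extra ? 1 : 0);
        slices.push_back({begin, begin + count});
        begin += count;
    }
    return slices;
}

// CPU stand-in for the kernels one GPU would run on its slice.
void simulateSlice(std::vector<float>& positions, Slice s, float dt) {
    for (std::size_t i = s.begin; i < s.end; ++i) {
        positions[i] += dt;
    }
}

// One simulation step: one host thread per compute device.
void stepAllDevices(std::vector<float>& positions, int numDevices, float dt) {
    const std::vector<Slice> slices = partitionParticles(positions.size(), numDevices);
    std::vector<std::thread> workers;
    for (int d = 0; d < numDevices; ++d) {
        workers.emplace_back([&positions, &slices, d, dt] {
            // A real worker would bind to GPU d here and launch kernels;
            // this sketch only updates the slice on the CPU.
            simulateSlice(positions, slices[d], dt);
        });
    }
    // Wait until every device has finished its part of the step.
    for (std::thread& w : workers) {
        w.join();
    }
}
```

Because each worker writes only to its own slice, the threads need no locking within a step; the join at the end plays the role of the per-step synchronization point between devices.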
Figure 7.14 compares the computation times of the simulation shown in
Figure 7.15 for different numbers of GPUs. With one million particles, a
simulation step takes about 95 ms on one GPU, while the same simulation
takes about 40 ms on two GPUs and about 25 ms on four GPUs. Although these
timings include particle management and the data transfer between GPUs,
they scale nearly in proportion to the number of GPUs. The parallel
efficiency is lower on four GPUs than on two. This is because a GPU has to
communicate with only one other GPU when two GPUs are used, whereas it has
to communicate with two adjacent GPUs when four GPUs are used. The timings
excluding the data transfer time are also shown in Figure 7.14. These
timings exclude only the actual data transfer between processors; they
still include the time spent managing the data. From this figure, we can
see that the data management overhead is small enough for the performance
to scale well with the number of GPUs.
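To make the scaling claim concrete, speedup and parallel efficiency can be computed from the step times quoted above. The following is a small sketch; the function names `speedup` and `parallelEfficiency` are illustrative, and the standard definitions (speedup as the ratio of single-GPU time to multi-GPU time, efficiency as speedup divided by the number of GPUs) are assumed rather than taken from the text.

```cpp
#include <cassert>

// Speedup of a multi-GPU run relative to the single-GPU run.
double speedup(double singleGpuMs, double multiGpuMs) {
    assert(multiGpuMs > 0.0);
    return singleGpuMs / multiGpuMs;
}

// Parallel efficiency: speedup divided by the number of GPUs used.
double parallelEfficiency(double singleGpuMs, double multiGpuMs, int numGpus) {
    assert(numGpus > 0);
    return speedup(singleGpuMs, multiGpuMs) / numGpus;
}
```

With the approximate timings of 95 ms, 40 ms, and 25 ms, `speedup(95.0, 40.0)` is 2.375 and `speedup(95.0, 25.0)` is 3.8, so the efficiency is roughly 1.19 on two GPUs and 0.95 on four. Since the timings are rounded ("about"), these values are only indicative, but they agree with the observation that efficiency drops from two to four GPUs.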
Figure 7.14. Comparison of simulation times using up to four GPUs, plotted against the number of particles (see Color Plate VIII).