Game Development Reference
In-Depth Information
D 1
D 1
D 1
D 1
D 1
D 1
D 0
C 1
D 0
C 1
C 1
D 0
C 1
D 0
C 1
B 1
D 0
C 1
C 0
C 0
C 0
A 1
C 0
C 0
B 1
B 1
B 1
D 0
B 1
B 1
B 0
A 1
B 0
C 0
A 1
B 0
A 1
B 0
A 1
B 0
B 0
A 1
A 0
A 0
A 0
A 0
A 0
A 0
Figure 7.10. Bitonic merge sort. Thread A compares A 0 and A 1 , and so on.
We generalize the idea of an odd-even transition sort to develop our block
transition sort, which is suited for architectures like the GPU that have a fast local
memory for a set of threads. Instead of comparing two adjacent index pairs, it
compares two adjacent blocks consisting of several elements. Precisely, it sorts
two adjacent blocks in a step. Figure 7.9 illustrates how the block transition sort
works. Block transition sort is good for a GPU, which has fast local memory
for each processor, because partitioning the computation into small problems lets
threads sort two adjacent blocks only on the fast local memory without writing
back to the slower global memory. Also, the memory-access pattern is preferable,
because all the random access can be done on the chip memory so that all the read
and write operations can be aligned memory accesses.
In our implementation, we used bitonic merge sort for sorting two adjacent
blocks. As shown in Figure 7.10, bitonic merge sort always compares n/ 2 sets
of entries in a pass, where n is the total number of elements. So n/ 2 threads are
executed, and each of them reads two elements to shared memory. Then it repeats
comparison and synchronization until sorting is done. It is important to set the
size of a block such that two blocks can fit in shared memory.
If we have more budget for the sorting, the two adjacent sorted chunks of data
could be merged by using merge sort to make it much more efficient.
7.4.5
Performance
Figure 7.11 shows a simulation that sorts particle values. A box half-filled with
particles is rotated. To make the effect of sorting illustrative, particles are colored
by their indices. We can see that these colors do not mix up, although particles are
mixed up. This is because of the renumbering of particles. The simulation times
on a GPU are shown in Figure 7.12. The figure shows total computation time of a
 
Search Nedrilad ::




Custom Search