Game Development Reference
not perform well on the GPU. Instead, sorting networks, such as bitonic merge
sort, are preferred because of their parallel nature [Kipfer and Westermann 05].
However, the drawback of sorting networks is that they require lots of passes to
complete the sorting. Recently, the functionality of the GPU has made it possible
to implement radix sort, which requires fewer passes [Grand 07].
Although the radix sort runs quickly on the GPU, the sorting cost of the radix
sort is prohibitively expensive for the sole purpose of improving cache efficiency.
Actually, it took more than the computation of one step on DEM simulation in
our experiment. So it does not meet our goal. There are several sorting algo-
rithms suited for a nearly sorted list, such as insertion sort. They are good for
situations with temporal coherency between frames, like our simulation. But the
problem here is that an insertion sort is a completely sequential algorithm, which
is not good for multiple processors, such as GPUs. But what we want is not a
completion of a sort in a frame because the sort is used just to increase the spatial
coherency of the data. Even if a sort in a time step improves the order of the lists
more or less, it would improve the cache efficiency.
7.4.4 Block Transition Sort
Among sorting networks, we have chosen an odd-even transition sort, a sorting
network that completes a sort by repeating two simple operations: comparing ad-
jacent odd-even index pairs and flipping them if they are in the wrong order, then
comparing adjacent even-odd index pairs. If blocks with an arrow in Figure 7.9
are thought of as two adjacent elements, it shows how the sorting works. Odd-
even transition sort is good for a nearly sorted list but is pretty poor when it is
applied to a random list. If only two adjacent elements are flipped, it can com-
plete the sort in one or two steps. But if they are arranged in the reverse order, it
takes n steps to move them to the correct order.
Figure 7.9. Block transition sort. An array is divided into blocks, and two adjacent blocks
are sorted in a pass.