Game Development Reference
are “interesting” for the fragments. We therefore did not attempt to simulate an
LRU cache, as suggested by Ragan-Kelley et al. On a current GPU there can
be several thousand fragments rasterized in parallel, thus the global cache should
also be able to hold a similar magnitude of samples to achieve a good hit rate.
In a naïve implementation, a thread could query the buffer tail and check the
last N items in the CG-buffer. Of course the latency of the iterative memory
accesses would be prohibitively high. We now show a cache implementation that
performs cache lookups with only one buffer load and an atomic lock.
The concept of the shading grid already creates indices for shading samples
that grow approximately linearly with the rasterized fragments. Thus the
FIFO cache could also be implemented by simply storing the last N ssID values.
Consequently, instead of linearly searching in a global buffer, we could introduce
a bucketed hash array. The full implementation of our optimized algorithm is
shown in Listing 3.4.
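The lookup flow can be illustrated with a minimal CPU-side sketch in Python (the chapter's actual implementation is GLSL; the names NUM_BUCKETS, hash_ssid, and find_or_store here are illustrative, not taken from Listing 3.4):

```python
NUM_BUCKETS = 4096  # illustrative size, not from the chapter

cache = [None] * NUM_BUCKETS   # one (ssID, address) entry per bucket
cg_buffer = []                 # stand-in for the CG-buffer of shading samples

def hash_ssid(ssid):
    # Modulo hash: ssIDs grow roughly linearly with rasterized fragments,
    # so concurrently rasterized fragments spread evenly over the buckets.
    return ssid % NUM_BUCKETS

def find_or_store(ssid, shade):
    bucket = hash_ssid(ssid)
    entry = cache[bucket]
    if entry is not None and entry[0] == ssid:
        return entry[1]                  # hit: reuse the stored address
    address = len(cg_buffer)             # miss: append a new shading sample
    cg_buffer.append(shade(ssid))
    cache[bucket] = (ssid, address)      # replace whatever was in the bucket
    return address
```

Because recent ssIDs occupy distinct buckets, this behaves like a FIFO over the last NUM_BUCKETS samples without any linear search.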
The hash function ( hashSSID ) is a simple modulo operation with the number
of buckets. This evenly distributes queries from rasterized fragments over the
buckets, which is important to minimize cache collisions (when threads with
different ssIDs compete for the same bucket). In case of a cache miss, multiple
threads compete for storing the shading samples in the same bucket, therefore
we use a per-bucket locking mechanism ( uBucketLocks ). Note that we try to
minimize the number of instructions between obtaining and releasing a lock: the
computation of a shading sample does not happen inside the critical section, but
we only set the needStore flag to perform the storage later.
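The deferred-store pattern can be sketched in Python with per-bucket locks (a CPU-side analogy; shade_fragment, allocate_address, and the single-entry buckets are hypothetical simplifications of the GLSL in Listing 3.4):

```python
import threading

NUM_BUCKETS = 1024
bucket_locks = [threading.Lock() for _ in range(NUM_BUCKETS)]
buckets = [None] * NUM_BUCKETS   # one (ssID, address) entry per bucket

def shade_fragment(ssid, allocate_address, compute_shading):
    # On a miss, reserve a cache slot under the bucket lock, but run the
    # expensive shading *outside* the critical section (the needStore pattern).
    bucket = ssid % NUM_BUCKETS
    need_store = False

    lock = bucket_locks[bucket]
    lock.acquire()
    entry = buckets[bucket]
    if entry is not None and entry[0] == ssid:
        address = entry[1]             # hit: reuse the stored sample
    else:
        address = allocate_address()   # miss: reserve a slot cheaply...
        buckets[bucket] = (ssid, address)
        need_store = True              # ...and defer the heavy work
    lock.release()

    if need_store:
        compute_shading(address)       # shading happens outside the lock
    return address
```

Keeping only the reservation inside the critical section means competing threads are blocked for a handful of instructions rather than for an entire shading computation.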
As the execution order of fragments is nondeterministic, there is no guarantee
that all threads obtain the lock within a given number of steps. In practice we
very rarely experienced cases where fragment shader execution froze because of
starving fragment shaders. While we hope this will change on future architectures,
we have limited the spinlock iterations, and in case of failure the fragment shader
falls back to storing the shading sample without shading reuse. In our experiments
this only happened to a negligible fraction of fragments.
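The bounded spinlock with fallback might look as follows (a Python sketch; MAX_SPIN and the function names are illustrative assumptions, not values from the chapter):

```python
import threading

MAX_SPIN = 1024  # illustrative bound on spinlock iterations

def try_lock_bounded(lock, max_spin=MAX_SPIN):
    # Spin at most max_spin times; report failure instead of spinning forever.
    for _ in range(max_spin):
        if lock.acquire(blocking=False):
            return True
    return False

def shade_with_fallback(lock, reuse_path, fallback_path):
    # If the bucket lock cannot be taken in time, skip reuse entirely:
    # store a fresh shading sample rather than stall the fragment shader.
    if try_lock_bounded(lock):
        try:
            return reuse_path()
        finally:
            lock.release()
    return fallback_path()
```

The fallback trades a small amount of redundant shading for a guarantee that no fragment waits indefinitely.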
Another interesting observation was that if the number of cache buckets is
high enough, we can severely limit the bucket size. As the reader can see in
the source code, a bucket stores only a single uvec4 element, which corresponds to
two shading samples: a cache entry is a tuple of an ssID and a memory address.
This is a very important optimization: instead of searching inside the
cache, we can look up any shading sample with a single load operation using its
ssID.
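The two-entry bucket can be modeled as four 32-bit lanes, mirroring one uvec4 (a Python sketch with illustrative names; an empty slot would use a sentinel ssID such as 0xFFFFFFFF):

```python
def lookup_uvec4(bucket, ssid):
    # bucket = (ssid0, addr0, ssid1, addr1), mirroring a single uvec4 load;
    # one load plus two compares resolves the hit.
    if bucket[0] == ssid:
        return bucket[1]
    if bucket[2] == ssid:
        return bucket[3]
    return None  # cache miss

def insert_uvec4(bucket, ssid, addr):
    # The new entry goes into the .xy lanes; the previous .xy pair shifts
    # to .zw, evicting the older of the two samples (a two-entry FIFO).
    return (ssid, addr, bucket[0], bucket[1])
```

This keeps each bucket at exactly 16 bytes, so the whole cache probe is one aligned buffer load.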
In our first implementation, each bucket in the cache was stored as
a linked list of shading sample addresses, similarly to the per-pixel linked list
algorithm of [Yang et al. 10]. When we ran experiments to measure the
necessary length of this list, we found that in most cases even a single element
per bucket is sufficient, and we did not have any cache misses when we considered
only two elements per bucket. This is why we could discard the expensive linked