Game Development Reference
CG-buffer itself could be directly used as a global memoization cache, it would
be very inefficient to search for shading samples directly in it, especially since a
cached entry is only relevant for the currently rasterized primitives in flight.
Shading and resolving stages. The collected samples in the compact buffers are
then shaded using GPU compute kernels. These kernels only execute for shading
samples that are marked visible (see the next section). Finally, each visibility
sample can gather its final color value in a full-screen pass. This method trivially
extends to an arbitrary number of render targets, supporting efficient shading
reuse for multiview rasterization as well.
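The two stages above can be modeled on the CPU to make the data flow concrete. The following is a minimal C++ sketch, not the GLSL implementation discussed in this chapter: the `ShadingSample` layout, the function names `shadeVisible` and `resolve`, and the trivial "shader" (a multiplication by 0.5) are all assumptions made for illustration. It only shows that shading runs once per visible shading sample and that the resolve step is a pure gather through an index stored per visibility sample.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical entry of the compact shading-sample buffer.
struct ShadingSample {
    float albedo;   // stand-in for the real shading inputs
    bool  visible;  // set by the visibility stage
    float color;    // written by the shading stage
};

// Shading stage: evaluates the (stand-in) shader only for samples that are
// marked visible. Returns the number of samples that were actually shaded.
int shadeVisible(std::vector<ShadingSample>& samples) {
    int shaded = 0;
    for (ShadingSample& s : samples) {
        if (!s.visible) {
            continue;  // skipped entirely, no shading work is done
        }
        s.color = s.albedo * 0.5f;  // placeholder for an expensive shader
        ++shaded;
    }
    return shaded;
}

// Resolve stage: every visibility sample stores the index of its shading
// sample and gathers the final color from the shaded buffer.
std::vector<float> resolve(const std::vector<std::uint32_t>& sampleIndex,
                           const std::vector<ShadingSample>& samples) {
    std::vector<float> framebuffer;
    framebuffer.reserve(sampleIndex.size());
    for (std::uint32_t index : sampleIndex) {
        framebuffer.push_back(samples[index].color);
    }
    return framebuffer;
}
```

Because `resolve` only reads the shaded buffer, calling it once per index buffer (one per render target or view) reuses the same shading results, which is the property the text relies on for multiview rasterization.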
In this section we focus on how to implement decoupled deferred shading on a
modern GPU. In our examples we provide OpenGL Shading Language (GLSL)
source code snippets. We use global atomics and scatter operations, so OpenGL 4.2
or later is required for our application. The implementation could also be done in
DirectX 11.1, which supports unordered access binding to all shader stages.
The primary problem for our example is the lack of hardware support for
decoupled shading reuse, which is an architectural limitation. In hardware, the
memoization cache would be a least recently used (LRU) cache assigned to each
rasterizer unit. Of course, every
component of the pipeline (ultimately even rasterization) can be implemented in
software, but only with reduced performance compared to dedicated hardware.
From now on we assume that our renderer still uses the hardware rasterizer,
though this technique could also be integrated into a full software implementation,
such as [Laine and Karras 11].
3.4.1 Architectural Considerations
Note that the implementation of decoupled sampling for a forward renderer would
be very inefficient on current GPUs. First, using hardware rasterization, we can
only simulate the caching behavior from fragment shaders. Unfortunately, we
cannot prevent the execution of redundant shaders, as the architecture proposed
by [Ragan-Kelley et al. 11] does. The rasterizer will launch fragment shaders for
each covered pixel or subsample, and we can only terminate redundant instances
afterwards. This introduces at least one new code path into the shading code,
breaking its coherency.
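The cost described above can be illustrated with a small CPU-side model. This is a hedged C++ sketch under stated assumptions, not code from this chapter: `ClaimTable`, `fragmentInvocation`, and `runFragments` are hypothetical names, the "fragments" are processed sequentially rather than in parallel, and a `std::atomic` compare-and-exchange stands in for a GPU global atomic. The point is only that every fragment is launched, and a redundant one can merely take an early-out branch after it has already started.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical table with one claim flag per shading sample (0 = unclaimed).
struct ClaimTable {
    std::vector<std::atomic<std::uint32_t>> claimed;
    explicit ClaimTable(std::size_t sampleCount) : claimed(sampleCount) {
        for (auto& flag : claimed) {
            flag.store(0u);
        }
    }
};

// Counters gathered while running the model.
struct FragmentStats {
    int launched;  // invocations started by the (simulated) rasterizer
    int shaded;    // invocations that actually evaluated the shader
};

// Models one fragment shader invocation. It is already running when it
// discovers whether it is redundant, so the only saving is the early return.
bool fragmentInvocation(ClaimTable& table, std::uint32_t sampleId) {
    std::uint32_t expected = 0u;
    if (!table.claimed[sampleId].compare_exchange_strong(expected, 1u)) {
        return false;  // extra code path: redundant instance terminates
    }
    return true;       // this instance would evaluate the shading
}

// The rasterizer launches one invocation per covered pixel or subsample;
// sampleIds[i] is the shading sample that fragment i maps to.
FragmentStats runFragments(const std::vector<std::uint32_t>& sampleIds,
                           std::size_t sampleCount) {
    ClaimTable table(sampleCount);
    FragmentStats stats = {0, 0};
    for (std::uint32_t id : sampleIds) {
        ++stats.launched;
        if (fragmentInvocation(table, id)) {
            ++stats.shaded;
        }
    }
    return stats;
}
```

In this model the number of launched invocations always equals the number of fragments, regardless of how many share a shading sample; only the shaded count drops. On a GPU, the two return paths would additionally diverge within a group of concurrently executing invocations.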
The second problem is how to avoid redundant shading. Shading reuse can
be regarded as an election problem: shader instances corresponding to the same
shading sample must elect one instance that will evaluate the shading, while the
others wait for the result. This can only be solved using global synchronization,