Game Development Reference
buffers might not belong to any visible surfaces. Even filling up the z-buffer in a
depth prepass might not solve the problem: if early z-testing is disabled, a z-culled
fragment can still write data into a uniform image buffer. We therefore execute
another pass that marks visible shading samples, and optionally removes invisible
data from the CG-buffer.
Visibility. Marking visible samples is surprisingly easy. After the sampling stage is
finished, we render a full-screen quad with subsample fragment shader execution,
and each fragment shader stores a visibility flag corresponding to its shading
sample. There is no synchronization needed, as each thread stores the same value.
To evaluate the quality of shading reuse, we used a variant of this technique
that counts the visibility samples per shading sample. In the diagnostics code we
atomically increment a per-shading-sample counter for each subsample in the
framebuffer. The heatmap visualizations in this article were generated using this variant.
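The marking pass and its counting variant can be illustrated with a minimal CPU-side sketch in Python. It is not the shader code used for this article: the function name `mark_visible`, the sentinel `NO_SAMPLE`, and the flat list layout of the framebuffer are assumptions made for illustration only.

```python
NO_SAMPLE = -1  # hypothetical sentinel: subsample not covered by any surface


def mark_visible(sample_index_buffer, num_shading_samples):
    """Return (visible flags, reuse counters), one entry per shading sample.

    sample_index_buffer holds, for every framebuffer subsample, the index of
    the shading sample it refers to (or NO_SAMPLE).
    """
    visible = [0] * num_shading_samples
    counts = [0] * num_shading_samples
    for idx in sample_index_buffer:
        if idx == NO_SAMPLE:
            continue
        # Every writer stores the same value, so on the GPU this store
        # would need no synchronization.
        visible[idx] = 1
        # Diagnostic variant: this increment stands in for an atomic add.
        counts[idx] += 1
    return visible, counts
```

For example, `mark_visible([0, 0, 2, NO_SAMPLE, 2, 2], 4)` flags shading samples 0 and 2 as visible, and the counters show how many subsamples reuse each of them.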
Compaction. Because of the rasterization order, there is no explicit bound on
the size of the compact buffers. Using the visibility flags, we can perform a
stream compaction on the shading data before shading. Besides a more efficient memory
footprint, this also increases the execution coherence during the shading process.
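Stream compaction from visibility flags is commonly built from an exclusive prefix sum followed by a scatter. The sketch below shows that pattern sequentially in Python, assuming the flags are stored as 0/1 integers. The names `exclusive_scan`, `compact`, and the `remap` table are hypothetical, and a GPU version would run both steps in parallel.

```python
def exclusive_scan(flags):
    """Exclusive prefix sum: out[i] is the number of set flags before i."""
    out = []
    total = 0
    for flag in flags:
        out.append(total)
        total += flag
    return out, total


def compact(shading_data, visible):
    """Keep only the visible shading samples, preserving their order.

    Also returns a remap table (old index -> new index, -1 if removed) so
    that visibility samples can be redirected to the compacted buffer.
    """
    offsets, count = exclusive_scan(visible)
    compacted = [None] * count
    remap = [-1] * len(shading_data)
    for i, flag in enumerate(visible):
        if flag:
            compacted[offsets[i]] = shading_data[i]  # scatter step
            remap[i] = offsets[i]
    return compacted, remap
```

Because the surviving samples end up contiguous, a later shading pass can process them as one dense range, which is where the improved coherence comes from.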
In this article we do not provide implementation details for shading, as it is
orthogonal to our decoupling method. The final pixel colors are evaluated by
rendering a full-screen quad and gathering all shaded colors for each visibility
sample. This is the same behavior as the resolve pass of a standard multisampled
framebuffer, except for the location of subsample colors.
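The gather step of such a resolve can be sketched as follows. This is a CPU-side illustration in Python under several assumptions: a plain box filter, subsamples of one pixel stored contiguously, RGB tuples for colors, and uncovered subsamples (marked with the hypothetical sentinel `NO_SAMPLE`) contributing black. The function name `resolve` is also hypothetical.

```python
NO_SAMPLE = -1  # hypothetical sentinel: subsample not covered by any surface


def resolve(sample_index_buffer, shaded_colors, subsamples_per_pixel):
    """Average the shaded colors referenced by each pixel's subsamples."""
    pixels = []
    for start in range(0, len(sample_index_buffer), subsamples_per_pixel):
        r = g = b = 0.0
        for idx in sample_index_buffer[start:start + subsamples_per_pixel]:
            if idx == NO_SAMPLE:
                continue  # assumption: uncovered subsamples add black
            # The color is fetched indirectly from the shading buffer, not
            # from a per-subsample color slot as in a standard resolve.
            cr, cg, cb = shaded_colors[idx]
            r += cr
            g += cg
            b += cb
        n = subsamples_per_pixel
        pixels.append((r / n, g / n, b / n))
    return pixels
```

With four subsamples per pixel, two pointing at a red shading sample and two at a blue one, the resolved pixel is the even blend of both colors even though only two shading evaluations took place.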
In this section we discuss a possible application of our method in deferred rendering.
While current GPU architectures do not have hardware support for decoupled
sampling, the overhead of our global cache management can be amortized by the
reduction of shader evaluations. We focus on stochastic sampling, a rendering
problem especially challenging for deferred shading.
While the software overhead of decoupled sampling makes our method interactive
rather than real time, we demonstrate a significant speedup for scenes with
complex shading. All images in this article were rendered at 1,280
on an Nvidia GTX580 GPU and Intel Core i7 920 CPU.
Adaptive shading. We have computed the average shading rate of these images
to roughly estimate the shading speedup compared to supersampled deferred
shading. We save further computation by reducing the density of the shading
grid of blurry surfaces. Our adaptive shading rate implementation is only a
proof-of-concept based entirely on empirically chosen factors. For better quality,