Game Development Reference
After indices of lights overlapping each tile are calculated in the light-culling stage,
ray-cast jobs are created and accumulated in a job buffer by iterating through
all the screen pixels. This is a screen-space computation in which a thread is
executed for a pixel and goes through the list of lights. If a pixel overlaps a light,
a ray-cast job is created. To create a ray in the ray-casting stage, we need a pixel
index to obtain surface position and normal, and a light index against which the
ray is cast. These two indices are packed into a 32-bit value and stored in the
job buffer.
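
Job creation and packing can be sketched on the CPU as follows. This is only an illustration under stated assumptions: the text does not give the bit layout, so the 22/10 split between pixel index and light index is hypothetical, as are the names `packJob`, `createJobs`, and the sphere-of-influence test used for "a pixel overlaps a light".

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed layout (not given in the text): low 22 bits = pixel index,
// high 10 bits = light index. 1,280 x 720 pixels fit in 20 bits.
constexpr uint32_t kPixelBits = 22;
constexpr uint32_t kPixelMask = (1u << kPixelBits) - 1u;
constexpr uint32_t kLightLimit = 1u << (32 - kPixelBits);

struct Vec3 { float x, y, z; };
struct Light { Vec3 position; float radius; };  // hypothetical point light

inline uint32_t packJob(uint32_t pixelIndex, uint32_t lightIndex) {
    assert(pixelIndex <= kPixelMask);
    assert(lightIndex < kLightLimit);
    return (lightIndex << kPixelBits) | pixelIndex;
}

inline uint32_t jobPixel(uint32_t job) { return job & kPixelMask; }
inline uint32_t jobLight(uint32_t job) { return job >> kPixelBits; }

// Assumed overlap test: the surface point lies inside the light's radius.
inline bool overlaps(const Vec3& p, const Light& l) {
    const float dx = p.x - l.position.x;
    const float dy = p.y - l.position.y;
    const float dz = p.z - l.position.z;
    return dx * dx + dy * dy + dz * dz <= l.radius * l.radius;
}

// Work of one "thread": walk the tile's light list for a single pixel and
// append a packed job for every light that reaches the pixel. On a GPU the
// append would go through an atomic counter or an append buffer instead.
inline void createJobs(uint32_t pixelIndex, const Vec3& surfacePos,
                       const std::vector<uint32_t>& tileLightIndices,
                       const std::vector<Light>& lights,
                       std::vector<uint32_t>& jobBuffer) {
    for (uint32_t lightIndex : tileLightIndices) {
        if (overlaps(surfacePos, lights[lightIndex])) {
            jobBuffer.push_back(packJob(pixelIndex, lightIndex));
        }
    }
}
```

The ray-casting stage would then call `jobPixel` and `jobLight` on each entry to recover the surface data and the light to trace toward.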
After creating all the ray-cast jobs in the buffer, we dispatch a thread for each
ray-cast job. Because each thread casts exactly one ray, this approach does not
suffer from the uneven load balancing we experience when rays are cast in a pixel
shader. After identifying whether a shadow ray is blocked, the result has to be
stored somewhere so that it can be passed to a pixel shader. We focus only on hard
shadows, which means the output from a ray cast is a binary value. Therefore, we
pack the results from 32 rays into one 32-bit value.
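
As a minimal illustration of the compression, the following sketch packs 32 binary ray results into one word. The convention that a set bit means "blocked" and the function names are assumptions made for this example; the text does not state the polarity.

```cpp
#include <cassert>
#include <cstdint>

// Packs 32 binary ray results into one 32-bit word.
// Assumed convention: bit i is 1 when shadow ray i is blocked.
inline uint32_t packRayResults(const bool (&blocked)[32]) {
    uint32_t word = 0;
    for (uint32_t i = 0; i < 32; ++i) {
        if (blocked[i]) {
            word |= (1u << i);
        }
    }
    return word;
}

// Reads back the result of a single ray from the packed word.
inline bool isRayBlocked(uint32_t word, uint32_t rayIndex) {
    assert(rayIndex < 32);
    return ((word >> rayIndex) & 1u) != 0;
}
```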
But in a scene with hundreds of lights, storing a mask for all of them takes
too much space even after this compression. We took advantage of the fact that
we have a list of lights per tile: only the masks for lights in a tile's list are stored.
We limit the number of rays cast per pixel to 128, which means the mask
can be encoded as an int4 value (four 32-bit words). At the ray-casting stage, the
result is written to the mask of the pixel using an atomic OR operation to set the
assigned bit.
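
The per-pixel mask can be sketched on the CPU as four atomic 32-bit words, with `std::atomic::fetch_or` standing in for the GPU atomic OR. The names here are hypothetical, and the sketch assumes that a light's bit position is its slot in the tile's light list and that a set bit means the ray was blocked.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint32_t kMaxRaysPerPixel = 128;
constexpr uint32_t kWordsPerMask = kMaxRaysPerPixel / 32;  // 4 words, like an int4

struct PixelShadowMask {
    std::atomic<uint32_t> words[kWordsPerMask];
};

// Resets every bit; done once per frame before ray casting.
inline void clearMask(PixelShadowMask& mask) {
    for (uint32_t i = 0; i < kWordsPerMask; ++i) {
        mask.words[i].store(0u, std::memory_order_relaxed);
    }
}

// Records that the ray for light slot `slot` is blocked. Many threads may
// write to the same pixel concurrently, so the bit is set with an atomic OR.
inline void markBlocked(PixelShadowMask& mask, uint32_t slot) {
    assert(slot < kMaxRaysPerPixel);
    mask.words[slot / 32].fetch_or(1u << (slot % 32), std::memory_order_relaxed);
}

// Reads back the result for one light slot.
inline bool isBlocked(const PixelShadowMask& mask, uint32_t slot) {
    assert(slot < kMaxRaysPerPixel);
    const uint32_t word = mask.words[slot / 32].load(std::memory_order_relaxed);
    return ((word >> (slot % 32)) & 1u) != 0;
}
```

An OR is sufficient because each ray owns exactly one bit and bits are only ever set after the mask has been cleared, so concurrent writers cannot undo each other's results.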
After separating ray casting from pixel shading, we can keep the final shading
almost the same as in Forward+. We only need to read the shadow mask for each
pixel; whenever a light is processed, the mask is read to get the occlusion.
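
The mask lookup during final shading might look like the following sketch. It assumes a 128-bit mask stored as four words, a set bit meaning "blocked", and a hypothetical precomputed scalar contribution per light in the tile's list; a real shader would evaluate the material's BRDF instead.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using ShadowMask = std::array<uint32_t, 4>;  // 128 bits, one per light slot

// Sums the contributions of the lights in the tile's list, skipping every
// light whose shadow ray was recorded as blocked in the pixel's mask.
// contributions[i] is a placeholder for the shaded result of the i-th light.
inline float shadePixel(const std::vector<float>& contributions,
                        const ShadowMask& mask) {
    assert(contributions.size() <= 128);
    float sum = 0.0f;
    for (std::size_t slot = 0; slot < contributions.size(); ++slot) {
        const bool blocked = ((mask[slot / 32] >> (slot % 32)) & 1u) != 0;
        if (!blocked) {
            sum += contributions[slot];
        }
    }
    return sum;
}
```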
Results. Figure 5.11 is a screenshot of a scene with 512 shadow-casting lights.
We can see legs of chairs are casting shadows from many dynamic lights in the
scene. The screen resolution was 1,280 × 720. The number of rays cast for this
scene was more than 7 million. The frame time was about 32 ms on
an AMD Radeon HD 7970 GPU. The G-pass and light culling took negligible time
compared to ray-cast job creation and ray casting, which took 11.57 ms
and 19.91 ms, respectively, for this frame. This is another example of hybrid
ray-traced and rasterized graphics.
We have presented Forward+, a rendering pipeline that adds a GPU compute-
based light-culling stage to the traditional forward-rendering pipeline to handle
many lights while keeping the flexibility in material usage. We also presented
the implementation details of Forward+ using DirectX 11, along with its performance.
We described how the Forward+ rendering pipeline is extended to use an indirect
illumination technique in the AMD Leo Demo.