Takahiro HARADA

“Forward+: Bringing Deferred Lighting to the Next Level” by Takahiro Harada, Jay McKee, and Jason C. Yang (AMD):

🔧 Overview

Forward+ is a hybrid rendering technique that combines the flexibility of forward rendering with the scalability of deferred lighting. It introduces a GPU-based light culling stage to efficiently handle scenes with many dynamic lights, overcoming the limitations of both traditional forward and deferred rendering.

🧱 Motivation

Strengths and Limitations of Traditional Techniques:

Forward Rendering:
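- Limited number of lights per draw call; supporting many light combinations leads to a shader permutation explosion.
- Poor scalability with many dynamic lights, since each shaded fragment evaluates every light assigned to its object.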
- Good material flexibility and transparency handling.
Deferred Rendering:
- Efficient with many lights.
- High memory bandwidth and storage requirements (G-buffers).
- Poor support for complex materials and transparency.
- Limited anti-aliasing support (MSAA is costly with G-buffers).

Goal:

To create a rendering pipeline that:

Supports many lights.
Maintains material and lighting model flexibility.
Reduces memory traffic compared to deferred lighting.

🛠️ Forward+ Pipeline

Forward+ extends the forward rendering pipeline with a light culling stage using GPU compute shaders. The pipeline consists of:

Depth Prepass:
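- Renders scene depth first, so the final shading pass only shades visible fragments and avoids overdraw.
- Essential for keeping final shading cheap; the depth buffer is also used during light culling.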
Light Culling:
- The screen is divided into tiles (e.g., 16×16 pixels).
- For each tile, a list of overlapping lights is computed.
- Implemented entirely on the GPU using compute shaders.
Final Shading:
- Each pixel accesses the list of lights for its tile.
- Because shading happens in a forward pass, full material and light data are available, so arbitrary materials and BRDFs can be evaluated.
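To make the tile lookup in the final shading step concrete, here is a minimal CPU-side sketch in Python. It assumes 16×16 tiles, row-major tile numbering, lights already projected to screen space, and a per-tile list of light indices produced by light culling. `PointLight`, `tile_grid`, `tile_index`, `shade_pixel`, and the linear falloff are all hypothetical illustrations, not the paper's code; a real shader would evaluate the material's BRDF instead.

```python
from dataclasses import dataclass

TILE_SIZE = 16  # tile edge in pixels, matching the 16x16 example


@dataclass
class PointLight:
    x: float          # screen-space center in pixels (hypothetical simplification)
    y: float
    radius: float     # influence radius in pixels
    intensity: float


def tile_grid(width: int, height: int) -> tuple[int, int]:
    """Number of tiles horizontally and vertically; partial tiles round up."""
    return ((width + TILE_SIZE - 1) // TILE_SIZE,
            (height + TILE_SIZE - 1) // TILE_SIZE)


def tile_index(px: int, py: int, tiles_x: int) -> int:
    """Row-major index of the tile that contains pixel (px, py)."""
    return (py // TILE_SIZE) * tiles_x + (px // TILE_SIZE)


def shade_pixel(px: int, py: int, tiles_x: int,
                tile_lights: list[list[int]], lights: list[PointLight]) -> float:
    """Final shading: loop only over the lights culled for this pixel's tile."""
    total = 0.0
    for li in tile_lights[tile_index(px, py, tiles_x)]:
        light = lights[li]
        # Distance from the pixel center to the light center.
        d = ((px + 0.5 - light.x) ** 2 + (py + 0.5 - light.y) ** 2) ** 0.5
        # Hypothetical linear falloff standing in for a full BRDF evaluation.
        total += light.intensity * max(0.0, 1.0 - d / light.radius)
    return total
```

For a 1920×1080 target, `tile_grid` gives a 120×68 grid, and pixel (17, 0) falls in tile 1. The point of the design is that the per-pixel loop length depends on how many lights overlap the tile, not on the total number of lights in the scene.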

💡 Light Culling Techniques

Two GPU-based implementations are described:

1. Gather Approach:

One compute work group (thread group) per tile.
Each thread tests a subset of the lights against the tile's frustum.
Indices of overlapping lights are appended to the work group's on-chip local (shared) memory and then written to global memory.
Simple and efficient for a moderate number of lights.
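The gather approach can be simulated sequentially as below. This is a sketch under stated assumptions, not the paper's kernel: lights are assumed to be pre-projected to screen-space circles with a view-space depth extent, and the tile's depth range is assumed to come from the depth prepass. The overlap test is a simplified circle-versus-rectangle plus depth-interval check standing in for a true frustum test. `ScreenLight`, `overlaps`, `cull_tile_gather`, and `group_size` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ScreenLight:
    cx: float    # projected center in pixels (hypothetical simplification)
    cy: float
    r: float     # projected radius in pixels
    zmin: float  # view-space depth extent of the light's bounding sphere
    zmax: float


def overlaps(light: ScreenLight, x0: float, y0: float, x1: float, y1: float,
             dmin: float, dmax: float) -> bool:
    """Circle-vs-rectangle test in screen space plus a depth-interval test."""
    # Closest point of the tile rectangle to the circle center.
    nx = min(max(light.cx, x0), x1)
    ny = min(max(light.cy, y0), y1)
    in_xy = (light.cx - nx) ** 2 + (light.cy - ny) ** 2 <= light.r ** 2
    in_z = light.zmax >= dmin and light.zmin <= dmax
    return in_xy and in_z


def cull_tile_gather(tile_rect, depth_range, lights, group_size=64):
    """Simulates one work group for one tile.

    Thread t tests lights t, t + group_size, t + 2 * group_size, ...
    """
    x0, y0, x1, y1 = tile_rect
    dmin, dmax = depth_range
    shared = []  # stands in for the work group's local (shared) memory
    for t in range(group_size):
        for li in range(t, len(lights), group_size):
            if overlaps(lights[li], x0, y0, x1, y1, dmin, dmax):
                shared.append(li)  # on a GPU: atomic increment of a local counter
    # GPU append order is nondeterministic; sort here only for reproducibility
    # before "exporting" the list to the global light-index buffer.
    return sorted(shared)
```

The strided loop mirrors how each thread in the group handles a slice of the light list, and the single shared list mirrors why local memory, rather than per-thread storage, is the natural place to collect results.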

2. Scatter Approach:

One thread per light.
Each light determines which tiles it overlaps.
Writes light-tile pairs to a buffer.
The buffer is sorted by tile index (e.g., with a radix sort), so each tile's lights end up in one contiguous range.
More scalable for large numbers of lights.
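A sequential sketch of the scatter approach follows. It assumes lights are given as projected screen-space circles `(cx, cy, r)` and uses the circle's bounding box to find touched tiles, which is conservative. Python's stable `list.sort` stands in for the GPU radix sort, and an exclusive prefix sum over per-tile counts yields each tile's start offset. `scatter_cull` and its input/output format are hypothetical.

```python
import math

TILE = 16  # tile edge in pixels


def scatter_cull(lights, width, height):
    """lights: list of (cx, cy, r) projected circles in pixels.

    Returns (offsets, counts, indices), where tile t's lights are
    indices[offsets[t] : offsets[t] + counts[t]].
    """
    tiles_x = math.ceil(width / TILE)
    tiles_y = math.ceil(height / TILE)
    pairs = []
    # One "thread" per light: emit a (tile, light) pair for every tile
    # touched by the light's bounding box (conservative).
    for li, (cx, cy, r) in enumerate(lights):
        tx0 = max(0, int((cx - r) // TILE))
        tx1 = min(tiles_x - 1, int((cx + r) // TILE))
        ty0 = max(0, int((cy - r) // TILE))
        ty1 = min(tiles_y - 1, int((cy + r) // TILE))
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                pairs.append((ty * tiles_x + tx, li))
    # Sort by tile index; a GPU implementation would use a parallel radix sort.
    pairs.sort(key=lambda p: p[0])
    num_tiles = tiles_x * tiles_y
    counts = [0] * num_tiles
    for tile, _ in pairs:
        counts[tile] += 1
    # Exclusive prefix sum: start offset of each tile in the sorted list.
    offsets = [0] * num_tiles
    for t in range(1, num_tiles):
        offsets[t] = offsets[t - 1] + counts[t - 1]
    indices = [li for _, li in pairs]
    return offsets, counts, indices
```

For a 32×16 target with lights `[(8, 8, 4), (24, 8, 4), (16, 8, 2)]`, the third light straddles both tiles, so each tile ends up with two lights: tile 0 gets lights 0 and 2, tile 1 gets lights 1 and 2.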

📊 Performance & Memory Analysis

Theoretical Memory Traffic:

Forward+ avoids writing full-screen G-buffers.
Writes only light indices per tile.
Reads light indices and light data during final shading.
Deferred lighting reads/writes more data (normals, depth, light accumulation).

Key Formula:

Forward+ generates less memory traffic than deferred lighting when:

\[M < 15 \cdot \frac{1 + (1 + L)T}{T}\]

Where:

\(M\): average number of lights per tile
\(L\): size of one light's data (e.g., 8 floats for a point light)
\(T\): number of pixels per tile (e.g., 16 × 16 = 256)
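Plugging the example values into the formula as stated gives a sense of scale. This sketch assumes \(T\) counts pixels per tile and takes the constant 15 directly from the formula; `forward_plus_threshold` is a hypothetical helper name.

```python
def forward_plus_threshold(L: float, T: int) -> float:
    """Right-hand side of M < 15 * (1 + (1 + L) * T) / T, evaluated as stated."""
    return 15 * (1 + (1 + L) * T) / T


# Example values from the text: point lights with L = 8 floats,
# 16x16 tiles (T = 256 pixels).
print(forward_plus_threshold(8, 16 * 16))  # 135.05859375
```

Algebraically the right-hand side equals \(15(1 + L + 1/T)\), so for realistic tile sizes the threshold is nearly independent of \(T\). Under this formula and these values, Forward+ comes out ahead as long as fewer than about 135 lights overlap a tile on average.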

Experimental Results:

Tested on an AMD Radeon HD 6970 (discrete GPU) and an AMD A8-3510MX APU (integrated GPU).
Forward+ outperformed compute-based deferred lighting in total rendering time.
Performance gap widened under lower memory bandwidth conditions.
Well-suited for tile-based rendering architectures (e.g., mobile GPUs).

🎯 Advantages of Forward+

Material Flexibility: Supports complex BRDFs and physically-based shading.
Lower Memory Bandwidth: Especially beneficial on bandwidth-constrained GPUs.
Scalability: Efficiently handles thousands of dynamic lights.
GPU-Only Pipeline: No CPU-side light management needed.

🔮 Future Work

Dynamic Tile Sizing: Automatically optimize tile size based on scene complexity.
Shadowing: Integrate per-light shadow computation (e.g., via ray casting).
Further Optimization: Explore spatial subdivision for better light culling accuracy.