WebGPU fragment shader optimization #8733
Emit WebGPU fragment builtins only when the processed source references the corresponding pc* globals. This avoids carrying unused front-facing, primitive-index, position, and sample-index inputs through material fragment shaders; in particular sample_index is no longer requested unless a shader actually needs pcSampleIndex. Refactor the WGSL clustered-light hot path to avoid mutating a large ClusterLightData through ptr<function> helper calls. Core light decode now returns value data, and optional spot, area, shadow, cookie, and omni-atlas data is decoded into smaller values at the point of use to reduce register pressure and potential function-memory spills.
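The first change, conditional builtin emission, might look roughly like the sketch below. This is not the engine's actual processor code: the table, the function name emitFragmentBuiltins, the input field names, and the exact declarations for front_facing, primitive_index, and sample_index are assumptions for illustration. Only the pc* global names and the position declaration come from this discussion.

```javascript
// Hypothetical mapping from pc* private globals to WGSL fragment builtins.
// Field names and most declarations are assumed, not taken from the engine.
const FRAGMENT_BUILTINS = [
    { global: 'pcPosition', field: 'position', decl: '@builtin(position) position : vec4f' },
    { global: 'pcFrontFacing', field: 'frontFacing', decl: '@builtin(front_facing) frontFacing : bool' },
    { global: 'pcPrimitiveIndex', field: 'primitiveIndex', decl: '@builtin(primitive_index) primitiveIndex : u32' },
    { global: 'pcSampleIndex', field: 'sampleIndex', decl: '@builtin(sample_index) sampleIndex : u32' }
];

// Returns the input declarations and the copy statements for only those
// builtins whose pc* global appears as a whole identifier in the source.
function emitFragmentBuiltins(source) {
    const used = FRAGMENT_BUILTINS.filter(
        b => new RegExp(`\\b${b.global}\\b`).test(source)
    );
    return {
        inputs: used.map(b => `    ${b.decl},`),
        copies: used.map(b => `    ${b.global} = input.${b.field};`)
    };
}
```

A shader that only reads pcPosition for fog would then get a single position input and a single copy, and sample_index would never be requested. The sketch assumes comments have already been stripped from the source, since a pc* name inside a comment would otherwise count as a reference.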
The fragment shader overhead came mostly from generated WGSL asking the compiler to carry data the shader did not actually use. Before the fix, the WGSL processor always emitted the fragment builtin inputs for WebGPU, such as @builtin(position) position : vec4f, and copied them into private globals, such as pcPosition = input.position;

For the material shaders we compared, only pcPosition was used, for fog. pcFrontFacing, pcPrimitiveIndex, and usually pcSampleIndex were dead plumbing. sample_index in particular can be expensive, because requesting it may force sample-rate fragment shading on MSAA targets, which is much more work than pixel-rate shading.

The clustered-light path also had WGSL-specific overhead: it decoded light data into a large ClusterLightData local and passed it through helpers as ptr<function, ClusterLightData>. That makes the hot per-light loop look like mutable function-memory traffic to the compiler. It can increase register pressure or cause spills, especially because the struct included fields needed only by the optional spot/shadow/cookie/area paths.

The fix was:
Emit WebGPU fragment builtins only when the processed source references the corresponding pc* globals. This avoids carrying unused front-facing, primitive-index, position, and sample-index inputs through material fragment shaders; in particular sample_index is no longer requested unless a shader actually needs pcSampleIndex.
Refactor the WGSL clustered-light hot path to avoid mutating a large ClusterLightData through ptr<function> helper calls. Core light decode now returns value data, and optional spot, area, shadow, cookie, and omni-atlas data is decoded into smaller values at the point of use to reduce register pressure and potential function-memory spills.
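The data-flow shape of the clustered-light refactor can be modeled outside WGSL. The sketch below is a JavaScript stand-in, not the shader code: every name, flag, and record field in it is hypothetical, and JavaScript has no registers to spill. It only shows which data is decoded where: the core decode returns a small value, and the spot data is decoded inside the one branch that uses it.

```javascript
// Hypothetical light-type flag; the real shader's flags are not shown in this thread.
const LIGHT_SPOT = 1;

// Core decode returns a small value instead of filling a large shared struct.
function decodeCoreLight(raw) {
    return { position: raw.position, intensity: raw.intensity, range: raw.range, flags: raw.flags };
}

// Optional data gets its own small value, decoded only by the path that needs it.
function decodeSpotData(raw) {
    return { direction: raw.direction, cosOuter: raw.cosOuter };
}

function shadeLight(raw, surfacePos) {
    const core = decodeCoreLight(raw);

    // Vector from the light to the shaded surface point.
    const dx = surfacePos[0] - core.position[0];
    const dy = surfacePos[1] - core.position[1];
    const dz = surfacePos[2] - core.position[2];
    const dist = Math.hypot(dx, dy, dz);
    if (dist >= core.range) return 0;

    // Simple linear falloff, chosen only to keep the example short.
    const atten = 1 - dist / core.range;

    if (core.flags & LIGHT_SPOT) {
        // Spot data lives only inside this branch, so omni lights never carry it.
        const spot = decodeSpotData(raw);
        const cosAngle =
            (dx * spot.direction[0] + dy * spot.direction[1] + dz * spot.direction[2]) / dist;
        if (cosAngle < spot.cosOuter) return 0;
    }
    return atten * core.intensity;
}
```

In the WGSL version the same idea applies to the shadow, cookie, area, and omni-atlas data: each would be decoded into its own small value next to the code that consumes it, rather than being written into one ClusterLightData through a pointer.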