WebGPU fragment shader optimization #8733

Open
cabanier wants to merge 1 commit into playcanvas:main from cabanier:webgpu_fragment_shader_optimization

@cabanier

Emit WebGPU fragment builtins only when the processed source references the corresponding pc* globals. This avoids carrying unused front-facing, primitive-index, position, and sample-index inputs through material fragment shaders; in particular sample_index is no longer requested unless a shader actually needs pcSampleIndex.

Refactor the WGSL clustered-light hot path to avoid mutating a large ClusterLightData through ptr<function> helper calls. Core light decode now returns value data, and optional spot, area, shadow, cookie, and omni-atlas data is decoded into smaller values at the point of use to reduce register pressure and potential function-memory spills.

@cabanier
Author

The fragment shader overhead was mostly from generated WGSL asking the compiler to carry data the shader did not actually use.

Before the fix, the WGSL processor always emitted these fragment inputs/globals for WebGPU:

@builtin(position) position : vec4f,
@builtin(front_facing) frontFacing : bool,
@builtin(sample_index) sampleIndex : u32,
@builtin(primitive_index) primitiveIndex : u32,

and copied them into private globals:

pcPosition = input.position;
pcFrontFacing = input.frontFacing;
pcSampleIndex = input.sampleIndex;
pcPrimitiveIndex = input.primitiveIndex;

For the material shaders we compared, only pcPosition was used for fog. pcFrontFacing, pcPrimitiveIndex, and usually pcSampleIndex were dead plumbing. In particular, sample_index can be expensive because requesting it may force sample-rate fragment shading on MSAA targets, which is much more work than pixel-rate shading.

The clustered-light path also had WGSL-specific overhead: it decoded light data into a large ClusterLightData local and passed it through helpers as ptr<function, ClusterLightData>. That makes the hot per-light loop look like mutable function-memory traffic to the compiler. It can increase register pressure or cause spills, especially because the struct included fields only needed by optional spot/shadow/cookie/area paths.

The fix was:

  • emit position, front_facing, sample_index, and primitive_index only when the final fragment source references pcPosition, pcFrontFacing, pcSampleIndex,
    or pcPrimitiveIndex;
  • change clustered-light helpers to return smaller value structs/vectors instead of mutating a large pointer-passed ClusterLightData;
  • remove the half-precision conversion churn, which was adding lots of half(...), half3(...), and f32(...) casts around ordinary lighting math.
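
The first bullet can be sketched roughly as follows. This is a minimal illustration, not the processor code from this PR: the FRAGMENT_BUILTINS table and the emitFragmentBuiltins function are hypothetical names, and the whole-identifier regex test is an assumption about how "references" might be detected. A naive scan like this would also match a pc* name that only appears in a comment, so a real implementation would presumably run on already-processed source.

```javascript
// Hypothetical table mapping pc* globals to WGSL fragment builtins.
// Field names and types mirror the declarations quoted above; the real
// processor's data structures are not shown in this thread and may differ.
const FRAGMENT_BUILTINS = [
    { global: 'pcPosition', field: 'position', builtin: 'position', type: 'vec4f' },
    { global: 'pcFrontFacing', field: 'frontFacing', builtin: 'front_facing', type: 'bool' },
    { global: 'pcSampleIndex', field: 'sampleIndex', builtin: 'sample_index', type: 'u32' },
    { global: 'pcPrimitiveIndex', field: 'primitiveIndex', builtin: 'primitive_index', type: 'u32' }
];

// Returns the input-struct members and copy statements for only those
// builtins whose pc* global appears as a whole identifier in the source.
function emitFragmentBuiltins(source) {
    const used = FRAGMENT_BUILTINS.filter(
        b => new RegExp(`\\b${b.global}\\b`).test(source)
    );
    return {
        inputs: used.map(b => `@builtin(${b.builtin}) ${b.field} : ${b.type},`),
        copies: used.map(b => `${b.global} = input.${b.field};`)
    };
}

// Example: a shader body that only reads pcPosition (e.g. for fog).
const result = emitFragmentBuiltins('let depth = pcPosition.z;');
console.log(result.inputs.join('\n'));
console.log(result.copies.join('\n'));
```

With a source that only reads pcPosition, the sketch emits a single input member and a single copy, so sample_index is never requested.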

@cabanier changed the title from "Webgpu fragment shader optimization" to "WebGPU fragment shader optimization" on May 15, 2026