Enable Fused Kernels by Default for Memory Efficiency #6832
Code Review
This pull request changes the default value of use_fused_kernels from False to True across several configuration files. The reviewer identified a critical issue where enabling this by default will cause a runtime crash (NotImplementedError) if calculate_sum_pi_squared is enabled, and suggested adding a validation check in ActorConfig.__post_init__ to gracefully disable fused kernels with a warning in that case.
  checkpoint: CheckpointConfig = field(default_factory=CheckpointConfig)
  optim: OptimizerConfig = field(default_factory=OptimizerConfig)
- use_fused_kernels: bool = False
+ use_fused_kernels: bool = True
Enabling use_fused_kernels by default will cause a runtime crash (NotImplementedError) for any configuration that has calculate_sum_pi_squared: True enabled.
Both FSDPEngineWithLMHead (in verl/workers/engine/fsdp/transformer_impl.py) and MegatronEngineWithLMHead (in verl/workers/engine/megatron/transformer_impl.py) explicitly raise NotImplementedError when both calculate_sum_pi_squared and use_fused_kernels are True because fused kernels do not materialize the full logits tensor needed for Sigma pi^2.
To prevent this crash and provide a graceful fallback, please add a check in ActorConfig.__post_init__ to automatically disable use_fused_kernels with a warning when calculate_sum_pi_squared is enabled.
if self.calculate_sum_pi_squared and self.use_fused_kernels:
    import warnings
    warnings.warn(
        "calculate_sum_pi_squared=True is not supported with use_fused_kernels=True. "
        "Automatically disabling use_fused_kernels to allow Sigma pi^2 computation.",
        UserWarning,
    )
    self.use_fused_kernels = False
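A self-contained way to try the suggested fallback is sketched below. `DemoActorConfig` is a hypothetical stand-in with only the two relevant fields, not verl's real `ActorConfig`, so treat it as an illustration of the check rather than the actual patch.

```python
import warnings
from dataclasses import dataclass


@dataclass
class DemoActorConfig:
    """Hypothetical stand-in for ActorConfig (assumption: only two fields shown)."""

    use_fused_kernels: bool = True
    calculate_sum_pi_squared: bool = False

    def __post_init__(self):
        # Fused kernels never build the full logits tensor, so fall back here
        # instead of letting the engine raise NotImplementedError later.
        if self.calculate_sum_pi_squared and self.use_fused_kernels:
            warnings.warn(
                "calculate_sum_pi_squared=True is not supported with "
                "use_fused_kernels=True. Automatically disabling use_fused_kernels.",
                UserWarning,
            )
            self.use_fused_kernels = False


# Usage: the conflicting combination is downgraded with a warning.
cfg = DemoActorConfig(calculate_sum_pi_squared=True)
print(cfg.use_fused_kernels)
```

With this shape, a user who sets only `calculate_sum_pi_squared` gets a working configuration plus a visible warning, while the new default stays in effect for everyone else.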
@0hujun please fix pre-commit
Summary
This PR proposes changing the default value of use_fused_kernels from False to True across all engine backends (FSDP2, Megatron, VeOmni, AutoModel). Fused kernels provide significant memory savings (32x reduction in logits memory) and enable longer context training without sacrificing correctness. The change includes graceful fallback logic for incompatible configurations.
Motivation
The Problem
Currently, use_fused_kernels defaults to False in three config classes:
| Config class | File | Default |
| --- | --- | --- |
| HFModelConfig | verl/workers/config/model.py | False |
| ActorConfig | verl/workers/config/actor.py | False |
| EngineConfig | verl/workers/config/engine.py | False |
And in the canonical YAML config:
This means users must explicitly opt-in to fused kernels, and most users are unaware of this feature, resulting in:
The Benefits
Fused kernels avoid materializing the full logits tensor by computing log_probs and entropy directly from hidden_states and vocab_weights:
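The idea can be illustrated with a minimal pure-Python sketch. This is not verl's kernel (the real fused kernels run on the accelerator); the function name `fused_logprob_and_entropy` and the list-of-lists layout are assumptions made for illustration. It computes each token's log-probability and entropy from one row of logits at a time, so the full tokens-by-vocabulary logits matrix never exists.

```python
import math


def fused_logprob_and_entropy(hidden_states, vocab_weights, labels):
    """Per-token label log-prob and entropy without a full logits matrix.

    hidden_states: T rows of length H
    vocab_weights: V rows of length H
    labels:        T ints in [0, V)
    """
    log_probs, entropies = [], []
    for h, label in zip(hidden_states, labels):
        # Only this single row of V logits is alive at any moment.
        row = [sum(hi * wi for hi, wi in zip(h, w)) for w in vocab_weights]
        m = max(row)  # subtract the max for a numerically stable logsumexp
        lse = m + math.log(sum(math.exp(z - m) for z in row))
        log_probs.append(row[label] - lse)
        # Entropy: -sum(p * log p) = lse - sum(p * z), with p = exp(z - lse)
        entropies.append(lse - sum(math.exp(z - lse) * z for z in row))
    return log_probs, entropies


# Usage: one token, two vocabulary entries with logits [0, log 3].
lp, ent = fused_logprob_and_entropy(
    [[1.0, 0.0]], [[0.0, 0.0], [math.log(3.0), 0.0]], [1]
)
print(lp, ent)
```

Peak memory for logits in this sketch scales with the vocabulary size rather than with tokens times vocabulary size, which is the property the fused path relies on.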
Real-world impact (Qwen3.5-9B, 32K context, Ascend910 61GB NPU):