Correct Hybrid FLOP Calculation #4508
Signed-off-by: Philip Petrakian <ppetrakian@nvidia.com> (cherry picked from commit d3165d6)
Signed-off-by: Philip Petrakian <ppetrakian@nvidia.com>
/ok to test e6f0afe
Review: Correct Hybrid FLOP Calculation
The math for splitting attention FLOPs into projection (linear in seq_len) and core-attention (quadratic in seq_len) terms looks correct and is consistent with the transformer path. The SWA accounting properly mirrors the existing transformer SWA logic. Good parity test coverage. Two items to clarify:
No perf tests impacted.
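For readers following the review, the projection/core-attention split can be sketched as below. This is only an illustration under stated assumptions, not the code in this PR: the function name `attention_flops_split`, its parameters, and the convention of counting `seq_len * min(seq_len, window)` attended pairs (which ignores causal-mask savings) are all hypothetical.

```python
def attention_flops_split(seq_len, hidden, num_heads, head_dim,
                          num_kv_heads=None, window=None):
    """Forward FLOPs of one attention layer on one sequence.

    Hypothetical helper for illustration only. A multiply-add is
    counted as 2 FLOPs. Returns (projection_flops, core_flops).
    """
    kv_heads = num_heads if num_kv_heads is None else num_kv_heads
    q_dim = num_heads * head_dim
    kv_dim = kv_heads * head_dim

    # Q, K, V, and output projections: linear in seq_len.
    proj = 2 * seq_len * hidden * (q_dim + 2 * kv_dim + q_dim)

    # Assumed convention: every query attends to at most `window` keys;
    # without a window, to all seq_len keys. Causal savings are ignored.
    keys_per_query = seq_len if window is None else min(seq_len, window)
    pairs = seq_len * keys_per_query

    # QK^T and attention-weighted V: 2 matmuls, quadratic without a window.
    core = 2 * 2 * pairs * q_dim
    return proj, core


p1, c1 = attention_flops_split(1024, 512, 8, 64)
p2, c2 = attention_flops_split(2048, 512, 8, 64)
print(p2 // p1, c2 // c1)  # projection doubles, full core attention quadruples
```

With a sliding window no larger than the sequence, the core term in this sketch grows linearly with seq_len instead, which is why SWA layers need their own accounting.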
Signed-off-by: Philip Petrakian <ppetrakian@nvidia.com>
/ok to test 9551726
Summary
This PR splits the HybridModel TFLOPs/MFU accounting fix out of the GPT-OSS HybridModel migration.
HybridModel commonly represents one logical decoder block as multiple physical hybrid layers, for example `*E` for attention followed by MoE. Bridge's generic FLOP accounting needs to count those physical symbols as the corresponding logical attention, MLP, and MoE work. Without that, GPT-OSS-style `*E` layouts can report roughly doubled TFLOPs/MFU even when runtime throughput is unchanged.

Changes:
- `hybrid_layer_pattern` through hybrid FLOP accounting
- `get_hybrid_layer_counts()`, `parse_hybrid_pattern()`, and `Symbols` helpers for HybridModel pattern handling
- `seqlen_squared_sum` is honored
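To make the doubling concrete, here is a minimal sketch of per-symbol counting. It is not the PR's `get_hybrid_layer_counts()` or `parse_hybrid_pattern()`: the function `count_hybrid_symbols` and the symbol table are hypothetical. Only `*` (attention) and `E` (MoE) come from the description above; `-` for MLP and `M` for Mamba are assumptions.

```python
from collections import Counter

# Hypothetical symbol table. "*" and "E" follow the PR description;
# "-" (MLP) and "M" (Mamba) are assumed for illustration.
SYMBOL_KINDS = {"*": "attention", "E": "moe", "-": "mlp", "M": "mamba"}


def count_hybrid_symbols(pattern):
    """Count logical work per kind from a hybrid layer pattern string."""
    counts = Counter(pattern)
    unknown = set(counts) - set(SYMBOL_KINDS)
    if unknown:
        raise ValueError(f"unknown pattern symbols: {sorted(unknown)}")
    return {kind: counts[sym] for sym, kind in SYMBOL_KINDS.items()}


pattern = "*E" * 24  # 24 logical decoder blocks, 48 physical layers
counts = count_hybrid_symbols(pattern)

# Naive accounting treats every physical layer as a full decoder block
# (attention + MoE), so each kind of work is counted twice.
naive_attention = len(pattern)
naive_moe = len(pattern)

print(counts["attention"], counts["moe"])  # 24 24
print(naive_attention, naive_moe)          # 48 48
```

Counting by symbol keeps the attention and MoE totals at one each per logical block, which matches the "roughly doubled" discrepancy described above when the naive path is used instead.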