
[TLE] [ENFLAME] support TLE features#602

Merged
zhzhcookie merged 28 commits into triton_v3.6.x from enflame/update_0520
May 22, 2026
@baoqiliu baoqiliu commented May 20, 2026

[ENFLAME] Support TLE features for GCU backend
Enable TLE (Triton Language Extension) framework support on the GCU backend,
including the TLE-Raw MLIR EDSL, the GCU Dialect EDSL (TOPS), compiler pipeline
integration, GCU300/GCU400 C++ backend upgrades, and test coverage for the
new features.

TLE-Lite status

Python DSL                                     Status
tle.load(is_async=True)                        ✅ Supported
tle.cumsum                                     ✅ Supported
tle.extract_tile                               ✅ Supported
tle.insert_tile                                ✅ Supported
tle.device_mesh                                ❌ Not supported
tle.sharding, tle.S, tle.P, tle.B              ❌ Not supported
tle.make_sharded_tensor, tle.ShardedTensor     ❌ Not supported
tle.shard_id                                   ❌ Not supported
tle.distributed_barrier                        ❌ Not supported
tle.remote                                     ❌ Not supported
tle.reshard                                    ❌ Not supported
tle.distributed_dot                            ❌ Not supported

TLE-Struct (GPU) status

Python DSL                                     Status
tle.gpu.alloc                                  ✅ Supported
tle.gpu.local_ptr                              ✅ Supported
tle.gpu.copy (TMA)                             ✅ Supported
tle.gpu.memory_space                           ✅ Supported
tle.gpu.pipeline                               ❌ Not supported

Test Result:
====================================== 4708 passed, 3482 skipped, 58 warnings in 2920.80s (0:48:40) =======================================

1. TLE Framework & GCU Dialect EDSL

1.1 TLE-Raw TOPS Runtime (new module: tle/raw/tops/) and Pass Support

- Add gcu_dialect.py: GCU MLIR dialect EDSL with native !gcu.ptr<T>
  type support, gcu.ptr2memref, vector.maskedload patterns
- Add mlir_runtime.py: MLIR-based runtime for GCU dialect IR execution
- Add runtime.py: TOPS C++ kernel integration runtime
- Extend codegen.py and runtime.py to support GCU backend dispatch
- Add gcu300.py, gcu400.py, mlir.py: typed Python wrappers for
  C++ MLIR passes via libtriton_gcu{300,400}_core
- Integrate TLE-raw pre-passes (TleToTritonGCU, ConvertArgToMemDesc,
  RemoveRedundantCopy, DSLRegionInline) with has_tle_raw detection
- Add gcu_intrinsics.py for GCU intrinsic function mapping, with support
  for the IntrinsicsGCU/IntrinsicsTCLE data files on GCU400
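
The has_tle_raw gating of the TLE-raw pre-passes can be pictured with a minimal sketch. The four pass names and TritonToGCU come from the list above. Everything else is an assumption for illustration: the marker string, the text-based detection, and the build_pipeline helper are hypothetical and not taken from the PR's code.

```python
from typing import List

# Pass names as listed in this PR description.
TLE_RAW_PRE_PASSES: List[str] = [
    "TleToTritonGCU",
    "ConvertArgToMemDesc",
    "RemoveRedundantCopy",
    "DSLRegionInline",
]


def has_tle_raw(ir_text: str, marker: str = "tle.raw") -> bool:
    """Hypothetical detection: treat a module as TLE-raw if the marker appears.

    The real detection rule is not described in the PR; `marker` is a placeholder.
    """
    return marker in ir_text


def build_pipeline(ir_text: str, base_passes: List[str]) -> List[str]:
    """Prepend the TLE-raw pre-passes only when the module needs them."""
    passes: List[str] = []
    if has_tle_raw(ir_text):
        passes.extend(TLE_RAW_PRE_PASSES)
    passes.extend(base_passes)
    return passes


if __name__ == "__main__":
    print(build_pipeline("module { tle.raw.call @k }", ["TritonToGCU"]))
    print(build_pipeline("module { }", ["TritonToGCU"]))
```

Gating on detection keeps the pipeline for ordinary Triton kernels unchanged, so modules without TLE-raw regions would not pay for the extra passes.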

1.2 TLE-Struct/Lite Core Pass Support

- Support TleToTritonGCU.cpp: TLE-to-TritonGCU conversion
  Phase 1 - Barrier Insertion
  Phase 2 - Lower TLE ops to TritonGCU ops (tle.local_pointers, ttg.tma_copy, tle.extract_ptr)
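
The two-phase structure described above (barriers first, then op lowering) can be sketched on a toy op list. This is not the logic of TleToTritonGCU.cpp: the barrier placement rule (a barrier before a local-memory read that follows a local-memory write), the Op record, and the `lowered<...>` placeholder names are all assumptions made for illustration. Only the three source op names come from the PR text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Op:
    """Toy stand-in for an IR operation (hypothetical, not an MLIR class)."""
    name: str
    writes_local: bool = False
    reads_local: bool = False


# Ops named in the PR as lowered in Phase 2.
TLE_OPS = {"tle.local_pointers", "ttg.tma_copy", "tle.extract_ptr"}


def insert_barriers(ops: List[Op]) -> List[Op]:
    """Phase 1 (assumed rule): barrier between a local write and a later local read."""
    out: List[Op] = []
    pending_write = False
    for op in ops:
        if op.reads_local and pending_write:
            out.append(Op("barrier"))
            pending_write = False
        out.append(op)
        if op.writes_local:
            pending_write = True
    return out


def lower_tle_ops(ops: List[Op]) -> List[Op]:
    """Phase 2: rewrite TLE ops; target names are placeholders, not real op names."""
    return [
        Op(f"lowered<{op.name}>", op.writes_local, op.reads_local)
        if op.name in TLE_OPS else op
        for op in ops
    ]


def run_two_phase(ops: List[Op]) -> List[Op]:
    return lower_tle_ops(insert_barriers(ops))


if __name__ == "__main__":
    program = [
        Op("ttg.tma_copy", writes_local=True),
        Op("tle.local_pointers", reads_local=True),
        Op("tt.return"),
    ]
    for op in run_two_phase(program):
        print(op.name)
```

Running barrier insertion before lowering means the synchronization analysis in this sketch sees the original TLE ops, whose memory effects are easier to reason about than those of the lowered forms.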

2. GCU Backend

2.1 Conversion Passes (Optimized)

- Rewrite ReduceOpToGCU.cpp: improved reduce lowering
- Rewrite ScanOpToGCU.cpp: expanded scan support
- Enhance TTSmemLowerToGCU.cpp: shared memory lowering
- Rework TritonToGCU.cpp: main conversion pipeline
- Add TritonGCULocalMemOptimize.cpp: local memory optimization pass
- Add ReduceScanCommon: shared reduce/scan utilities
- Add Vectorize.cpp/h: vectorization support
- Add TritonToTritonGPU conversion for GCU400
- Enhance PtrAnalysis, MaskAnalysis, FirstLastUserAnalysis
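
One general reason reduce and scan lowerings can share utilities is that both are parameterized by the same associative combine function, and a reduction equals the last element of an inclusive scan. The plain-Python sketch below illustrates that relationship only. It is an assumption about why such sharing is natural, not a description of what ReduceScanCommon contains, and the function names are hypothetical.

```python
import operator
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def inclusive_scan(xs: Sequence[T], combine: Callable[[T, T], T]) -> List[T]:
    """Return running results: out[i] = combine(xs[0], ..., xs[i])."""
    out: List[T] = []
    for x in xs:
        out.append(x if not out else combine(out[-1], x))
    return out


def reduce_all(xs: Sequence[T], combine: Callable[[T, T], T]) -> T:
    """A reduction is the final element of the inclusive scan."""
    if not xs:
        raise ValueError("cannot reduce an empty sequence without an identity")
    return inclusive_scan(xs, combine)[-1]


if __name__ == "__main__":
    print(inclusive_scan([1, 2, 3, 4], operator.add))
    print(reduce_all([3, 1, 4, 1, 5], max))
```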

2.2 Python Bindings (new)

- Add triton_gcu400_core.cpp/h: C++ pass registration module
- Add triton_gcu400_module.cpp: Python pybind11 bindings
- Export pass functions via triton_gcu400_core.map

3. Tests

3.1 TLE-Raw Tests (Enflame, 13 new files)

- MLIR EDSL: vector-add, softmax, gcu_opt_lowering, ptr2memref
- TOPS EDSL: vector-add, softmax, matrix-multiplication, dialect IR

3.2 TLE Unit Tests (Enflame, 4 new files, 7 files updated)

- test_tle_raw_tops, test_extract/insert_tile_static_index
- DeepSeek V32 sparse-MLA tutorial
- Update test_tle_cumsum, test_tle_gpu_local_ptr for GCU compatibility
- Update tutorials (01-fft, 03-topk, deepseek_v32/01-topk_selector)

Comment thread python/tutorials/tle/01-fft.py Outdated
@zhzhcookie zhzhcookie changed the title [ENFLAME] support TLE features [TLE] [ENFLAME] support TLE features May 22, 2026
@zhzhcookie zhzhcookie merged commit 5742ad4 into triton_v3.6.x May 22, 2026
10 checks passed
@zhzhcookie zhzhcookie deleted the enflame/update_0520 branch May 22, 2026 16:11