- Estimated release date:
  - [x] public preview (*alpha*): 9/1
  - [x] public preview (*beta*): 9/30
  - [x] refactor kernels & TeSA: 10/15

## P0

- [x] Tuning more steps to show the speedup gain of the PyTorch sparse modules
- [x] Support the OpenAI kernel/template
- [x] Code review
- [x] Usage interface (8.19) (update one version on 8.26)
- [x] Fix Triton speed (8.19)
- [x] Sparse Softmax kernel
- [x] Biased OpenAI MatMul kernel
- [x] finegrained 99% + block size 8x8 95% + block size 32x32
- [x] Documentation (test)
- [x] Package data (test)
- [x] sparta.tune(): hook, set search space
- [x] Fix sparse softmax
- [x] Integration test/example: Linear, Softmax
- [x] Fix JIT latency
- [x] Read the docs
- [x] SparTA DDS MatMul kernel
- [x] Batch MatMul & Softmax
- [x] Sparse Attention
- [x] Add sparse matmul kernel: transpose_A
- [x] Functional
- [x] Support backward
- [x] Add performance test: compare with Triton 1.1.2 (upload test scripts)
- [x] Test current tuner
- [x] Test Sparse Attention
- [x] Update kernel PyCUDA interface
- [x] Profile layout converting
- [x] Construct sparse attention op with linear & softmax ops
- [ ] Beta version: docs, docstrings & examples
- [x] Test on V100; backward
- [x] Fix kernel output
- [ ] Module tuner: get combined search space of connected ops automatically
- [ ] Connect to NNI's new tuner

## P1

- [ ] Apply roller's rules
- [ ] Support multi-process tuning
- [ ] BCSR kernel: convert(), inverse(), swapaxes(), sum(), rebuild TeSA Converter when set_mask()
- [ ] Auto converter: support value mask in matmul kernels
- [ ] PyCUDA device context register & operator.to() (multiple cards)
- [x] Support the multiple sparse formats: sdd, dsd, dds for linear
- [ ] Support the block quantization kernel/fp16/bf16
- [ ] Compare Sparse Softmax with Triton's Sparse Softmax and keep improving
- [x] Unit tests
- [x] Model tuning interface / documents / examples
- [ ] Common mask patterns
- [ ] Refactor TeSA (Meta, linter)
- [ ] Fuse layout converting into kernels

## P2

- [ ] Support the offline LUT or the kernel cache/DB