Delta Attention Fast and Accurate Sparse Attention Inference by Delta Correction

Introduction

We found that sparse attention has a problem which hurts performance. The key-sparse attention causes a distirbutional shift in the attention outputs. As the queries of layer i+1 depend on the attention outputs of layer i, this means that even if one were to use a sparse attention prefill and a dense attention decode, the decode may fail to match the proper keys for a given query due to the distributional shift.

Delta Attention solves a problem by performing query-sparse (and key-dense) attention for a small subset of query tokens in addition to the query-dense (and key-sparse) sparse attention method. We then take the difference between the query sparse output and the sparse attention output. The difference is then repeated for all missing queries and summed together with the key-sparse attention. The result is an attention output that is closer in cosine similarity to the full quadratic attention with minimal added overhead

For more details, please have a look at our paper here

Usage

We provide a simple implementation with an openai server here. To run the server, execute the following commands

pip install -r requirements.txt
chmod +x ./run-server-hf.sh
./run-server-hf.sh

run-server-hf.sh calls server_hf.py which starts a simple openai style server at the port specified in run-server-hf.sh. The arguments for server_hf.py can be changed according to the following.

usage: server_hf.py [-h] [--model-str MODEL_STR] [--attn-implementation ATTN_IMPLEMENTATION] [--mode MODE] [--hip-attn-args HIP_ATTN_ARGS] [--port PORT] [--host HOST]
                    [--no-trust-remote-code] [--delta-lambda DELTA_LAMBDA] [--sliding-window SLIDING_WINDOW]

options:
  -h, --help            show this help message and exit
  --model-str MODEL_STR
  --attn-implementation ATTN_IMPLEMENTATION
  --mode MODE
  --hip-attn-args HIP_ATTN_ARGS
  --port PORT
  --host HOST
  --no-trust-remote-code
  --delta-lambda DELTA_LAMBDA
  --sliding-window SLIDING_WINDOW

Citation

``@inproceedings{willette2025delta,
  title     = {Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction},
  author    = {Willette, Jeffrey and Lee, Heejun and Hwang, Sung Ju},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2025},
  month     = {December},
  eprint    = {2505.11254},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url       = {https://arxiv.org/abs/2505.11254}
}`

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
delta_attention		delta_attention
figures		figures
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
run-server-hf.sh		run-server-hf.sh
server_hf.py		server_hf.py
test_input.txt		test_input.txt
test_request.py		test_request.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Delta Attention Fast and Accurate Sparse Attention Inference by Delta Correction

Introduction

Usage

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Delta Attention Fast and Accurate Sparse Attention Inference by Delta Correction

Introduction

Usage

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages