SQT — Static-Q Transformer

A self-attention variant where the query projection Q = W_q(x) is replaced with a learnable per-position table Q[t, h]. Keys and values stay content-based.
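
As a rough illustration of the idea, here is a minimal PyTorch sketch of one static-query attention layer. It is not the code in src/sqt/model.py. The class name StaticQAttention, the table layout (one head_dim vector per position and head), the 0.02 init scale, the causal mask, and the output projection are all assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StaticQAttention(nn.Module):
    """Causal self-attention whose queries come from a learned table, not from x."""

    def __init__(self, d_model: int, n_heads: int, max_seq_len: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        # Assumed layout: one head_dim-sized query vector per (position, head).
        self.q_table = nn.Parameter(
            torch.randn(max_seq_len, n_heads, self.head_dim) * 0.02
        )
        # Keys and values remain ordinary content-based projections.
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        k = self.k_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # Look up the first T query rows; they are the same for every sequence.
        q = self.q_table[:T].transpose(0, 1).unsqueeze(0).expand(B, -1, -1, -1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(y.transpose(1, 2).reshape(B, T, C))
```

The only difference from a standard layer in this sketch is the line that builds q: it indexes a parameter by position instead of multiplying x by a weight matrix.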

The intuition is simple: probe experiments on small Transformers suggest that the query side of attention does little content-specific work in early training. SQT formalises that observation as an architecture and measures, per parameter scale, how far the simplification carries.

Why this exists

Attention layers spend three projections per token: Q, K, V. The Q projection is what makes attention conditional on the content of the querying token. If you can replace it with a fixed positional lookup without losing perplexity, you save:

one (d_model, d_model) matmul per layer per forward,
the q_proj parameter block,
some inference latency (8 to 16 % in two of the three sizes measured below, none in the third).
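
The savings listed above can be put into numbers with a few lines of arithmetic. This is a sketch under assumptions: the static table is taken to hold max_seq_len × d_model values per layer, and the dense projection is taken to have a bias. If the repo's table is shaped differently, the parameter figure changes; the removed matmul does not.

```python
def q_proj_cost(d_model: int, bias: bool = True) -> dict:
    """Per-layer cost of a dense query projection."""
    return {
        "params": d_model * d_model + (d_model if bias else 0),
        # One multiply-add per weight, per token.
        "macs_per_token": d_model * d_model,
    }


def q_table_cost(d_model: int, max_seq_len: int) -> dict:
    """Per-layer cost of a static query table (assumed shape: max_seq_len x d_model)."""
    return {
        "params": max_seq_len * d_model,
        # A table lookup needs no multiply-adds.
        "macs_per_token": 0,
    }


dense = q_proj_cost(d_model=256)
table = q_table_cost(d_model=256, max_seq_len=256)
print(dense["params"], table["params"])                  # 65792 65536
print(dense["macs_per_token"], table["macs_per_token"])  # 65536 0
```

Under these assumptions the compute saving is unconditional, while the parameter saving depends on how max_seq_len compares with d_model.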

You also get a cleaner model to reason about: every position has a fixed query "role", and content variation flows entirely through keys and values. Whether that is enough depends on the size of the model. This repo gives a number.

Results

Trained from scratch on Tiny Shakespeare (character-level, sequence length 256, batch 64, 2000 optimiser steps, single seed) on an RTX 3050. Full reproduction commands below.

size	params	baseline ppl	SQT ppl	Δ ppl	inference speed
tiny	305 k	11.61	12.00	+3.4 %	8.3 % faster
small	2.2 M	8.47	11.49	+35 %	unchanged
medium	7.0 M	7.80	9.59	+23 %	16 % faster

At ~300 k parameters, SQT stays within about 3.4 % of baseline perplexity and runs about 8 % faster at inference. Above that, full SQT is not a drop-in replacement: perplexity rises by 23 to 35 % at 2.2 M and 7.0 M parameters, so the content-based query carries real signal at scale.
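
For readers checking the table, the following sketch shows how the Δ ppl column can be recomputed. It assumes perplexity is the usual exponential of the mean per-character cross-entropy in nats, which this README does not state. Recomputing from the rounded table values can differ from the printed column by a rounding step.

```python
import math


def perplexity(mean_nll_nats: float) -> float:
    """Perplexity from the mean per-token cross-entropy, measured in nats."""
    return math.exp(mean_nll_nats)


def relative_change(baseline: float, variant: float) -> float:
    """Percentage change of `variant` relative to `baseline`."""
    return 100.0 * (variant - baseline) / baseline


# Values from the "tiny" row of the results table.
print(f"{relative_change(11.61, 12.00):+.1f} %")  # +3.4 %
```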

The clean read-out is the boundary itself: in this setup, attention layers tolerate static queries at 305 k parameters but not at 2.2 M.

What to use this for

Edge / embedded LMs (≤ ~1 M params). At the one size measured in this range (tiny, 305 k), SQT buys 8.3 % faster inference for a 3.4 % perplexity tax. The expectation is that the smaller the model, the cleaner the trade.
Layer-wise hybrids. HybridLM(cfg, n_sqt_layers=N) keeps the first N layers SQT and the rest standard. Run python -m bench.layer_sweep to find the largest N that doesn't hurt perplexity for your size; this is often the practical speed/quality knob, not "all SQT or none".
Architectural baselines. If you study attention, SQT is a cheap reference point for what the query projection is actually buying you at your scale.
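
One way to read "the largest N that doesn't hurt perplexity" is as a tolerance check against the N=0 run. The helper below is a hypothetical sketch, not part of bench.layer_sweep: the function name, the dict input, and the 1 % default tolerance are all assumptions, and the numbers in the usage line are placeholders rather than measurements.

```python
def largest_safe_n(ppl_by_n: dict[int, float], tolerance_pct: float = 1.0) -> int:
    """Largest SQT layer count whose perplexity stays within tolerance of N=0."""
    baseline = ppl_by_n[0]  # N=0 means every layer is a standard attention layer
    limit = baseline * (1.0 + tolerance_pct / 100.0)
    safe = [n for n, ppl in ppl_by_n.items() if ppl <= limit]
    return max(safe)


# Placeholder values for illustration only; not results from this repo.
print(largest_safe_n({0: 10.0, 1: 10.05, 2: 10.4}))  # 1
```

Since the results here come from a single seed, a tolerance tighter than the seed-to-seed noise would not be meaningful.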

Install

git clone https://github.com/narelabs/sqt
cd sqt
pip install -e .

Requires torch >= 2.1. Tiny Shakespeare downloads automatically on first benchmark run (~1 MB).

Reproduce

Full size sweep (≈8 minutes on RTX 3050):

python -m bench.sweep --sizes tiny small medium --steps 2000

Layer-wise hybrid sweep on the small configuration:

python -m bench.layer_sweep --size small --steps 2000

JSON outputs land in bench/results/, summary tables print to stdout. Smoke tests:

pytest -q
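
The JSON files written to bench/results/ can be inspected with a few lines of standard-library Python. Their schema is not documented in this README, so the sketch below only loads each file and makes no assumption about the keys inside; load_results is a name chosen for this example.

```python
import json
from pathlib import Path


def load_results(results_dir: str = "bench/results") -> dict:
    """Map each JSON file's stem (file name without .json) to its parsed contents."""
    return {
        path.stem: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(results_dir).glob("*.json"))
    }
```

Calling load_results() from the repository root after a sweep returns one entry per output file, which can then be explored interactively.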

Use the model

import torch
from sqt import LMConfig, BaselineLM, SQTLM, HybridLM

cfg = LMConfig(
    vocab_size=65,
    d_model=256, n_heads=4, n_layers=4,
    d_ff=512, max_seq_len=256,
)

baseline = BaselineLM(cfg)               # standard Transformer
sqt_full = SQTLM(cfg)                    # all layers SQT
hybrid   = HybridLM(cfg, n_sqt_layers=2) # first 2 layers SQT, rest baseline

x = torch.randint(0, 65, (1, 64))
print(baseline(x).shape)                 # torch.Size([1, 64, 65]) = (batch, seq_len, vocab_size)
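
To compare the three variants on your own hardware, a generic parameter counter and forward-pass timer are enough for a first look. This is a sketch that works on any nn.Module; it is not the harness used in bench/, and the warmup and iteration counts are arbitrary defaults.

```python
import time

import torch
import torch.nn as nn


def count_params(model: nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


@torch.no_grad()
def mean_forward_ms(model: nn.Module, x: torch.Tensor,
                    warmup: int = 3, iters: int = 20) -> float:
    """Average wall-clock time of one forward pass, in milliseconds."""
    model.eval()
    for _ in range(warmup):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()  # finish queued GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()  # wait for the timed passes to complete
    return (time.perf_counter() - start) * 1000.0 / iters
```

For example, count_params(baseline) - count_params(sqt_full) shows how many parameters the static table actually saves for a given cfg, and mean_forward_ms(baseline, x) against mean_forward_ms(sqt_full, x) gives a local latency comparison.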

Repository layout

sqt/
├─ src/sqt/
│  ├─ __init__.py
│  └─ model.py          # BaselineLM, SQTLM, HybridLM
├─ bench/
│  ├─ sweep.py          # size sweep
│  ├─ layer_sweep.py    # layer-wise SQT mix
│  └─ results/*.json    # generated outputs
├─ tests/               # pytest smoke tests
├─ docs/
│  ├─ design.md         # architecture and rationale
│  └─ results.md        # current numbers
├─ README.md
├─ LICENSE              # Apache-2.0
└─ CITATION.cff

Scope

The numbers above come from a single seed at one training budget on character-level Tiny Shakespeare. They are reproducible end-to-end with the commands shown. Multi-seed runs, BPE corpora, and larger models are natural extensions; pull requests with additional configurations are welcome and will be folded into docs/results.md.

License

Apache-2.0.
