SQT is a self-attention variant in which the query projection
Q = W_q(x) is replaced with a learnable per-position table Q[t, h]
(position t, head h). Keys and values stay content-based.
The intuition is simple: probe experiments on small Transformers suggest that the query side of attention does little content-specific work in early training. SQT formalises that observation as an architecture and measures, per parameter scale, how far the simplification carries.
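To make the idea concrete, here is a minimal PyTorch sketch of a single attention layer with a static query table. It is not the repo's implementation: the class name `StaticQueryAttention`, the table layout `(max_seq_len, n_heads, d_head)`, the bias-free projections, the init scale, and the causal mask are all assumptions made for illustration.

```python
import math

import torch
import torch.nn as nn


class StaticQueryAttention(nn.Module):
    """Causal self-attention whose queries come from a learned table (illustrative)."""

    def __init__(self, d_model: int, n_heads: int, max_seq_len: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Assumed layout: one learned query vector per (position, head).
        self.q_table = nn.Parameter(
            torch.randn(max_seq_len, n_heads, self.d_head) * 0.02
        )
        # Keys and values remain content-based projections of the input.
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        k = self.k_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # Queries depend only on position, so they carry no batch dimension.
        q = self.q_table[:T].transpose(0, 1)  # (H, T, d_head)

        scores = torch.einsum("htd,bhsd->bhts", q, k) / math.sqrt(self.d_head)
        # Causal mask (assumed, since the models are language models).
        causal = torch.ones(T, T, dtype=torch.bool, device=x.device).tril()
        scores = scores.masked_fill(~causal, float("-inf"))
        attn = scores.softmax(dim=-1)

        y = (attn @ v).transpose(1, 2).reshape(B, T, D)
        return self.out_proj(y)


layer = StaticQueryAttention(d_model=32, n_heads=4, max_seq_len=16)
out = layer(torch.randn(2, 10, 32))
print(out.shape)
```

Note that the attention pattern still varies with the input, because the scores are dot products between the fixed queries and content-dependent keys. What disappears is any dependence on the content of the querying token itself.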
Attention layers compute three projections per token: Q, K, V.
The Q projection is what makes attention depend on the content of
the querying token. If you can replace it with a fixed positional
lookup without losing perplexity, you save:
- one `(d_model, d_model)` matmul per layer per forward pass,
- the `q_proj` parameter block,
- a noticeable fraction of inference latency.
You also get a cleaner model to reason about: every position has a fixed query "role", and content variation flows entirely through keys and values. Whether that is enough depends on the size of the model. This repo gives a number.
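A rough accounting of the query side can be sketched as follows. It assumes a bias-free `q_proj` and a query table with `max_seq_len * d_model` entries (one `d_head` vector per position and head); the function name `q_side_costs` is hypothetical, and the repo's actual storage may differ.

```python
def q_side_costs(d_model: int, max_seq_len: int) -> dict:
    """Per-layer query-side costs under the assumptions stated above."""
    return {
        # Weights of a bias-free (d_model, d_model) projection.
        "q_proj_params": d_model * d_model,
        # One d_head vector per (position, head); n_heads * d_head == d_model.
        "q_table_params": max_seq_len * d_model,
        # A (d_model, d_model) matmul costs d_model^2 multiply-adds per token,
        # counted here as 2 FLOPs each. The table lookup needs none.
        "q_proj_flops_per_token": 2 * d_model * d_model,
    }


costs = q_side_costs(d_model=256, max_seq_len=256)
print(costs)
```

Under these assumptions the compute saving is unconditional, while the net parameter count only drops when `max_seq_len` is smaller than `d_model`, since the table replaces the projection weights rather than removing them outright.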
Trained from scratch on Tiny Shakespeare (character-level, sequence length 256, batch 64, 2000 optimiser steps, single seed) on an RTX 3050. Full reproduction commands below.
| size | params | baseline ppl | SQT ppl | Δ ppl | Δ inference time |
|---|---|---|---|---|---|
| tiny | 305 k | 11.61 | 12.00 | +3.4 % | -8.3 % |
| small | 2.2 M | 8.47 | 11.49 | +36 % | unchanged |
| medium | 7.0 M | 7.80 | 9.59 | +23 % | -16 % |
At ~300 k parameters, SQT stays within 3.4 % of baseline perplexity and cuts inference time by 8.3 %. Above that, full SQT is not a drop-in replacement: the perplexity gap grows to more than 20 %, so the content-based query carries real signal at scale.
The clean read-out is the boundary itself: attention layers tolerate static queries up to a model size we can now point at.
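The relative perplexity change can be recomputed from the two perplexity columns with a few lines of Python. The sketch below uses the rounded perplexities from the table, so its last digit may differ slightly from figures computed on unrounded values. The helper name `relative_ppl_change` is illustrative.

```python
# (baseline ppl, SQT ppl) per size, copied from the table above.
results = {
    "tiny": (11.61, 12.00),
    "small": (8.47, 11.49),
    "medium": (7.80, 9.59),
}


def relative_ppl_change(baseline_ppl: float, sqt_ppl: float) -> float:
    """Percentage change in perplexity of SQT relative to the baseline."""
    return 100.0 * (sqt_ppl - baseline_ppl) / baseline_ppl


deltas = {size: relative_ppl_change(b, s) for size, (b, s) in results.items()}
for size, delta in deltas.items():
    print(f"{size:>6}: {delta:+.1f} %")
```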
- Edge / embedded LMs (≤ ~1 M params). SQT is a strict win on inference time at a small perplexity tax. The smaller the model, the cleaner the trade.
- Layer-wise hybrids. `HybridLM(cfg, n_sqt_layers=N)` keeps the first `N` layers SQT and the rest standard. Run `python -m bench.layer_sweep` to find the largest `N` that doesn't hurt perplexity for your size; this is often the practical speed/quality knob, not "all SQT or none".
- Architectural baselines. If you study attention, SQT is a cheap reference point for what the query projection is actually buying you at your scale.
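One way to turn a layer sweep into a choice of `N` is sketched below. It assumes you have already collected a mapping from `n_sqt_layers` to perplexity (for instance by reading the sweep's JSON outputs, whose schema is not shown here) and picks the largest `N` within a relative tolerance of the `N = 0` baseline. The function name, the tolerance, and the example numbers are placeholders, not measurements.

```python
def largest_safe_n(ppl_by_n: dict[int, float], tolerance: float = 0.02) -> int:
    """Largest N whose perplexity is within `tolerance` (relative) of N = 0."""
    limit = ppl_by_n[0] * (1.0 + tolerance)
    return max(n for n, ppl in ppl_by_n.items() if ppl <= limit)


# Placeholder values for illustration only; substitute your own sweep results.
example_sweep = {0: 10.0, 1: 10.1, 2: 10.15, 3: 10.9, 4: 12.0}
best_n = largest_safe_n(example_sweep, tolerance=0.02)
print(best_n)
```

The tolerance encodes how much perplexity you are willing to trade for speed, so it is worth setting explicitly for your use case rather than relying on a default.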
git clone https://github.com/narelabs/sqt
cd sqt
pip install -e .

Requires torch >= 2.1. Tiny Shakespeare downloads automatically on
first benchmark run (~1 MB).
Full size sweep (≈8 minutes on RTX 3050):
python -m bench.sweep --sizes tiny small medium --steps 2000

Layer-wise hybrid sweep on the small configuration:
python -m bench.layer_sweep --size small --steps 2000

JSON outputs land in bench/results/, summary tables print to
stdout. Smoke tests:
pytest -q

import torch
from sqt import LMConfig, BaselineLM, SQTLM, HybridLM
cfg = LMConfig(
vocab_size=65,
d_model=256, n_heads=4, n_layers=4,
d_ff=512, max_seq_len=256,
)
baseline = BaselineLM(cfg) # standard Transformer
sqt_full = SQTLM(cfg) # all layers SQT
hybrid = HybridLM(cfg, n_sqt_layers=2) # first 2 layers SQT, rest baseline
x = torch.randint(0, 65, (1, 64))
print(baseline(x).shape)  # (1, 64, 65)

sqt/
├─ src/sqt/
│ ├─ __init__.py
│ └─ model.py # BaselineLM, SQTLM, HybridLM
├─ bench/
│ ├─ sweep.py # size sweep
│ ├─ layer_sweep.py # layer-wise SQT mix
│ └─ results/*.json # generated outputs
├─ tests/ # pytest smoke tests
├─ docs/
│ ├─ design.md # architecture and rationale
│ └─ results.md # current numbers
├─ README.md
├─ LICENSE # Apache-2.0
└─ CITATION.cff
The numbers above come from a single seed at one training budget on
character-level Tiny Shakespeare. They are reproducible end-to-end
with the commands shown. Multi-seed runs, BPE corpora, and larger
models are natural extensions; pull requests with additional
configurations are welcome and will be folded into docs/results.md.
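For multi-seed runs, a simple summary is the mean and sample standard deviation of final perplexity per model, plus a check of whether the gap between two models exceeds the seed-to-seed noise. The sketch below uses only the standard library. The helper names, the factor `k`, and the example values are hypothetical placeholders, not results from this repo.

```python
from statistics import mean, stdev


def summarize(ppls: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of perplexity across seeds."""
    return mean(ppls), stdev(ppls)


def gap_exceeds_noise(baseline: list[float], variant: list[float], k: float = 2.0) -> bool:
    """True if the mean perplexity gap is larger than k times the larger seed std."""
    gap = mean(variant) - mean(baseline)
    noise = max(stdev(baseline), stdev(variant))
    return gap > k * noise


# Placeholder perplexities for three seeds each; replace with real runs.
baseline_ppls = [10.0, 10.2, 9.8]
variant_ppls = [10.1, 10.3, 9.9]
print(summarize(baseline_ppls), summarize(variant_ppls))
print(gap_exceeds_noise(baseline_ppls, variant_ppls))
```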
Apache-2.0.