SQT — Static-Q Transformer

A self-attention variant where the query projection Q = W_q(x) is replaced with a learnable per-position table Q[t, h]. Keys and values stay content-based.

The intuition is simple: probe experiments on small Transformers suggest that the query side of attention does little content-specific work in early training. SQT formalises that observation as an architecture and measures, per parameter scale, how far the simplification carries.
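The mechanism described above can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the code in src/sqt/model.py: the class name StaticQAttention, the table layout (max_seq_len, n_heads, head_dim), the causal mask, and the output projection are all assumptions made for the example.

```python
import math

import torch
import torch.nn as nn


class StaticQAttention(nn.Module):
    """Causal self-attention whose queries come from a learned per-position table.

    Hypothetical sketch; names and shapes are assumptions, not the repo's API.
    """

    def __init__(self, d_model: int, n_heads: int, max_seq_len: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        # Assumed layout: one query vector per (position, head); no q_proj at all.
        self.q_table = nn.Parameter(
            torch.randn(max_seq_len, n_heads, self.head_dim) * 0.02
        )
        # Keys and values stay content-based, as in standard attention.
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        # Queries depend only on position, so one copy is shared by the whole batch.
        q = self.q_table[:T].transpose(0, 1).unsqueeze(0)            # (1, H, T, D)
        k = self.k_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)  # (B, H, T, T)
        # Causal mask: position t may only attend to positions <= t.
        causal = torch.ones(T, T, dtype=torch.bool, device=x.device).tril()
        scores = scores.masked_fill(~causal, float("-inf"))
        y = scores.softmax(dim=-1) @ v                               # (B, H, T, D)
        return self.out_proj(y.transpose(1, 2).reshape(B, T, C))
```

The only difference from a standard attention layer in this sketch is the source of q: a slice of a parameter tensor rather than a projection of x.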

Why this exists

Attention layers spend three projections per token: Q, K, V. The Q projection is what makes attention conditional on the querying token. If you can replace it with a fixed positional lookup without hurting perplexity, you save:

  • one (d_model, d_model) matmul per layer per forward pass,
  • the q_proj parameter block,
  • a measurable share of inference latency (8.3 % and 16 % in two of the three runs below).

You also get a cleaner model to reason about: every position has a fixed query "role", and content variation flows entirely through keys and values. Whether that is enough depends on the size of the model. This repo gives a number.
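The matmul saving listed above can be estimated with back-of-envelope arithmetic. The sketch below is a rough count under stated assumptions: 2 FLOPs per multiply-add, an output projection in every attention layer, and no accounting for biases, softmax, or the feed-forward block. The function attention_flops is hypothetical and not part of the package.

```python
def attention_flops(d_model: int, seq_len: int, static_q: bool) -> int:
    """Rough FLOP count for one attention layer on one sequence.

    Counts 2 FLOPs per multiply-add; ignores biases, softmax, and layer norm.
    """
    proj = 2 * seq_len * d_model * d_model    # one (d_model, d_model) projection
    n_proj = 3 if static_q else 4             # K, V, out (+ Q for the baseline)
    scores = 2 * seq_len * seq_len * d_model  # Q @ K^T, summed over heads
    mix = 2 * seq_len * seq_len * d_model     # attn @ V, summed over heads
    return n_proj * proj + scores + mix


# d_model and sequence length taken from the LMConfig in "Use the model".
d_model, seq_len = 256, 256
base = attention_flops(d_model, seq_len, static_q=False)
sqt = attention_flops(d_model, seq_len, static_q=True)
saved = 1 - sqt / base
print(f"attention FLOPs saved: {saved:.1%}")  # attention FLOPs saved: 16.7%
```

This figure covers the attention sublayer only. The whole-model saving is smaller once the feed-forward block is included, and measured latency also depends on hardware and kernel choice, so it should not be read as a prediction of the timings in Results.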

Results

Trained from scratch on Tiny Shakespeare (character-level, sequence length 256, batch 64, 2000 optimiser steps, single seed) on an RTX 3050. Full reproduction commands below.

size     params   baseline ppl   SQT ppl   Δ ppl     inference time
tiny     305 k    11.61          12.00     +3.4 %    -8.3 % (faster)
small    2.2 M    8.47           11.49     +35 %     unchanged
medium   7.0 M    7.80           9.59      +23 %     -16 % (faster)

At ~300 k parameters, SQT comes within 3.4 % of the baseline's perplexity and runs 8.3 % faster. Above that, full SQT is not a drop-in replacement: perplexity rises by 23 to 35 %, which suggests the content-based query carries real signal at scale.

The clean read-out is the boundary itself: in this setup, attention layers tolerate static queries at 305 k parameters but not at 2.2 M, so the break-even point lies somewhere between the two.

What to use this for

  • Edge / embedded LMs (≤ ~1 M params). At the 305 k scale measured here, SQT buys faster inference for a small perplexity tax (+3.4 %). The smaller the model, the cleaner the trade.
  • Layer-wise hybrids. HybridLM(cfg, n_sqt_layers=N) keeps the first N layers SQT and the rest standard. Run python -m bench.layer_sweep to find the largest N that doesn't hurt perplexity for your size; this is often the practical speed/quality knob, not "all SQT or none".
  • Architectural baselines. If you study attention, SQT is a cheap reference point for what the query projection is actually buying you at your scale.
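For the layer-wise hybrid case, "the largest N that doesn't hurt perplexity" needs a concrete tolerance. The sketch below shows one way to apply it to a sweep's results. The helper largest_safe_n, the 2 % default tolerance, and the input format (a mapping from n_sqt_layers to perplexity) are assumptions for illustration; they are not the output format of bench.layer_sweep.

```python
from typing import Mapping


def largest_safe_n(ppl_by_n: Mapping[int, float], tolerance: float = 0.02) -> int:
    """Return the largest N whose perplexity stays within `tolerance` (relative)
    of the N=0 (all-standard) run, requiring every smaller N to pass as well."""
    base = ppl_by_n[0]
    best = 0
    for n in sorted(ppl_by_n):
        if ppl_by_n[n] > base * (1 + tolerance):
            break  # stop at the first N that exceeds the budget
        best = n
    return best


# Made-up placeholder numbers, NOT measurements from this repository.
example = {0: 10.0, 1: 10.1, 2: 10.15, 3: 10.9, 4: 12.0}
print(largest_safe_n(example, tolerance=0.02))  # 2
```

Stopping at the first failing N keeps the choice conservative: a deeper configuration that happens to pass by chance on a single seed is not selected.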

Install

git clone https://github.com/narelabs/sqt
cd sqt
pip install -e .

Requires torch >= 2.1. Tiny Shakespeare downloads automatically on first benchmark run (~1 MB).

Reproduce

Full size sweep (≈8 minutes on RTX 3050):

python -m bench.sweep --sizes tiny small medium --steps 2000

Layer-wise hybrid sweep on the small configuration:

python -m bench.layer_sweep --size small --steps 2000

JSON outputs land in bench/results/; summary tables print to stdout. Smoke tests:

pytest -q
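The JSON files written to bench/results/ can be inspected without knowing their schema in advance. The sketch below only lists each file and its top-level keys; it assumes nothing about field names, and summarize_results is a hypothetical helper rather than something the package provides.

```python
import json
from pathlib import Path


def summarize_results(results_dir: str = "bench/results") -> dict:
    """Map each JSON file name to its sorted top-level keys.

    Files whose top level is not an object map to the type name instead.
    """
    summary = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open() as f:
            data = json.load(f)
        if isinstance(data, dict):
            summary[path.name] = sorted(data)
        else:
            summary[path.name] = type(data).__name__
    return summary


if __name__ == "__main__":
    # Prints nothing if no benchmark has been run yet.
    for name, keys in summarize_results().items():
        print(name, keys)
```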

Use the model

import torch
from sqt import LMConfig, BaselineLM, SQTLM, HybridLM

cfg = LMConfig(
    vocab_size=65,
    d_model=256, n_heads=4, n_layers=4,
    d_ff=512, max_seq_len=256,
)

baseline = BaselineLM(cfg)               # standard Transformer
sqt_full = SQTLM(cfg)                    # all layers SQT
hybrid   = HybridLM(cfg, n_sqt_layers=2) # first 2 layers SQT, rest baseline

x = torch.randint(0, 65, (1, 64))
print(baseline(x).shape)                 # torch.Size([1, 64, 65])
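Given logits of the shape shown above, perplexity is conventionally the exponential of the mean per-token cross-entropy. The sketch below uses that standard definition; whether the benchmark scripts compute it in exactly this way is an assumption, and the perplexity helper is not part of the package.

```python
import math

import torch
import torch.nn.functional as F


@torch.no_grad()
def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """exp(mean per-token cross-entropy) for logits (B, T, V) and targets (B, T)."""
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (B*T, V)
        targets.reshape(-1),                  # flatten to (B*T,)
    )
    return math.exp(loss.item())


# Sanity check: uniform logits over 65 symbols give a perplexity of about 65,
# regardless of which targets are drawn.
logits = torch.zeros(1, 64, 65)
targets = torch.randint(0, 65, (1, 64))
print(round(perplexity(logits, targets), 3))
```

For next-token language modelling, targets would be the input sequence shifted left by one position, with the final logit row dropped to match.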

Repository layout

sqt/
├─ src/sqt/
│  ├─ __init__.py
│  └─ model.py          # BaselineLM, SQTLM, HybridLM
├─ bench/
│  ├─ sweep.py          # size sweep
│  ├─ layer_sweep.py    # layer-wise SQT mix
│  └─ results/*.json    # generated outputs
├─ tests/               # pytest smoke tests
├─ docs/
│  ├─ design.md         # architecture and rationale
│  └─ results.md        # current numbers
├─ README.md
├─ LICENSE              # Apache-2.0
└─ CITATION.cff

Scope

The numbers above come from a single seed at one training budget on character-level Tiny Shakespeare. They are reproducible end-to-end with the commands shown. Multi-seed runs, BPE corpora, and larger models are natural extensions; pull requests with additional configurations are welcome and will be folded into docs/results.md.

License

Apache-2.0.
