Train a character-level language model from scratch on WikiText-103. Submissions are scored by lowest energy to reach 0.7 character prediction accuracy. Scorer is agnostic to backprop. Submissions of novel learning algorithms are encouraged!
Prerequisites:
- Modal account.
- Python 3.11
# From wip-wikitext/
python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
modal token new
python submit.py submissions/modded_nanogpt| Date | Energy (J) | Val char-acc | GPU | Config | Submission | Contributor |
|---|---|---|---|---|---|---|
| 2026-05-12 | 51,704 | 0.7374 | A100 80GB PCIe | modded_nanogpt | dir | @KellerJordan |
| 2026-05-18 | 46,222 | 0.7238 | A100 80GB PCIe | lwta_k4 | dir | @ab-10 |
| 2026-05-18 | 46,132 | 0.7146 | A100 80GB PCIe | lwta_k2 | dir | @ab-10 |
| 2026-05-18 | 55,459 | DQ | A100 80GB SXM4 | hyena | dir | @ab-10 |
| 2026-05-18 | 20,348 | DQ | A100 80GB SXM4 | pointer_sentinel | dir | @ab-10 |
| 2026-05-18 | 3,612 | DQ | A100 80GB PCIe | chunker_d1 | dir | @ab-10 |
| 2026-05-18 | 735 | DQ | A100 80GB PCIe | ppm_c | dir | @ab-10 |
| 2026-05-17 | 70 | DQ | A100 80GB SXM4 | P2-A_random_projection | dir | @ab-10 |
Train a character-level language model from scratch on WikiText-103-raw-v1. Submissions that meet the constraints below are ranked by training energy (joules), lower wins. Greedy-argmax char-accuracy is computed on the first 60,000 chars of each split; val is gated by rule 5, test is reported alongside but not gated.
Submissions must:
- Train from scratch. (No pre-trained weights — WikiText overlaps WebText, so pre-trained init poisons the comparison.)
- Use the standard WikiText-103 train/valid/test split. (You can change batch size, sequence length, attention structure, etc.; just don't change the underlying streams of characters.)
- Expose a streaming next-character distribution via the
CharModelAPI. (The runner callspredict()for positionistrictly beforeobserve()commits the ground-truth at positioni— within-document future-peeking is structurally impossible.) a. ImplementingCharModelABC fromwikitext.pyis the most straightforward way to do this. - Finish training in < 300 s wall-clock on the pinned Modal A100-80GB PCIe, measured from the first call into
train()to its return. (Eval is not charged against this budget.) - Attain val char-acc ≥ 0.70 on the first 60,000 chars of the val split.
The char-level scoring contract (rules 2 + 3) constrains the eval-facing interface, not the model's internal representation. Submissions should pick whichever internal unit (bytes, BPE, WordPiece, words, ...) best optimises their architecture's energy-to-accuracy ratio, and translate to per-char probabilities at the CharModel boundary.
- Allowed. Internal BPE / WordPiece / word vocabulary; subword-aware decoders; char-level marginalisation over partial-token continuations; hybrid architectures that mix per-char and per-token computation.
- Encouraged where it helps. Sub-quadratic mixers (Hyena, Mamba, etc.) benefit from shorter sequences; BPE typically gives 4–5× shorter context vs bytes, which can lift the per-step budget enough for those architectures to actually converge inside 300 s.
- Tokeniser merge tables (GPT-2 BPE, SentencePiece, etc.) are not "pretrained weights". They are deterministic algorithms over byte/codepoint streams, allowed under rule 1.
- Pretrained tokeniser model weights are not allowed — e.g., shipping a SentencePiece model that was trained on external data — same rationale as rule 1.
For an internal-BPE submission, predict() returns P(next_char | observed_chars) by marginalising over BPE tokens consistent with the current byte buffer: for an active partial buffer p and candidate next char c, P(c) = Σ_t P(t | committed_context) · 1[bytes(t) starts with p+c], normalised over the candidate set whose tokens start with p.