Skip to content

Mikeore/lumi-arch-research

Repository files navigation

LUMI-Arch Research Notes

Status Stage License

Independent AI architecture research focused on compact language models, compression pressure, and structural transfer.

This repository is a public research record. It is designed to make the evidence legible without exposing implementation details, training recipes, checkpoints, or reproduction-critical architecture internals.

Last updated: 2026-04-20


Short version

LUMI-Arch asks whether a compact model can learn more useful structure per unit of compute by adding stronger compression pressure and structure-aware training signals, rather than relying only on brute-force scale.

The current public record supports three claims:

Claim Public status What it means
The direction is not just a symbolic toy Supported Earlier structured-task wins motivated scale-up.
The architecture family can beat a matched Transformer on a controlled 300M compression test Supported 300M LUMI-family run reached 1.2341 BPB vs 1.4220 BPB for the matched Transformer baseline.
Larger training is feasible, but the program is still diagnostic Supported A 996M pilot reached stable early convergence, but it was not a full benchmark run.
A newer context-efficiency direction is active Private diagnostic track Current private work studies whether compact byte-level models can use context more efficiently without publishing the mechanism.

The current public record does not claim state-of-the-art assistant ability, open reproduction, or a solved general-reasoning system.


Current public evidence

Compression and scale diagnostics

Experiment Public-safe result Interpretation
300M matched comparison LUMI-family: 1.2341 BPB / Transformer: 1.4220 BPB A controlled compression win at matched scale.
300M seed consistency 4 / 4 seeds favored LUMI-family The result was not a single lucky run.
1B pilot 996M parameters, early C4 BPB 2.1616 at 2K steps Training at near-1B scale is operationally feasible, but not yet a final-quality claim.
66.5M private diagnostic branch internal structural transfer 0.590, text BPB 1.831 Newer private diagnostics show stronger transfer behavior, but implementation details are intentionally withheld.

Public-safe external evaluation snapshot

The 66.5M private diagnostic branch was also checked with standard language-model evaluation tasks. These are useful as a reality check: LUMI is showing non-random behavior on several tasks, but it is not yet a broadly strong assistant model.

Benchmark Metric used Result
HellaSwag acc_norm 0.292
PIQA acc_norm 0.529
ARC-Easy acc_norm 0.271
ARC-Challenge acc_norm 0.227
Winogrande acc 0.507
LAMBADA acc 0.040

These numbers are included to avoid cherry-picking internal metrics. They show the current gap clearly: structural transfer is improving faster than broad natural-language competence.


Why this is interesting

Most compact-model projects improve by changing data, scale, or training duration. LUMI-Arch is testing a different bet:

Can compression pressure and structural transfer act as a training pressure that improves the model's internal representations across domains?

The strongest current hypothesis is that the concept/compression mechanism is useful less as a direct inference-time module and more as a training constraint: it forces the model to organize representations in ways that transfer better.

That hypothesis is still being tested. The public evidence says the direction is worth testing further; it does not say the problem is solved.


Current private direction

A newer private track is testing context efficiency in compact byte-level models. The motivation is simple: byte-level systems avoid tokenizer dependence, but they can spend raw context faster than subword-token systems.

The public-safe question is:

Can compression-oriented training pressure improve usable context across longer dependencies without destroying base language-model behavior?

This repository will report only high-level motivation, evaluation categories, and aggregate public-safe results when available. It will not disclose the mechanism, implementation, exact architecture internals, recipes, data mixtures, checkpoints, or reproduction-critical metrics.


Disclosure boundary

To reduce imitation risk, this repository intentionally shares:

Shared publicly Not shared publicly
benchmark summaries source code
public-safe data tables checkpoints or weights
failure-mode interpretation exact architecture internals
compute needs and experiment goals training recipes and hyperparameter sweeps
high-level research direction private data mixtures or curriculum details

If you are evaluating the project for compute support or collaboration, the intended signal is the research trajectory and evidence quality, not a recipe for reproducing the system.


Public materials

Path Description
public_results_snapshot.md One-page public summary for compute / partner outreach
support_brief.md Short compute-support brief
COMPUTE_REQUEST.md Email-ready compute support framing
METHODOLOGY.md Public-safe methodology and disclosure boundary
DISCLOSURE_POLICY.md Explicit public/private/NDA disclosure boundary
results/300m_scale_comparison.md 300M parameter-matched comparison
results/1b_pilot_convergence.md 1B pilot convergence note
results/2026_04_public_diagnostics.md April 2026 public-safe diagnostic snapshot
results/2026_04_context_efficiency_private_track.md Public-safe summary of the private context-efficiency track
results/structured_tasks.md Structured-task benchmark summary
data/300m_comparison.json Machine-readable 300M result data
data/1b_pilot_results.json Machine-readable 1B pilot data
data/2026_04_public_diagnostics.json Machine-readable April 2026 public-safe diagnostics
data/2026_04_context_efficiency_private_track.json Machine-readable public-safe context-efficiency track summary

Tables are treated as canonical. Images and plots, when present, are auxiliary only.


Compute support

The bottleneck is controlled compute, not idea generation.

The highest-value next step is not one huge blind scale run. It is a set of carefully separated experiments that test:

Question Why it matters
Does the structural-transfer signal survive at 100M+ and 1B+? This tests whether the current 66.5M signal is a scaling direction or a small-model artifact.
Can compact byte-level models improve usable context efficiency? This tests whether compression pressure can reduce some raw-context disadvantage without exposing a reproduction path.
Do broader text/code/math mixtures improve external tasks without destroying structured transfer? This tests whether LUMI is becoming more general, not merely better at one diagnostic domain.
Can instruction-style behavior be added without erasing base capability? This tests whether conversation ability is a conversion of base capability or just superficial tuning.
Which failures are architecture failures vs data/curriculum failures? This determines whether scaling is justified.

What is being offered:

  • a clear public research record
  • controlled experiment plans
  • negative-result analysis
  • public-safe benchmark reporting
  • an unusual architecture direction with enough signal to justify more compute

What is not being offered:

  • weights
  • full implementation disclosure
  • reproduction-level recipes
  • private training data mixtures

Current status

Area Status
300M controlled compression comparison Public result available
1B pilot Public convergence note available
66.5M 2026 diagnostics Public-safe summary available
context-efficiency track Private diagnostic track, public-safe motivation available
broad assistant/chat capability Not yet established
next major need controlled 100M+ and 1B+ compute

Contact

Project: LUMI-Arch
Author: @Mikeore
Scope: independent research, 2025-2026

If you are interested in compute support, collaboration, or benchmarking discussion, feel free to reach out via GitHub.

About

Public research notes on compact architecture exploration for efficient language modeling.

Topics

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors