Add bpb metric by klei22 · Pull Request #847 · ReaLLMASIC/ReaLLM-Forge

klei22 · 2026-06-17T05:55:05Z

No description provided.

Copilot

Pull request overview

This PR adds a validation bits-per-byte (bpb) metric derived from dataset UTF-8 byte counts, and extends training to support multiple selectable stopping limiters (iters/epochs/seconds/tokens). It also updates exploration tooling and dataset preparation scripts to record the needed byte metadata and surface the new metric.
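As a rough sketch of the conversion described above, assume the validation loss is a mean cross-entropy in nats per token and that the dataset metadata supplies total token and UTF-8 byte counts for the validation text. The function and parameter names here are illustrative and are not taken from the PR's `_val_bits_per_byte`:

```python
import math


def bits_per_byte(loss_nats_per_token: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits per UTF-8 byte.

    Assumption: num_tokens and num_bytes describe the same text.
    """
    if num_tokens <= 0 or num_bytes <= 0:
        # Missing byte metadata: report NaN rather than a misleading number.
        return float("nan")
    bits_per_token = loss_nats_per_token / math.log(2)  # nats -> bits
    tokens_per_byte = num_tokens / num_bytes
    return bits_per_token * tokens_per_byte
```

Because the result is normalized by bytes rather than tokens, it stays comparable across tokenizers that split the same text into different numbers of tokens.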

Changes:

Add bpb scaling from meta.pkl and log bpb to TensorBoard + best_val_loss_and_iter.txt.
Introduce configurable training stop limiters (--training_limiters + --max_epochs/--max_seconds/--max_tokens) and update ETA/progress estimation.
Update experiment runners/monitors and add a FLORES-200 tokenizer-comparison exploration config + demo script.
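The stop-limiter idea in the list above can be sketched as follows. This is a minimal illustration under two assumptions: training stops as soon as any selected limiter reaches its limit, and progress is tracked in a plain dictionary. The `should_stop` helper and its argument names are hypothetical and do not reflect the actual code in `train.py`:

```python
def should_stop(active_limiters, progress, limits):
    """Return True once any selected limiter has reached its configured limit.

    active_limiters: names chosen by the user, e.g. ["iters", "seconds"]
    progress:        current counters, e.g. {"iters": 120, "seconds": 30.5}
    limits:          configured maxima; a missing or None entry means "no limit"
    """
    for name in active_limiters:
        limit = limits.get(name)
        if limit is not None and progress.get(name, 0) >= limit:
            return True
    return False
```

A limiter that is configured but not selected is ignored, which is one way to let a single config carry several limits while only some of them are active for a given run.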

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.


| File | Description |
| --- | --- |
| `train.py` | Computes/logs bits-per-byte and writes it into `best_val_loss_and_iter.txt`; adds multi-limiter stopping logic and ETA estimation changes. |
| `train_args.py` | Adds CLI flags for new stopping limiters and bpb logging toggle. |
| `run_exploration_monitor.py` | Adds `best_val_bits_per_byte` column and CSV formatting support. |
| `optimization_and_search/run_from_yaml.py` | Extends metric schema to include `best_val_bits_per_byte` (but needs a robust backward-compat parse fix). |
| `optimization_and_search/run_experiments.py` | Extends metric schema to include `best_val_bits_per_byte` (but needs a robust backward-compat parse fix). |
| `explorations/flores200_bits_per_byte_tokenizer_comparison.yaml` | New exploration config for comparing tokenizers via validation bpb. |
| `data/template/prepare.py` | Records UTF-8 byte/token counts into `meta.pkl` for bpb scaling. |
| `data/flores200-res/demo_bits_per_byte_tokenizers.sh` | New helper script to prepare FLORES-200 datasets for bpb comparisons. |
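To illustrate the kind of byte metadata the `prepare.py` row refers to, here is a minimal sketch that counts UTF-8 bytes and tokens and pickles them. The key names (`total_utf8_bytes`, `total_tokens`) and the `build_meta` helper are assumptions made for this example; the PR's actual `meta.pkl` layout is not shown on this page:

```python
import pickle


def build_meta(text, token_ids, vocab_size):
    """Collect the counts needed to rescale a per-token loss to per-byte."""
    return {
        "vocab_size": vocab_size,
        # Bytes of the raw text as UTF-8, not the character count.
        "total_utf8_bytes": len(text.encode("utf-8")),
        "total_tokens": len(token_ids),
    }


# "é" takes two bytes in UTF-8, so this 5-character string is 6 bytes long.
meta = build_meta("héllo", [1, 2, 3], vocab_size=256)
blob = pickle.dumps(meta)  # a real script would write this to meta.pkl
```

Counting encoded bytes rather than characters matters for multilingual data such as FLORES-200, where many characters occupy more than one byte.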


+    # Take only the core values and cast them appropriately.
+    if len(parts) == len(METRIC_KEYS) - 1:
+        # Backward compatibility for runs created before best_val_bits_per_byte.
+        parts.insert(1, "nan")
    if len(parts) < len(METRIC_KEYS):


+    if len(parts) == len(base_metric_keys) - 1:
+        # Backward compatibility for runs created before best_val_bits_per_byte.
+        parts.insert(1, "")
    if len(parts) < len(base_metric_keys):
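The two fragments above fill the missing bpb field differently (`"nan"` in one, `""` in the other). One way to make both parsers agree is sketched below, assuming the metrics file is a single comma-separated line and using a shortened, hypothetical `METRIC_KEYS`; the real key lists in the runners are longer and may differ:

```python
import math

# Hypothetical, shortened schema for illustration only.
METRIC_KEYS = ["best_val_loss", "best_val_bits_per_byte", "best_val_iter"]
BPB_INDEX = METRIC_KEYS.index("best_val_bits_per_byte")


def parse_metrics_line(line):
    """Parse a metrics line, tolerating files written before bpb existed."""
    parts = [p.strip() for p in line.strip().split(",")]
    if len(parts) == len(METRIC_KEYS) - 1:
        # Legacy file: insert a placeholder where the bpb value would be.
        parts.insert(BPB_INDEX, "nan")
    if len(parts) < len(METRIC_KEYS):
        raise ValueError(f"expected {len(METRIC_KEYS)} fields, got {len(parts)}")
    values = {}
    for key, raw in zip(METRIC_KEYS, parts):
        if key == "best_val_iter":
            values[key] = int(raw)
        else:
            # Treat an empty field the same as an explicit "nan".
            values[key] = float(raw) if raw else float("nan")
    return values
```

Note that the length-minus-one check is only reliable when the schema has no optional trailing fields; if it does, a legacy file and a truncated new file can have the same field count.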


                    metrics = [
                            f"{self.best_val_loss.item()}",
+                            f"{self._val_bits_per_byte(self.best_val_loss, self.args.dataset):.6f}",
                            f"{self.iter_num}",


klei22 and others added 4 commits June 16, 2026 11:53

Add bits per byte logging

d37e95e

fix(monitor): show bits per byte metric

4e6c13d

feat(training): add flexible stop limiters

ecd9a89

Update bpb yaml file to allow overfitting for each

41cb743

klei22 requested review from Copilot and gkielian June 17, 2026 05:55

Copilot started reviewing on behalf of klei22 June 17, 2026 05:55

Copilot AI reviewed Jun 17, 2026


klei22 and others added 2 commits June 16, 2026 23:08

Merge branch 'master' into add-bpb-metric

6369c41

Remove stray text in train.py

916fe74

klei22 wants to merge 6 commits into ReaLLMASIC:master from klei22:add-bpb-metric
