Add bpb metric #847

Open
klei22 wants to merge 6 commits into ReaLLMASIC:master from klei22:add-bpb-metric
klei22 (Collaborator) commented Jun 17, 2026

No description provided.

Copilot AI left a comment

Pull request overview

This PR adds a validation bits-per-byte (bpb) metric derived from dataset UTF-8 byte counts, and extends training to support multiple selectable stopping limiters (iters/epochs/seconds/tokens). It also updates exploration tooling and dataset preparation scripts to record the needed byte metadata and surface the new metric.
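A minimal sketch of the bpb conversion, assuming the validation loss is a mean cross-entropy in nats per token and that token and UTF-8 byte counts are available for the validation text. The function name and the metadata keys below are hypothetical illustrations, not names taken from this PR.

```python
import math


def val_bits_per_byte(loss_nats_per_token: float, token_count: int, byte_count: int) -> float:
    """Convert mean cross-entropy (nats per token) to bits per UTF-8 byte."""
    if token_count <= 0 or byte_count <= 0:
        # Without valid counts the scaling is undefined.
        return float("nan")
    bits_per_token = loss_nats_per_token / math.log(2)  # nats -> bits
    tokens_per_byte = token_count / byte_count
    return bits_per_token * tokens_per_byte


# Hypothetical metadata of the kind a prepare script could store in meta.pkl.
meta = {"val_token_count": 1000, "val_byte_count": 4000}

# A loss of 2*ln(2) nats/token is 2 bits/token; at 4 bytes per token that is 0.5 bpb.
bpb = val_bits_per_byte(2 * math.log(2), meta["val_token_count"], meta["val_byte_count"])
```

Because the result is normalized by bytes rather than tokens, it stays comparable across tokenizers with different vocabulary sizes, which is presumably why the PR pairs it with a tokenizer-comparison exploration.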

Changes:

  • Add bpb scaling from meta.pkl and log bpb to TensorBoard + best_val_loss_and_iter.txt.
  • Introduce configurable training stop limiters (--training_limiters + --max_epochs/--max_seconds/--max_tokens) and update ETA/progress estimation.
  • Update experiment runners/monitors and add a FLORES-200 tokenizer-comparison exploration config + demo script.
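The multi-limiter idea can be sketched as follows. This assumes training stops as soon as any selected limiter reaches its cap; whether the PR uses that rule is not stated here. The `Limits` class and `should_stop` function are hypothetical; only the limiter names (iters, epochs, seconds, tokens) come from the summary above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Limits:
    """Caps for each limiter; None means no cap was configured."""
    max_iters: Optional[int] = None
    max_epochs: Optional[float] = None
    max_seconds: Optional[float] = None
    max_tokens: Optional[int] = None


def should_stop(active: Sequence[str], limits: Limits, iter_num: int,
                epoch: float, elapsed: float, tokens_seen: int) -> bool:
    """Return True once any *active* limiter has reached its cap (assumed semantics)."""
    progress = {
        "iters": (iter_num, limits.max_iters),
        "epochs": (epoch, limits.max_epochs),
        "seconds": (elapsed, limits.max_seconds),
        "tokens": (tokens_seen, limits.max_tokens),
    }
    for name in active:
        value, cap = progress[name]
        if cap is not None and value >= cap:
            return True
    return False


limits = Limits(max_iters=100, max_tokens=1000)
hit_tokens = should_stop(["iters", "tokens"], limits, 50, 0.0, 0.0, 1000)
not_hit = should_stop(["iters", "tokens"], limits, 50, 0.0, 0.0, 10)
inactive = should_stop(["iters"], limits, 50, 0.0, 0.0, 5000)
```

Keeping the selected limiters separate from their caps means a cap can be configured without being enforced, which mirrors having one flag for selection and separate flags for the values.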

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| train.py | Computes/logs bits-per-byte and writes it into best_val_loss_and_iter.txt; adds multi-limiter stopping logic and ETA estimation changes. |
| train_args.py | Adds CLI flags for new stopping limiters and bpb logging toggle. |
| run_exploration_monitor.py | Adds best_val_bits_per_byte column and CSV formatting support. |
| optimization_and_search/run_from_yaml.py | Extends metric schema to include best_val_bits_per_byte (but needs a robust backward-compat parse fix). |
| optimization_and_search/run_experiments.py | Extends metric schema to include best_val_bits_per_byte (but needs a robust backward-compat parse fix). |
| explorations/flores200_bits_per_byte_tokenizer_comparison.yaml | New exploration config for comparing tokenizers via validation bpb. |
| data/template/prepare.py | Records UTF-8 byte/token counts into meta.pkl for bpb scaling. |
| data/flores200-res/demo_bits_per_byte_tokenizers.sh | New helper script to prepare FLORES-200 datasets for bpb comparisons. |
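One possible shape for a more robust backward-compatible parse is sketched below. It is an illustration only, not the fix requested in the review: it assumes the metrics are comma-separated, that the bpb field is always written with a decimal point (or as `nan`), and that the iteration count is a plain integer. The key list is a shortened, hypothetical subset of the real schema.

```python
import math

# Hypothetical, shortened key list; the real schema has more fields.
METRIC_KEYS = ["best_val_loss", "best_val_bits_per_byte", "best_val_iter"]


def _is_int(text: str) -> bool:
    """True if text parses as an integer (e.g. an iteration count)."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def parse_metrics(line: str) -> dict:
    """Parse a metrics line, tolerating files written before bpb existed."""
    parts = [p.strip() for p in line.split(",")]
    # Legacy files hold the integer iteration count in slot 1, where newer
    # files hold a float bpb, so detect by content rather than by length.
    if len(parts) > 1 and _is_int(parts[1]):
        parts.insert(1, "nan")
    values = dict(zip(METRIC_KEYS, parts))
    return {
        "best_val_loss": float(values["best_val_loss"]),
        "best_val_bits_per_byte": float(values["best_val_bits_per_byte"]),
        "best_val_iter": int(values["best_val_iter"]),
    }


legacy = parse_metrics("1.5, 2000")
current = parse_metrics("1.5, 0.812345, 2000")
```

Detecting the legacy layout by field content avoids misreading a newer file that happens to be one trailing field short, which a pure length check cannot distinguish.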

Comment on lines +82 to 86
# Take only the core values and cast them appropriately.
if len(parts) == len(METRIC_KEYS) - 1:
    # Backward compatibility for runs created before best_val_bits_per_byte.
    parts.insert(1, "nan")
if len(parts) < len(METRIC_KEYS):
Comment on lines +621 to 624
if len(parts) == len(base_metric_keys) - 1:
    # Backward compatibility for runs created before best_val_bits_per_byte.
    parts.insert(1, "")
if len(parts) < len(base_metric_keys):
Comment thread train.py
Comment on lines 1992 to 1995
metrics = [
    f"{self.best_val_loss.item()}",
    f"{self._val_bits_per_byte(self.best_val_loss, self.args.dataset):.6f}",
    f"{self.iter_num}",