Add bpb metric #847
Open
klei22 wants to merge 6 commits into
Pull request overview
This PR adds a validation bits-per-byte (bpb) metric derived from dataset UTF-8 byte counts, and extends training to support multiple selectable stopping limiters (iters/epochs/seconds/tokens). It also updates exploration tooling and dataset preparation scripts to record the needed byte metadata and surface the new metric.
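For readers unfamiliar with the metric, the conversion from validation loss to bits per byte can be sketched as below. This is a minimal illustration, not the code in `train.py`: it assumes the loss is a mean cross-entropy in nats per token, and the function name and the `val_token_count` / `val_byte_count` keys are hypothetical stand-ins for whatever `meta.pkl` actually stores.

```python
import math


def bits_per_byte(loss_nats_per_token: float, token_count: int, byte_count: int) -> float:
    """Convert a mean cross-entropy (nats per token) into bits per UTF-8 byte.

    Hypothetical helper: token_count and byte_count are assumed to describe
    the same validation text.
    """
    if token_count <= 0 or byte_count <= 0:
        # Without byte metadata the metric is undefined.
        return float("nan")
    bits_per_token = loss_nats_per_token / math.log(2)  # nats -> bits
    tokens_per_byte = token_count / byte_count
    return bits_per_token * tokens_per_byte


# Hypothetical metadata: 250 tokens covering 1000 UTF-8 bytes.
meta = {"val_token_count": 250, "val_byte_count": 1000}
loss = 4 * math.log(2)  # 4 bits per token, expressed in nats
print(bits_per_byte(loss, meta["val_token_count"], meta["val_byte_count"]))
```

Because the result is normalized by bytes rather than tokens, it stays comparable across tokenizers with different vocabularies, which is what the FLORES-200 comparison relies on.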
Changes:
- Add bpb scaling from `meta.pkl` and log bpb to TensorBoard + `best_val_loss_and_iter.txt`.
- Introduce configurable training stop limiters (`--training_limiters` + `--max_epochs`/`--max_seconds`/`--max_tokens`) and update ETA/progress estimation.
- Update experiment runners/monitors and add a FLORES-200 tokenizer-comparison exploration config + demo script.
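The multi-limiter stopping check could look roughly like the sketch below. It assumes training stops as soon as any selected limiter is reached; the PR may combine them differently. `Progress`, `reached_limiters`, and the dictionary layout are hypothetical names for illustration, not identifiers from `train.py`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Progress:
    """Hypothetical snapshot of training progress."""
    iters: int = 0
    epochs: float = 0.0
    seconds: float = 0.0
    tokens: int = 0


def reached_limiters(
    progress: Progress,
    limits: Dict[str, Optional[float]],
    active: List[str],
) -> List[str]:
    """Return the active limiters whose limit has been reached.

    Assumption: a limiter with no configured limit (None) never triggers.
    """
    reached = []
    for name in active:
        limit = limits.get(name)
        if limit is not None and getattr(progress, name) >= limit:
            reached.append(name)
    return reached


progress = Progress(iters=100, epochs=0.5, seconds=30.0, tokens=10_000)
limits = {"iters": 1000, "tokens": 10_000, "seconds": None}
print(reached_limiters(progress, limits, ["iters", "tokens"]))
```

Returning the list of triggered limiters, rather than a bare boolean, lets the caller log which limit ended the run.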
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `train.py` | Computes/logs bits-per-byte and writes it into `best_val_loss_and_iter.txt`; adds multi-limiter stopping logic and ETA estimation changes. |
| `train_args.py` | Adds CLI flags for new stopping limiters and bpb logging toggle. |
| `run_exploration_monitor.py` | Adds `best_val_bits_per_byte` column and CSV formatting support. |
| `optimization_and_search/run_from_yaml.py` | Extends metric schema to include `best_val_bits_per_byte` (but needs a robust backward-compat parse fix). |
| `optimization_and_search/run_experiments.py` | Extends metric schema to include `best_val_bits_per_byte` (but needs a robust backward-compat parse fix). |
| `explorations/flores200_bits_per_byte_tokenizer_comparison.yaml` | New exploration config for comparing tokenizers via validation bpb. |
| `data/template/prepare.py` | Records UTF-8 byte/token counts into `meta.pkl` for bpb scaling. |
| `data/flores200-res/demo_bits_per_byte_tokenizers.sh` | New helper script to prepare FLORES-200 datasets for bpb comparisons. |
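One way to make the backward-compatible parse flagged for the two runner scripts less positional is to pick an explicit key list by field count instead of inserting a placeholder at a fixed index. The sketch below is only an illustration under stated assumptions: the file is taken to be a single comma-separated line, the key lists are shortened hypothetical versions of `METRIC_KEYS`, and `parse_metrics_line` is not a function from the PR.

```python
import math

# Hypothetical, shortened key lists; the real METRIC_KEYS may hold more fields.
METRIC_KEYS = ["best_val_loss", "best_val_bits_per_byte", "best_iter"]
LEGACY_METRIC_KEYS = [k for k in METRIC_KEYS if k != "best_val_bits_per_byte"]


def parse_metrics_line(line: str, sep: str = ",") -> dict:
    """Parse a metrics line written either before or after bpb was added."""
    parts = [p.strip() for p in line.strip().split(sep)]
    if len(parts) >= len(METRIC_KEYS):
        keys = METRIC_KEYS
    elif len(parts) == len(LEGACY_METRIC_KEYS):
        keys = LEGACY_METRIC_KEYS  # run created before bpb existed
    else:
        raise ValueError(f"unexpected number of metric fields: {len(parts)}")
    values = dict(zip(keys, parts))
    return {
        "best_val_loss": float(values["best_val_loss"]),
        # A missing or empty bpb field becomes NaN instead of failing the cast.
        "best_val_bits_per_byte": float(values.get("best_val_bits_per_byte") or "nan"),
        "best_iter": int(values["best_iter"]),
    }


print(parse_metrics_line("2.5, 1.234567, 1000"))
print(parse_metrics_line("2.5, 1000"))
```

Mapping by name keeps the casts correct if fields are later reordered, and the explicit `ValueError` surfaces malformed files instead of silently shifting columns.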
Comment on lines +82 to 86
```python
# Take only the core values and cast them appropriately.
if len(parts) == len(METRIC_KEYS) - 1:
    # Backward compatibility for runs created before best_val_bits_per_byte.
    parts.insert(1, "nan")
if len(parts) < len(METRIC_KEYS):
```
Comment on lines +621 to 624
```python
if len(parts) == len(base_metric_keys) - 1:
    # Backward compatibility for runs created before best_val_bits_per_byte.
    parts.insert(1, "")
if len(parts) < len(base_metric_keys):
```
Comment on lines 1992 to 1995
```python
metrics = [
    f"{self.best_val_loss.item()}",
    f"{self._val_bits_per_byte(self.best_val_loss, self.args.dataset):.6f}",
    f"{self.iter_num}",
```
No description provided.