Merged
22 commits
a16808b
implement pretraining data preparation script, which creates a subset…
ilkerkesen Feb 11, 2026
a7faae7
adapt welt pretraining implementation to work with the new data format
ilkerkesen Feb 11, 2026
371e8ff
document how to run the data preparation script
ilkerkesen Feb 11, 2026
d95038d
make shard limit naming more flexible (100_000 -> 100_000_000)
ilkerkesen Feb 11, 2026
2986d4d
move the module import to top-level
ilkerkesen Feb 11, 2026
6843bf2
fix bugs listed in the PR review
ilkerkesen Feb 11, 2026
630666e
fix the issues raised in the PR review
ilkerkesen Feb 11, 2026
b81d46d
fix lint errors
ilkerkesen Feb 11, 2026
b3667ea
implement data loading for the prepared data within the package, so w…
ilkerkesen Feb 11, 2026
fba018b
create train / validation splits at preparation time
ilkerkesen Feb 11, 2026
6800329
separate train and validation splits at data preparation phase
ilkerkesen Feb 11, 2026
69f85a3
make [(train/validation)]_split_units args required for data preparat…
ilkerkesen Feb 11, 2026
afbc16c
implement extract_text procedure to prevent duplicated text_template …
ilkerkesen Feb 11, 2026
fee1721
make ShardWriter a context manager, initiated with /with/ statements …
ilkerkesen Feb 11, 2026
d1e127c
do not count whitespace chars while keeping statistics
ilkerkesen Feb 11, 2026
3323330
refactor data preparation tests
ilkerkesen Feb 11, 2026
fc33081
rely on the words segmenter to count number of characters
ilkerkesen Feb 11, 2026
fc19d87
separate split metadata files and enable preparation per split
ilkerkesen Feb 12, 2026
dc20377
implement verification for the prepared data
ilkerkesen Feb 12, 2026
c68e192
apply refactorings suggested by claude
ilkerkesen Feb 12, 2026
0651fc8
handle multiple resources while verifying the data
ilkerkesen Feb 12, 2026
3330539
discard extra columns in training script
ilkerkesen Feb 12, 2026
74 changes: 74 additions & 0 deletions README.md
@@ -96,6 +96,80 @@ You can also turn off a specific encoder after training has completed, for testi
> all the embeddings of the previous tokens (on the word level). This is done since not all causal LMs support
> cross-attention, and so we want to avoid using it, and rely on the self-attention mechanism instead.

## Data Preparation

You can prepare datasets offline using the `welt-prepare-data` CLI.
It streams a HuggingFace dataset, samples raw text with unit-based limits, and writes sharded `.jsonl.gz` files:

```shell
welt-prepare-data \
--dataset_name HuggingFaceFW/fineweb \
--dataset_config sample-10BT \
--train_split_units 3200000000 \
--validation_split_units 100000000 \
--num_units_per_file 100000000 \
--max_seq_length 512 \
--seed 42 \
--output_path /scratch/data/pretrain
```

Multiple datasets can be prepared into the same output directory:

```shell
welt-prepare-data \
--dataset_name monology/pile-uncopyrighted \
--train_split_units 1000000000 \
--validation_split_units 50000000 \
--num_units_per_file 100000000 \
--max_seq_length 512 \
--output_path /scratch/data/pretrain
```

The output directory contains sharded `.jsonl.gz` files and a `{prefix}-metadata.json` per dataset:

```
/scratch/data/pretrain/
├── fineweb-sample-10BT-00000000.jsonl.gz
├── fineweb-sample-10BT-00000001.jsonl.gz
├── fineweb-sample-10BT-metadata.json
├── pile-uncopyrighted-00000000.jsonl.gz
└── pile-uncopyrighted-metadata.json
```
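
The shard naming above (an eight-digit, zero-padded index after the prefix) and the `--num_units_per_file` budget can be sketched roughly as follows. This is a minimal illustration, not the package's `ShardWriter`: `write_shards` is a hypothetical helper, the `"text"` field name is an assumption about the record layout, and whitespace splitting stands in for the project's word segmenter.

```python
import gzip
import json
from pathlib import Path


def write_shards(examples, output_dir, prefix, num_units_per_file):
    """Write examples to `{prefix}-{index:08d}.jsonl.gz`, rolling over by unit budget."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths, handle, units_in_shard = [], None, 0
    try:
        for example in examples:
            # Assumption: units are whitespace-separated words of a "text" field.
            units = len(example["text"].split())
            over_budget = units_in_shard > 0 and units_in_shard + units > num_units_per_file
            # Open a new shard when none is open yet or the budget would be exceeded.
            if handle is None or over_budget:
                if handle is not None:
                    handle.close()
                path = output_dir / f"{prefix}-{len(paths):08d}.jsonl.gz"
                handle = gzip.open(path, "wt", encoding="utf-8")
                paths.append(path)
                units_in_shard = 0
            handle.write(json.dumps(example, ensure_ascii=False) + "\n")
            units_in_shard += units
    finally:
        if handle is not None:
            handle.close()
    return paths
```

A single example larger than the budget still gets written to its own shard in this sketch; how the real tool treats that case is not stated here.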

Then train using the prepared data:

```shell
welt-train config.yaml --prepared_data_path /scratch/data/pretrain
```
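
To inspect prepared shards outside the trainer, the `.jsonl.gz` files can be read with the standard library alone. The sketch below assumes only what the directory listing shows (gzip-compressed files with one JSON object per line); `iter_examples` is a hypothetical helper and is not part of the `welt_training` package.

```python
import gzip
import json
from pathlib import Path


def iter_examples(data_dir, prefix=None):
    """Yield one dict per JSON line from every `.jsonl.gz` shard, in file-name order."""
    pattern = f"{prefix}-*.jsonl.gz" if prefix else "*.jsonl.gz"
    for shard in sorted(Path(data_dir).glob(pattern)):
        with gzip.open(shard, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines defensively
                    yield json.loads(line)
```

Calling `iter_examples("/scratch/data/pretrain", prefix="fineweb-sample-10BT")` would walk only the FineWeb shards; the `*-metadata.json` files are skipped because they do not match the shard pattern.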

`welt-prepare-data` accepts the following arguments:

| Argument | Description |
|----------|-------------|
| `--dataset_name` | HuggingFace dataset identifier (required) |
| `--dataset_config` | Dataset config name (optional) |
| `--dataset_split` | Split to use (default: "train") |
| `--text_column` | Column containing text (default: "text") |
| `--text_template` | Python format string template (optional) |
| `--language` | Language tag to store with each example (e.g., "eng_Latn") |
| `--unit_type` | Unit type for counting: "words" or "chars" (default: "words") |
| `--train_split_units` | Number of units for the train split (default: 0, no train shards) |
| `--validation_split_units` | Number of units for the validation split (default: 0, no validation shards) |
| `--num_units_per_file` | Max units per shard file (optional) |
| `--max_seq_length` | Max words per example; splits long documents using word segmentation |
| `--max_bytes_per_word` | Max UTF-8 bytes per word; should match training config `max_word_length - 2` (default: 126) |
| `--seed` | Random seed for shuffling |
| `--drop_remainder` | Drop partial chunks at document boundaries |
| `--output_path` | Output directory path (required) |
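
As a rough illustration of how `--max_seq_length`, `--drop_remainder`, and `--max_bytes_per_word` interact, the sketch below chunks a document by word count and flags words whose UTF-8 encoding exceeds the byte limit. Both functions are hypothetical, and whitespace splitting is a stand-in for the project's word segmenter, so real chunk boundaries may differ. The default of 126 corresponds to a training `max_word_length` of 128 (126 + 2).

```python
def chunk_words(text, max_seq_length, drop_remainder=False):
    """Split text into chunks of at most `max_seq_length` whitespace-separated words."""
    words = text.split()
    chunks = []
    for start in range(0, len(words), max_seq_length):
        piece = words[start:start + max_seq_length]
        if drop_remainder and len(piece) < max_seq_length:
            continue  # discard the partial chunk at the end of the document
        chunks.append(" ".join(piece))
    return chunks


def oversized_words(text, max_bytes_per_word=126):
    """Return the words whose UTF-8 encoding is longer than `max_bytes_per_word`."""
    return [w for w in text.split() if len(w.encode("utf-8")) > max_bytes_per_word]
```

Note that the limit is in bytes, not characters: a word of 64 two-byte characters occupies 128 bytes and would exceed the default.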

### Verifying Prepared Data

After preparing data, verify integrity with `welt-verify-data`:

```shell
welt-verify-data --data_path /scratch/data/pretrain
```

This checks that shard counts and example counts match the metadata, and warns if train/validation splits from the same source were created separately (risking data contamination).
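
The count check described above might look roughly like the sketch below. It is not the `welt-verify-data` implementation: `verify_prefix` is a hypothetical helper, and the metadata keys `num_shards` and `num_examples` are assumed for illustration, since the metadata schema is not shown here. The contamination warning is not covered.

```python
import gzip
import json
import re
from pathlib import Path


def verify_prefix(data_dir, prefix):
    """Return a list of mismatches between shards on disk and the metadata file."""
    data_dir = Path(data_dir)
    metadata_path = data_dir / f"{prefix}-metadata.json"
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))

    # Match only this prefix's shards, e.g. `{prefix}-00000000.jsonl.gz`.
    shard_name = re.compile(re.escape(prefix) + r"-\d{8}\.jsonl\.gz")
    shards = sorted(p for p in data_dir.iterdir() if shard_name.fullmatch(p.name))

    num_examples = 0
    for shard in shards:
        with gzip.open(shard, "rt", encoding="utf-8") as handle:
            num_examples += sum(1 for line in handle if line.strip())

    # Hypothetical metadata keys; the real schema may differ.
    found = {"num_shards": len(shards), "num_examples": num_examples}
    return [
        f"{key}: metadata says {metadata.get(key)}, found {value}"
        for key, value in found.items()
        if metadata.get(key) != value
    ]
```

An empty list means the counts agree; each entry otherwise names the field that disagrees.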

## Training

Training instructions are available in the [welt_training/README.md](./welt_training/README.md).
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -36,6 +36,7 @@ train = [
"scikit-learn", # For "accuracy" metric in evaluate
"sacrebleu", # For usual bleu/chrF metrics
"wandb", # For experiment tracking
"zstandard", # For compressing the data
]

dion = [
@@ -79,3 +80,5 @@ testpaths = [

[project.scripts]
welt-train = "welt_training.train:train"
welt-prepare-data = "welt_training.prepare_data:main"
welt-verify-data = "welt_training.verify_data:main"
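
Each `[project.scripts]` entry maps a command name to a `module:function` reference, so `welt-prepare-data` ends up importing `welt_training.prepare_data` and calling its `main`. A minimal sketch of that resolution step follows; `resolve_entry_point` is a hypothetical helper, and the launcher that an installer actually generates differs in detail.

```python
from importlib import import_module


def resolve_entry_point(spec):
    """Resolve a `module:attribute` entry-point string to the object it names."""
    module_name, _, attr_path = spec.partition(":")
    target = import_module(module_name)
    # Walk dotted attributes after the colon; an empty path returns the module.
    for attr in filter(None, attr_path.split(".")):
        target = getattr(target, attr)
    return target
```

With the package installed, `resolve_entry_point("welt_training.prepare_data:main")` would return the function that the `welt-prepare-data` command runs.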