Bug: labels not masking padding tokens, so the model trains on pad positions #56

newtscammander · 2026-06-04T08:03:12Z

Found a bug in hf_deploy/trainer/dataset.py while integrating TuneOS into a training pipeline.

In load_and_tokenize, labels are created by directly copying input_ids:

tokenized = tokenized.map(lambda x: {"labels": x["input_ids"].copy()})

Since loader.py sets tokenizer.pad_token = tokenizer.eos_token and the tokenizer uses padding="max_length", all padding positions in input_ids are filled with the eos token id. When labels copy this directly, the model computes loss over those padding positions too. It learns to predict eos tokens at every padded step, which degrades training quality, especially when a batch contains shorter samples.
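To make the scale of the problem concrete, here is a toy sketch of how much of the label tensor ends up being padding when labels are a plain copy of input_ids. The ids, the max length, and the helper names are made up for illustration and are not taken from the repo; it assumes right-padding and samples no longer than the max length.

```python
# Toy illustration only: ids and lengths are hypothetical values.
EOS_ID = 2        # hypothetical eos id, reused as the pad id
MAX_LENGTH = 8    # hypothetical max_length


def pad_to_max_length(ids, max_length=MAX_LENGTH, pad_id=EOS_ID):
    """Right-pad a list of token ids, mimicking padding="max_length"."""
    return ids + [pad_id] * (max_length - len(ids))


def pad_fraction(samples, max_length=MAX_LENGTH):
    """Fraction of label positions that are padding when labels copy input_ids."""
    total = len(samples) * max_length
    real = sum(len(ids) for ids in samples)
    return (total - real) / total


samples = [[5, 6, 7], [8, 9, 10, 11, 12, 13]]
padded = [pad_to_max_length(ids) for ids in samples]
labels = [ids.copy() for ids in padded]  # the buggy behaviour: labels == input_ids

print(padded[0])              # the short sample is mostly pad/eos ids
print(pad_fraction(samples))  # share of supervised positions that are padding
```

In this toy batch, 7 of the 16 label positions are padding, so a large share of the loss comes from predicting the pad/eos id.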

The fix is to mask padding positions in labels with -100 so they're ignored by the cross-entropy loss:

def mask_labels(example):
    labels = example["input_ids"].copy()
    labels = [
        -100 if token == tokenizer.pad_token_id else token
        for token in labels
    ]
    return {"labels": labels}

tokenized = tokenized.map(mask_labels)
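One caveat worth checking before merging: because pad_token is set to eos_token, comparing against tokenizer.pad_token_id also masks any genuine eos token that ends a sample, so the model may get no signal for when to stop. A minimal sketch of an alternative, assuming the tokenized examples carry an attention_mask column (Hugging Face tokenizers normally return one when padding, but I haven't confirmed it for this repo). The function name and toy ids are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_labels_with_attention(example):
    """Build labels from input_ids, ignoring positions where attention_mask is 0."""
    labels = [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(example["input_ids"], example["attention_mask"])
    ]
    return {"labels": labels}


# Toy example: id 2 plays the role of both eos and pad (values are made up).
example = {
    "input_ids":      [5, 6, 7, 2, 2, 2],
    "attention_mask": [1, 1, 1, 1, 0, 0],
}
result = mask_labels_with_attention(example)
print(result["labels"])  # the real eos at index 3 stays, only the pads are masked
```

If this fits the pipeline, it should slot into the same place as tokenized.map(mask_labels). Whether the samples actually end with a real eos token depends on how the text is formatted upstream.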

Also noticed a second, related issue: there's a tokenize function defined on line 32 that's never called anywhere. The actual tokenization happens in the map call below it. The dead function might confuse contributors into thinking that's the active code path.

Tested with Mistral 7B on a 2k-sample JSONL dataset: training loss was noticeably noisier before the fix and stabilized after masking pads properly. Happy to open a PR if this is confirmed.

Answered by SahilKumar75

SahilKumar75 (Maintainer) · 2026-06-04T08:05:42Z

Thanks for catching this and laying it out so clearly. Both issues are confirmed.

On the labels masking bug

You're right. Since loader.py sets tokenizer.pad_token = tokenizer.eos_token, every padding position in input_ids gets the eos token id, and copying that directly into labels means the loss is computed over all those pad positions. The model ends up spending capacity learning to predict eos at padded steps, which is noise.
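For anyone following along, a rough pure-Python sketch of why those positions matter: a mean-reduced cross-entropy with an ignore index averages only over the non-ignored targets, so unmasked pads are averaged in with the real tokens. The probabilities below are invented for illustration, and the helper is a simplification (it takes the probability assigned to the correct label directly instead of a full softmax).

```python
import math

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mean_nll(correct_token_probs, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored.

    correct_token_probs[i] is the probability the model assigned to the
    correct label at position i (made-up numbers in this sketch).
    """
    losses = [
        -math.log(prob)
        for prob, label in zip(correct_token_probs, labels)
        if label != ignore_index
    ]
    return sum(losses) / len(losses)


# Two real tokens followed by two pad positions (pad id 2 is hypothetical).
probs = [0.5, 0.5, 0.99, 0.99]  # pads are trivially easy to predict
unmasked = mean_nll(probs, [5, 6, 2, 2])
masked = mean_nll(probs, [5, 6, IGNORE_INDEX, IGNORE_INDEX])

print(unmasked)  # averaged over all four positions, pads included
print(masked)    # averaged over the two real tokens only
```

With the pads unmasked, the reported loss mixes real-token loss with near-zero pad loss, so it depends on how much padding each batch happens to contain.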

Your fix is correct. Here's the version we'll use in the patch, slightly adjusted to handle the batched case cleanly, since map with batched=True is faster on large datasets:

def mask_labels(example):
    labels = example["input_ids"].copy()
    labels = [-100 if token == tokenizer.pad_token_id else token for token in labels]
    return {"labels": labels}

tokenized = tokenized.map(mask_labels)

For the batched version:

def mask_labels_batched(batch):
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in ids]
        for ids in batch["input_ids"]
    ]
    return batch

tokenized = tokenized.map(mask_labels_batched, batched=True)

On the dead tokenize function

Also confirmed: the tokenize function defined at line 32 is never called. The actual tokenization goes through the inline map lambda below it. We'll remove it to avoid confusing contributors.

On instruction masking (related)

While we're fixing this, it's worth noting a third, related issue: the current labels include the instruction tokens too, so the model is learning to predict both the ### Instruction: part and the ### Response: part. Ideally only the response tokens should contribute to the loss. We'll track that as a separate improvement since it's a design decision rather than a bug.
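For reference when that gets picked up, a rough sketch of what response-only masking could look like. It is not the planned implementation. It assumes the number of prompt tokens (everything up to and including the ### Response: marker) is already known, for example from tokenizing the prompt on its own, which can be off by a token at the boundary with some tokenizers. The function name, prompt_length parameter, and toy ids are all hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_prompt_tokens(input_ids, prompt_length, attention_mask=None):
    """Return labels where only response tokens contribute to the loss.

    prompt_length is the (assumed known) number of leading tokens that belong
    to the instruction part of the sample.
    """
    labels = list(input_ids)
    # Ignore the instruction tokens at the start of the sequence.
    for i in range(min(prompt_length, len(labels))):
        labels[i] = IGNORE_INDEX
    # Optionally ignore padding as well, based on the attention mask.
    if attention_mask is not None:
        labels = [
            label if keep == 1 else IGNORE_INDEX
            for label, keep in zip(labels, attention_mask)
        ]
    return labels


# Toy sample: 3 prompt tokens, 2 response tokens, a real eos (id 2), 2 pads.
input_ids = [10, 11, 12, 20, 21, 2, 2, 2]
attention_mask = [1, 1, 1, 1, 1, 1, 0, 0]
labels = mask_prompt_tokens(input_ids, prompt_length=3, attention_mask=attention_mask)
print(labels)
```

Only the two response tokens and the closing eos stay supervised in this example.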

A PR is welcome. If you want to open one against hf_deploy/trainer/dataset.py with both fixes, we'll get it merged.

newtscammander (Author) · 2026-06-04T08:08:51Z

LGTM!
