Bug: labels not masking padding tokens model trains on pad positions #56
-
|
Found a bug in In tokenized = tokenized.map(lambda x: {"labels": x["input_ids"].copy()})Since The fix is to mask padding positions in labels with def mask_labels(example):
labels = example["input_ids"].copy()
labels = [
-100 if token == tokenizer.pad_token_id else token
for token in labels
]
return {"labels": labels}
tokenized = tokenized.map(mask_labels)Also noticed a second related issue there's a Tested with Mistral 7B on a 2k sample JSONL dataset training loss was noticeably noisier before the fix, stabilized after masking pads properly. Happy to open a PR if this is confirmed. |
Beta Was this translation helpful? Give feedback.
Replies: 2 comments
-
|
Thanks for catching this and laying it out so clearly both issues are confirmed. On the labels masking bug You're right. Since Your fix is correct. Here's the version we'll use in the patch slightly adjusted to handle the batched case cleanly since def mask_labels(example):
labels = example["input_ids"].copy()
labels = [-100 if token == tokenizer.pad_token_id else token for token in labels]
return {"labels": labels}
tokenized = tokenized.map(mask_labels)For the batched version: def mask_labels_batched(batch):
batch["labels"] = [
[-100 if token == tokenizer.pad_token_id else token for token in ids]
for ids in batch["input_ids"]
]
return batch
tokenized = tokenized.map(mask_labels_batched, batched=True)On the dead Also confirmed the On instruction masking (related) While we're fixing this, worth noting there's a third related issue the current labels include the instruction tokens too, meaning the model is learning to predict both the PR is welcome if you want to open one against |
Beta Was this translation helpful? Give feedback.
-
|
LGTM! |
Beta Was this translation helpful? Give feedback.
Thanks for catching this and laying it out so clearly both issues are confirmed.
On the labels masking bug
You're right. Since
loader.pysetstokenizer.pad_token = tokenizer.eos_token, every padding position ininput_idsgets the eos token id, and copying that directly intolabelsmeans the loss is computed over all those pad positions. The model ends up spending capacity learning to predict eos at padded steps, which is noise.Your fix is correct. Here's the version we'll use in the patch slightly adjusted to handle the batched case cleanly since
mapwithbatched=Trueis faster on large datasets: