Thanks for catching this and laying it out so clearly. Both issues are confirmed.

On the labels masking bug

You're right. Since loader.py sets tokenizer.pad_token = tokenizer.eos_token, every padding position in input_ids holds the eos token id. Copying input_ids directly into labels therefore means the loss is computed over all of those pad positions. The model ends up spending capacity learning to predict eos at padded steps, which is noise.
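To make the effect concrete, here is a toy illustration in plain Python. The token ids, the sequence length, and the helper pad_to are made up for the example and are not taken from loader.py; the only thing it mirrors is the pad == eos setup and the -100 ignore value.

```python
# Toy illustration; EOS_ID = 2 and all token ids are assumptions, not real vocabulary ids.
EOS_ID = 2
PAD_ID = EOS_ID  # mirrors tokenizer.pad_token = tokenizer.eos_token
MAX_LEN = 12


def pad_to(ids, length, pad_id=PAD_ID):
    """Right-pad a list of token ids to a fixed length (hypothetical helper)."""
    return ids + [pad_id] * (length - len(ids))


real_tokens = [101, 7, 8, 9, EOS_ID]  # 5 real tokens, ending in a genuine eos
input_ids = pad_to(real_tokens, MAX_LEN)  # 7 pad positions, all equal to EOS_ID
attention_mask = [1] * len(real_tokens) + [0] * (MAX_LEN - len(real_tokens))

# The buggy behaviour: labels are a straight copy, so nothing is set to -100.
labels = input_ids.copy()

# Positions that still contribute to the loss (anything not equal to -100).
supervised = sum(1 for tok in labels if tok != -100)

# Of those, the ones that are padding according to the attention mask.
pad_supervised = sum(
    1 for tok, keep in zip(labels, attention_mask) if keep == 0 and tok != -100
)

print(f"{pad_supervised} of {supervised} supervised positions are padding")
```

In this made-up case more than half of the positions feeding the loss are padding, and the share grows as sequences get shorter relative to the padded length.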

Your fix is correct. Here's the version we'll use in the patch, slightly adjusted to handle the batched case cleanly, since map with batched=True is faster on large datasets:

def mask_labels(example):
    labels = example["input_ids"].copy()
    labels = [-100 if token 
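
For reference, a minimal sketch of what a batched variant could look like. It is an illustration under stated assumptions rather than the exact patch code: the function name mask_labels_batched is hypothetical, and it assumes the tokenization step has already produced an attention_mask column alongside input_ids. Masking by attention_mask instead of by token id is a choice made for this sketch, since with pad == eos it leaves the genuine end-of-sequence token supervised.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def mask_labels_batched(batch):
    """Build labels for a whole batch, ignoring padded positions.

    Assumes `batch` is a dict of lists of lists with "input_ids" and
    "attention_mask" keys, which is the shape a batched map call passes in.
    """
    batch["labels"] = [
        # Keep the token where the mask is 1, otherwise ignore the position.
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(batch["input_ids"], batch["attention_mask"])
    ]
    return batch


# Small hand-made batch; token id 2 stands in for eos/pad (an assumption).
example_batch = {
    "input_ids": [[5, 6, 2, 2, 2], [7, 8, 9, 10, 2]],
    "attention_mask": [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]],
}
masked = mask_labels_batched(example_batch)
print(masked["labels"])
```

Assuming the data is a Hugging Face datasets object, this would be applied with dataset.map(mask_labels_batched, batched=True).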

Answer selected by SahilKumar75