Question about pretraining.  #7

@TeddLi

I have attached my training loss below. Our data mix follows the LLM360 paper, except that we use less StarCoder data.
Each training epoch contains: arXiv 30B, Book 57B, C4 197.67B, Refined-Web 665.01B, StarCoder 150B, StackExchange 21.75B, and Wikipedia 23.90B.
The hyperparameters are the same as those LLM360 reported, except that max_seq_len is 4096 instead of 2048 and the tokenizer is the GPT tokenizer.
We are running the experiment with an open-source repo on an H100 node, with a global batch size of 2048.
Currently our model only reaches around 10.5 PPL on the Falcon dataset, which is much worse than the LLM360 Amber model (around 8 PPL) and Llama 2 (around 8 PPL).
What could be the reason our model performs so much worse?
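
To make the numbers above easier to check, here is a minimal sketch that totals the per-epoch data mix, derives each source's share, and converts the reported perplexities into a loss gap. It rests on several assumptions that are not stated outright in the post: that the figures are billions of tokens, that the Refined-Web figure (665.01) is in billions like the others, that the global batch size counts sequences of max_seq_len tokens each, and that PPL is the exponential of the mean per-token cross-entropy in nats. The names `mix_b` and `ppl_to_loss` are illustrative only.

```python
import math

# Per-epoch data mix in billions of tokens, as listed in the post.
# Assumption: the Refined-Web figure (665.01) is in billions like the others.
mix_b = {
    "arXiv": 30.0,
    "Book": 57.0,
    "C4": 197.67,
    "Refined-Web": 665.01,
    "StarCoder": 150.0,
    "StackExchange": 21.75,
    "Wikipedia": 23.90,
}

total_b = sum(mix_b.values())
shares = {name: tokens / total_b for name, tokens in mix_b.items()}

# Assumption: the global batch size counts sequences, each max_seq_len tokens long.
global_batch_size = 2048
max_seq_len = 4096
tokens_per_step = global_batch_size * max_seq_len
steps_per_epoch = total_b * 1e9 / tokens_per_step


def ppl_to_loss(ppl: float) -> float:
    """Convert perplexity to mean per-token cross-entropy in nats."""
    return math.log(ppl)


# Gap between the reported ~10.5 PPL and the ~8 PPL of the reference models.
loss_gap = ppl_to_loss(10.5) - ppl_to_loss(8.0)

print(f"total per epoch: {total_b:.2f}B tokens")
for name, share in shares.items():
    print(f"{name:>14}: {share:6.1%}")
print(f"tokens per step: {tokens_per_step:,}")
print(f"steps per epoch: {steps_per_epoch:,.0f}")
print(f"loss gap vs. 8 PPL: {loss_gap:.3f} nats")
```

Comparing the printed shares and step count against the reference run's configuration is one quick way to spot where the two setups diverge.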
[Image: training loss curve]
