Unclear how to preprocess entropy data #133

@YenChen-Wu

How to preprocess data for training?

I am trying to launch the debug script after installation,

python -m bytelatent.train config=bytelatent/configs/debug.yaml

and I see that I need to specify preprocess_dir in the YAML file for training. However, there are no instructions on how to preprocess the data. (I guess we should use preprocess_entropies.py or parallel_entropies.py.)

What I tried

I attempted to preprocess the .jsonl files in the fineweb_edu_10bt_shuffled dataset one by one using:

python -m bytelatent.preprocess.preprocess_entropies fineweb_edu_10bt.chunk.00.jsonl output_dir

However, I am unsure about which .pth file should be specified for the entropy model:

  • consolidated_with_rope.pth
  • consolidated.pth
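
To make the one-by-one attempt concrete, here is a minimal sketch of how it could be looped over every chunk. It assumes only the two positional arguments shown in the command above (input file, output directory). The helper names `build_commands` and `run_all` are hypothetical. How the entropy model checkpoint is passed is exactly what is unclear, so it is deliberately left out.

```python
import subprocess
import sys
from pathlib import Path

# Module path copied from the command in this issue; nothing else is assumed about its CLI.
PREPROCESS_MODULE = "bytelatent.preprocess.preprocess_entropies"


def build_commands(input_dir, output_dir, pattern="*.jsonl"):
    """Return one preprocessing command per .jsonl file, in sorted order."""
    commands = []
    for jsonl_path in sorted(Path(input_dir).glob(pattern)):
        # Mirrors: python -m bytelatent.preprocess.preprocess_entropies <file> <output_dir>
        commands.append(
            [sys.executable, "-m", PREPROCESS_MODULE, str(jsonl_path), str(output_dir)]
        )
    return commands


def run_all(input_dir, output_dir, dry_run=True):
    """Print each command (dry run) or execute the commands one after another."""
    for cmd in build_commands(input_dir, output_dir):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first chunk that fails
            subprocess.run(cmd, check=True)


# Example (paths are placeholders):
# run_all("fineweb_edu_10bt_shuffled", "output_dir", dry_run=True)
```

With `dry_run=True` this only prints the commands, so the file list can be checked before any preprocessing actually starts.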

To improve reproducibility, could you please provide a simple guide or script for preprocessing the training data? Many thanks!
