How to preprocess data for training?
I am trying to launch the debug script after installation,
python -m bytelatent.train config=bytelatent/configs/debug.yaml
and I see that I need to specify preprocess_dir in the yaml file for training. However, there are no instructions on how to preprocess the data. (I guess we should use preprocess_entropies.py or parallel_entropies.py.)
What I tried
I attempted to preprocess the .jsonl files (from the fineweb_edu_10bt_shuffled dataset) one by one using:
python -m bytelatent.preprocess.preprocess_entropies fineweb_edu_10bt.chunk.00.jsonl output_dir
However, I am unsure about which .pth file should be specified for the entropy model:
- consolidated_with_rope.pth
- consolidated.pth
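For reference, the one-by-one loop over the chunks can be sketched as below. This is only a minimal sketch: it assumes the command takes exactly the two positional arguments shown above (input file, output directory), and it deliberately passes no entropy-model checkpoint, since which .pth file to use is the open question. The helper names build_commands and run_all are hypothetical and not part of the repository.

```python
import subprocess
import sys
from pathlib import Path

# Module name taken from the command shown above.
PREPROCESS_MODULE = "bytelatent.preprocess.preprocess_entropies"


def build_commands(input_dir, output_dir):
    """Return one preprocessing command per .jsonl chunk, in sorted order.

    Assumption: the module accepts `<input.jsonl> <output_dir>` as in the
    command above; no entropy-model argument is added here.
    """
    commands = []
    for chunk in sorted(Path(input_dir).glob("*.jsonl")):
        commands.append(
            [sys.executable, "-m", PREPROCESS_MODULE, str(chunk), str(output_dir)]
        )
    return commands


def run_all(input_dir, output_dir, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(input_dir, output_dir):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first chunk that fails.
            subprocess.run(cmd, check=True)
```

Calling run_all("fineweb_edu_10bt_shuffled", "output_dir") only prints the commands; passing dry_run=False would actually run them.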
To improve reproducibility, could you please provide a simple guide or script for preprocessing data for training? Many thanks!