- Clone or download this repo. `cd` yourself to its root directory.
- Create a Python conda environment: `conda create --name data-distil python=3.8`
- Activate the conda environment: `conda activate data-distil`
- Install dependencies using `pip install -r requirements.txt`
- Set `huggingface` and `openai` credentials in `.env`
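Before running the later steps, it can help to confirm that the credentials are actually present in `.env`. The following is a minimal sketch using only the standard library. The key names `HF_TOKEN` and `OPENAI_API_KEY` are assumptions (they are the conventional names for these services), and the scripts in this repo may expect different keys.

```python
from pathlib import Path

# Assumed key names; check the repo's scripts for the names they actually read.
REQUIRED_KEYS = ("HF_TOKEN", "OPENAI_API_KEY")


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(path, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the .env file."""
    values = parse_env_file(path)
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    if not env_path.exists():
        print("No .env file found in the current directory")
    else:
        print("Missing keys:", missing_keys(env_path) or "none")
```

Run it from the repo root after creating `.env`; an empty list of missing keys means both assumed entries are set.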
- Collect the SLM's and LLM's predictions by running: `python generation/seed_inference.py`
- Run: `python generation/lion/generate.py` (for `slm_preds_path` and `llm_preds_path`, use the save paths from Step 1.)
  - This will generate the following datasets (`<SAVE_DIR>` is specified in the `generate.py` script, `<DATE_TIME>` is an automatically generated data id):
    - `<SAVE_DIR>/<DATE_TIME>/lion_all`
    - `<SAVE_DIR>/<DATE_TIME>/lion_hard`
    - `<SAVE_DIR>/<DATE_TIME>/lion_easy`
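The output layout described above can be sketched in a few lines of Python. This is an illustration only: `lion_dataset_paths` and `make_run_id` are hypothetical helpers, not functions from this repo, and the timestamp format is inferred from the example id `2024-07-06_18-20-59` rather than read from `generate.py`.

```python
from datetime import datetime
from pathlib import Path

# The three dataset names produced by the generation step.
LION_SPLITS = ("lion_all", "lion_hard", "lion_easy")


def lion_dataset_paths(save_dir, date_time):
    """Map each Lion dataset name to <SAVE_DIR>/<DATE_TIME>/<name>."""
    run_dir = Path(save_dir) / date_time
    return {split: run_dir / split for split in LION_SPLITS}


def make_run_id(now=None):
    """Hypothetical id generator mirroring the 2024-07-06_18-20-59 pattern."""
    now = now or datetime.now()
    return now.strftime("%Y-%m-%d_%H-%M-%S")


if __name__ == "__main__":
    paths = lion_dataset_paths("data/gsm8k/lion", "2024-07-06_18-20-59")
    for name, path in paths.items():
        print(name, "->", path.as_posix())
```

A helper like this is handy for checking that all three directories exist after a run, before pointing the finetuning step at one of them.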
Supported SLMs:
- Run: `python finetuning/pipeline.py`
  - For finetuning on the Lion data, set the `ds_train` param to the path of one of the locally saved Lion datasets, e.g. `ds_train = "data/gsm8k/lion/2024-07-06_18-20-59/lion_hard"` (here `<SAVE_DIR> = "data/gsm8k/lion"` and `<DATE_TIME> = "2024-07-06_18-20-59"`)
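If several generation runs have accumulated under `<SAVE_DIR>`, the value for `ds_train` can be looked up rather than typed by hand. The sketch below is one possible way to do that; `latest_lion_dataset` is a hypothetical helper, and it assumes that run directories follow the `2024-07-06_18-20-59` naming shown above, so that sorting by name also sorts by time.

```python
from pathlib import Path


def latest_lion_dataset(save_dir, split="lion_hard"):
    """Return the path of `split` inside the newest run directory of save_dir.

    Assumes run directories are named like 2024-07-06_18-20-59, so sorting
    them by name also sorts them chronologically.
    """
    run_dirs = sorted(p for p in Path(save_dir).iterdir() if p.is_dir())
    if not run_dirs:
        raise FileNotFoundError(f"No run directories found in {save_dir}")
    candidate = run_dirs[-1] / split
    if not candidate.exists():
        raise FileNotFoundError(f"{candidate} does not exist")
    return candidate.as_posix()
```

For example, `latest_lion_dataset("data/gsm8k/lion")` returns a string that can be used as the `ds_train` value; how that value is passed to `finetuning/pipeline.py` depends on the script itself.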