This is the repo for the paper "Model provenance testing for Large Language Models".
It consists of:
- Python implementation of the tester
- The two benchmarks (Bench-A, Bench-B) of LLMs evaluated in the paper (see the folder data)
- The random sentence prompts used by the tester (see data/random_sentences.txt)
The tester downloads all required LLMs from HuggingFace. You can test LLMs from the two benchmarks, or your own (see below). Producing the full evaluation of the two benchmarks requires around 1TB of disk space and around 2 days (mostly spent downloading LLMs from HuggingFace).
Run
pip install -r requirements.txt
to install all required Python modules.
To run the two benchmarks you can use the provided files of parent and candidate models, i.e.
python tester.py --file_parents data/bench_A_parents.txt --file_candidates data/bench_A_candidates.txt
python tester.py --file_parents data/bench_B_parents.txt --file_candidates data/bench_B_candidates.txt
You can run a quick test (it checks only against the parents from Bench-A and Bench-B) by specifying nothing but a comma-separated list of candidates to test, e.g.
python tester.py --quick_test AUTOMATIC/promptgen-majinai-unsafe,ggml-org/stories15M_MOE,pranavpsv/genre-story-generator-v2
This will only download the tested models, and will use the provided cached outputs for the parent models, so it should be relatively fast.
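If the candidates are already in a Python list, the quick test command can be assembled programmatically. This is a minimal sketch: the helper name quick_test_command is hypothetical and not part of the repo; only the script name tester.py and the --quick_test flag come from the usage shown above.

```python
import sys
from typing import List


def quick_test_command(candidates: List[str]) -> List[str]:
    """Build the argv for a quick test run (hypothetical helper, not in the repo)."""
    if not candidates:
        raise ValueError("at least one candidate model is required")
    # --quick_test expects a single comma-separated list of model names.
    cleaned = [name.strip() for name in candidates]
    return [sys.executable, "tester.py", "--quick_test", ",".join(cleaned)]
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repo root.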
To run the tester properly, you need to specify at least two files: one for the parent models and one for the tested models.
python tester.py --file_parents <parent_file> --file_candidates <tested_models_file>
The file <parent_file> contains one HuggingFace model per line, e.g.
openai-community/gpt2
EleutherAI/pythia-70m
microsoft/DialoGPT-medium
...
The file <tested_models_file> contains one tested model per line, specified either with its ground-truth parent model (format <tested_model>,<parent_model>) or without it,
i.e. either with a parent (write None if the tested model does not have a parent)
BEE-spoke-data/zephyr-220m-sft-full,BEE-spoke-data/smol_llama-220M-openhermes
codeparrot/codeparrot-small,None
...
or without a parent
BEE-spoke-data/zephyr-220m-sft-full
codeparrot/codeparrot-small
...
In the latter case, the tester just outputs the parent guess, whereas in the former case it additionally provides statistics (percentage, recall, ...) on how its guesses compare with the provided parents.
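The two line formats can be illustrated with a small parser. This is only a sketch of the file format described above, not the tester's own code: the names Candidate, parse_candidate_lines and fraction_correct are hypothetical, and the score shown (fraction of labeled candidates whose guess matches the provided parent) is an assumed, simplified stand-in for the statistics the tester reports.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Candidate:
    model: str
    has_ground_truth: bool   # True if the line had the form <tested_model>,<parent_model>
    parent: Optional[str]    # None means "no parent" when has_ground_truth is True


def parse_candidate_lines(lines: Iterable[str]) -> List[Candidate]:
    """Parse lines of a tested-models file (hypothetical helper)."""
    candidates = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if "," in line:
            model, parent = (part.strip() for part in line.split(",", 1))
            # The literal string "None" marks a model without a parent.
            candidates.append(Candidate(model, True, None if parent == "None" else parent))
        else:
            candidates.append(Candidate(line, False, None))
    return candidates


def fraction_correct(candidates: List[Candidate],
                     guesses: Dict[str, Optional[str]]) -> Optional[float]:
    """Fraction of labeled candidates whose guessed parent equals the provided one."""
    labeled = [c for c in candidates if c.has_ground_truth]
    if not labeled:
        return None  # nothing to score without ground truth
    hits = sum(1 for c in labeled if c.model in guesses and guesses[c.model] == c.parent)
    return hits / len(labeled)
```

Here guesses maps each tested model to the guessed parent (or None for "no parent"); candidates listed without ground truth are ignored when scoring.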
The tester first caches the outputs of all parent LLMs across different sets of prompts (because they are reused across all tested models) and then stores them. All future runs can reuse these cached outputs by specifying the prompt id, i.e. use --prompt_id <prompt_cache_id>, where <prompt_cache_id> increases sequentially (0, 1, ...); check the folder cached_prompts for available ids. That is, given a fresh set of parents, the first run would be
python tester.py --file_parents parents.txt --file_candidates children.txt
and all subsequent tests with this same set of parents can reuse the cached outputs. Find the latest id in the folder cached_prompts (e.g. 3_...) and use
python tester.py --prompt_id 3 --file_parents parents.txt --file_candidates new_children.txt
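Looking up the latest id can be automated. The sketch below assumes that entries in cached_prompts start with the integer id followed by an underscore, as the 3_... example suggests; the helper name latest_prompt_id is hypothetical and the rest of the entry name is not interpreted.

```python
import os
import re
from typing import Optional


def latest_prompt_id(cache_dir: str = "cached_prompts") -> Optional[int]:
    """Return the largest leading integer id among entries named '<id>_...'."""
    if not os.path.isdir(cache_dir):
        return None  # no cache folder yet
    ids = []
    for name in os.listdir(cache_dir):
        match = re.match(r"(\d+)_", name)  # assumed naming: id, then underscore
        if match:
            ids.append(int(match.group(1)))
    return max(ids) if ids else None
```

If the function returns None, no cached outputs were found and the first run (without --prompt_id) is needed.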
The tester supports more advanced options (e.g. adjusting the number of prompts, the device used for inference, ...); check --help for the full list.