Dirtyboy1029/CopyCheck

CopyCheck

Overview:

This repository accompanies our paper "As If We've Met Before: LLMs Exhibit Certainty in Recognizing Seen Files". In the paper we propose a novel tool, CopyCheck, which takes a set of files suspected of having been seen by a target LLM and detects which of them have actually been seen and which have not.

Dependencies:

We developed the code on Windows and ran it on Ubuntu 20.04. The runtime environment is the same as that of BLoB.

Dataset:

Source data: BookMIA.

Usage:

1. Estimate the target LLM's uncertainty using the BLoB project.

Our code is built on BLoB; we modified bayesian_peft/run/main.py for this purpose.

 cd my_bayesian_peft
 bash scripts/my_script.sh

The contents of scripts/my_script.sh are explained below.

Fine-tuning target LLM:

 for dataset in bookmia_bonlyseen_10 bookmia_base10_test10
 do
 for epoch in 1
 do
     CUDA_VISIBLE_DEVICES=0 python3 run/main.py --dataset-type mcdataset --dataset $dataset --model-type causallm \
     --model openlm-research/open_llama_7b --modelwrapper mcdropout --lr 1e-4 --batch-size 4  --opt adamw --warmup-ratio 0.06 \
     --max-seq-len 300  --nowand --load-model-path ./my_llm/openlm-research/open_llama_7b \
     --apply-classhead-lora --lora-r 8 --lora-alpha 16 --lora-dropout 0.1  \
      --checkpoint --checkpoint-dic-name epoch$epoch/open_llama_7b-1 --seed $epoch --n-epochs $epoch
 done
 done

Inference:

 for dataset in bookmia_bonlyseen_10 bookmia_bonlyseen_20 bookmia_bonlyseen_30 bookmia_bonlyseen_40
 do
 for epoch in 1
 do
     CUDA_VISIBLE_DEVICES=0 python3 run/main.py --dataset-type mcdataset --dataset $dataset \
     --ood-ori-dataset $dataset --model-type causallm --model openlm-research/open_llama_7b --modelwrapper mcdropout --lr 1e-4 \
     --batch-size 4  --opt adamw --warmup-ratio 0.06  --max-seq-len 300 \
     --apply-classhead-lora --lora-r 8 --lora-alpha 16 --lora-dropout 0.1 \
     --nowand --load-model-path epoch$epoch/open_llama_7b-1 --load-checkpoint
 done
 done
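
The inference loop above runs the same command once per dataset. As a rough sketch, the same sweep can be assembled in Python for a dry run before launching anything. The helper name `inference_command` is hypothetical and not part of the repository; every flag and value is copied from the script above.

```python
import shlex

# Dataset names taken from the inference loop above.
DATASETS = [
    "bookmia_bonlyseen_10",
    "bookmia_bonlyseen_20",
    "bookmia_bonlyseen_30",
    "bookmia_bonlyseen_40",
]


def inference_command(dataset, epoch=1):
    """Build the argument list for one inference run (hypothetical helper)."""
    return [
        "python3", "run/main.py",
        "--dataset-type", "mcdataset",
        "--dataset", dataset,
        "--ood-ori-dataset", dataset,
        "--model-type", "causallm",
        "--model", "openlm-research/open_llama_7b",
        "--modelwrapper", "mcdropout",
        "--lr", "1e-4",
        "--batch-size", "4",
        "--opt", "adamw",
        "--warmup-ratio", "0.06",
        "--max-seq-len", "300",
        "--apply-classhead-lora",
        "--lora-r", "8",
        "--lora-alpha", "16",
        "--lora-dropout", "0.1",
        "--nowand",
        "--load-model-path", f"epoch{epoch}/open_llama_7b-1",
        "--load-checkpoint",
    ]


# Dry run: print each command instead of executing it.
for dataset in DATASETS:
    print("CUDA_VISIBLE_DEVICES=0 " + shlex.join(inference_command(dataset)))
```

Printing the commands first makes it easy to confirm that the checkpoint path matches the `--checkpoint-dic-name` used during fine-tuning.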

Hyperparameters:

 Target LLM: openlm-research/open_llama_7b, openlm-research/open_llama_3b, meta-llama/Llama-2-7b, huggyllama/llama-7b, EleutherAI/gpt-j-6b
 dataset: e.g. 'bookmia_bonlyseen_10'. bonlyseen: no seen files; the numeric suffix gives $N_{unseen}$ (10, 20, 30, or 40).
 modelwrapper: uncertainty estimation method.
           'mcdropout': MCD, 'blob': BLoB, 'deepensemble': Ensemble
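
All three modelwrapper options produce several stochastic predictions per input: dropout masks for MCD, posterior samples for BLoB, and ensemble members for Ensemble. The sketch below shows how such samples are commonly turned into uncertainty scores. It is a minimal illustration under assumptions: this README does not list which metrics the repository computes, and the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def predictive_entropy(prob_samples):
    """Entropy of the class distribution averaged over all stochastic passes."""
    n_classes = len(prob_samples[0])
    mean = [
        sum(sample[c] for sample in prob_samples) / len(prob_samples)
        for c in range(n_classes)
    ]
    return entropy(mean)


def expected_entropy(prob_samples):
    """Average entropy of the individual passes."""
    return sum(entropy(sample) for sample in prob_samples) / len(prob_samples)


def mutual_information(prob_samples):
    """Disagreement between passes: predictive entropy minus expected entropy."""
    return predictive_entropy(prob_samples) - expected_entropy(prob_samples)


# Hypothetical example: three passes over one input with two classes.
passes = [[0.9, 0.1], [0.8, 0.2], [0.95, 0.05]]
print(predictive_entropy(passes), mutual_information(passes))
```

When every pass returns the same distribution, the mutual information is zero. It grows as the passes disagree, and that disagreement is the signal an uncertainty-based detector can use.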

2. Calculation of uncertainty metrics.

 cd my_bayesian_peft/myexperiments
 python3 generate_uc_feature_csv.py  ## generates an uncertainty feature CSV file in my_bayesian_peft/myexperiments/feature_csv
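
To illustrate what a per-file uncertainty feature CSV might look like, the sketch below summarizes sampled probabilities for each file and writes one row per file. The column names (`file_id`, `mean_prob`, `var_prob`), the example data, and both helper functions are assumptions for illustration. The actual columns written by generate_uc_feature_csv.py are not documented here.

```python
import csv
import io
import statistics

FIELDNAMES = ["file_id", "mean_prob", "var_prob"]  # hypothetical column names


def uc_feature_rows(samples_by_file):
    """Summarize each file's sampled probabilities as mean and variance."""
    rows = []
    for file_id, probs in sorted(samples_by_file.items()):
        rows.append({
            "file_id": file_id,
            "mean_prob": statistics.fmean(probs),
            "var_prob": statistics.pvariance(probs),
        })
    return rows


def write_feature_csv(rows, handle):
    """Write the feature rows to an open text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Made-up samples: book_a is predicted consistently, book_b is not.
samples = {"book_a": [0.9, 0.9, 0.9], "book_b": [0.2, 0.8, 0.5]}
rows = uc_feature_rows(samples)
buffer = io.StringIO()
write_feature_csv(rows, buffer)
print(buffer.getvalue())
```

Writing to a handle rather than a fixed path keeps the sketch testable. Passing `open(path, "w", newline="")` instead of the `StringIO` buffer would produce a file on disk.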

3. Unseen file detection.

 No seen files:  python3 no_seen_files.py -detection_algorithm_type gmm  ## detection_algorithm_type: 'gmm', 'hierarchical', or 'kmeans'
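
The detection step groups files by their uncertainty features without labels. As a simplified stand-in for the gmm, hierarchical, and kmeans options, the sketch below runs two-cluster k-means on a single uncertainty score per file. It is not the repository's implementation: the function name and the scores are hypothetical, and the real feature set may have more than one dimension.

```python
def two_means_1d(scores, max_iter=100):
    """Split 1-D scores into a low cluster (label 0) and a high cluster (label 1)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    # Start the two centers at the extremes so the low cluster is never empty.
    low, high = min(scores), max(scores)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each score joins its nearest center.
        labels = [0 if abs(s - low) <= abs(s - high) else 1 for s in scores]
        group_low = [s for s, lab in zip(scores, labels) if lab == 0]
        group_high = [s for s, lab in zip(scores, labels) if lab == 1]
        # Update step: move each center to the mean of its group.
        new_low = sum(group_low) / len(group_low)
        new_high = sum(group_high) / len(group_high) if group_high else high
        if (new_low, new_high) == (low, high):
            break  # centers stopped moving
        low, high = new_low, new_high
    return labels, (low, high)


# Made-up uncertainty scores for five files.
scores = [0.05, 0.10, 0.08, 0.90, 0.95]
labels, centers = two_means_1d(scores)
print(labels, centers)
```

Following the paper's premise that LLMs recognize seen files with more certainty, the low-uncertainty cluster (label 0) would be the candidate set of seen files. That mapping is an assumption of this sketch.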

We also modified the code and wrote the generated feature CSV files to sota_tool/X_feature; the label files are in sota_tool/data/final_chunks. The meta-classifier model files are sota_tool/rf-7b.pkl and sota_tool/rf-3b.pkl.

Reproduce our experimental results:

 cd sota_tool
 python myexp.py
 ## experiment_type, 'train' or 'test'.
 ## testset: N_unseen. '10','20','30','40'.
 ## llm_type: open_llama_3b: 3b, open_llama_7b: 7b.

Reproduce our ablation results:

 cd ablation
 python my_experiments.py
 ## detection_algorithm_type, 'gmm'.
 ## nr: N_unseen. '10','20','30','40'.
 ## llm_type: llama-7b, llama2-7b.
