I am trying to reproduce the SciQ results from the SC'23 paper using EleutherAI's LM Evaluation Harness.
These are my results:

| Model | SciQ | PIQA |
|---|---|---|
| forge-bio | 0.788 | |
| forge-che | 0.821 | |
| forge-eng | 0.793 | |
| forge-mat | 0.777 | |
| forge-phy | 0.761 | |
| forge-soc | 0.82 | |
| forge-s1 | 0.787 | |
| forge-s2 | 0.783 | |
| forge-s3 | 0.805 | |
| forge-s4 | 0.86 | |
| forge-m1 | 0.82 | |
| forge-m2 | 0.574 | 0.5577 |
| forge-l | 0.242 | |

The highlighted scores (forge-m2 and forge-l) are much lower than the others and than what is expected from Table 8 of the paper. A quick check of the evaluation logs (data/eval/forge-m2) suggests that the forge-m2 scores are roughly those of the m2 checkpoint at iteration 1000, and the forge-l score probably comes from some very early checkpoint of forge-l.
I downloaded the checkpoints from the links in the README.md. I suspect that the Dropbox versions were somehow mixed up.
Command line:
lm_eval --model hf --model_args pretrained=forge-bio,parallelize=True --tasks sciq --device cuda
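
For anyone who wants to repeat the run across all checkpoints, the command above can be looped over the model names from the table. This is a minimal sketch, not the exact script used to produce the numbers: `MODELS`, `build_command`, and `run_all` are illustrative names, and it assumes each checkpoint sits in a local directory named after the model, as in the command above. Only the flags already shown in that command are used.

```python
import shlex
import subprocess

# Model directory names, taken from the results table (assumed to be local paths).
MODELS = [
    "forge-bio", "forge-che", "forge-eng", "forge-mat", "forge-phy",
    "forge-soc", "forge-s1", "forge-s2", "forge-s3", "forge-s4",
    "forge-m1", "forge-m2", "forge-l",
]


def build_command(model, task="sciq"):
    """Return the lm_eval argument list for one checkpoint, mirroring the command above."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model},parallelize=True",
        "--tasks", task,
        "--device", "cuda",
    ]


def run_all(models=MODELS, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for model in models:
        cmd = build_command(model)
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: only prints the commands that would be executed.
    run_all(dry_run=True)
```

With `dry_run=True` the script only prints the commands, which makes it easy to check that every checkpoint is evaluated with identical settings before launching the actual runs.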