Possibly wrong checkpoints for M2 and L

I am trying to reproduce the SciQ results from the SC'23 paper using EleutherAI's LM Evaluation Harness.

These are my results:

Model | SciQ | PIQA
-- | -- | --
forge-bio | 0.788 |  
forge-che | 0.821 |  
forge-eng | 0.793 |  
forge-mat | 0.777 |  
forge-phy | 0.761 |  
forge-soc | 0.82 |  
forge-s1 | 0.787 |  
forge-s2 | 0.783 |  
forge-s3 | 0.805 |  
forge-s4 | 0.86 |  
forge-m1 | 0.82 |  
forge-m2 | **0.574** | **0.5577**
forge-l | **0.242** |  

The highlighted scores are much lower than the others and than what Table 8 of the paper reports. A quick check of the evaluation logs (`data/eval/forge-m2`) suggests that the forge-m2 numbers roughly match the `m2` checkpoint at iteration 1000, and that the forge-l number probably comes from some very early checkpoint of forge-l.
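For reference, here is a minimal sketch of how I compared the numbers. The scores are copied from the table above. The 0.15 margin and the helper name `flag_suspicious` are my own arbitrary choices, and the chance level assumes uniform guessing over the four answer options of a SciQ question.

```python
import statistics

# SciQ accuracy per model, copied from the table in this issue.
sciq = {
    "forge-bio": 0.788, "forge-che": 0.821, "forge-eng": 0.793,
    "forge-mat": 0.777, "forge-phy": 0.761, "forge-soc": 0.82,
    "forge-s1": 0.787, "forge-s2": 0.783, "forge-s3": 0.805,
    "forge-s4": 0.86, "forge-m1": 0.82, "forge-m2": 0.574,
    "forge-l": 0.242,
}

# Assumption: uniform guessing over 4 answer options.
SCIQ_CHANCE = 1 / 4


def flag_suspicious(scores, margin=0.15):
    """Return the models whose score is more than `margin` below the median."""
    median = statistics.median(scores.values())
    return {name: s for name, s in scores.items() if median - s > margin}


suspicious = flag_suspicious(sciq)
for name, score in suspicious.items():
    # Show how far each flagged score is from the chance level.
    print(f"{name}: {score:.3f} ({score - SCIQ_CHANCE:+.3f} vs. chance)")
```

With these numbers only forge-m2 and forge-l are flagged, and forge-l lands almost exactly at the chance level, which is what I would expect from a barely trained checkpoint.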

I downloaded the checkpoints from the links in README.md. I suspect that the Dropbox versions were somehow mixed up.

Command line:
```
lm_eval --model hf --model_args pretrained=forge-bio,parallelize=True --tasks sciq --device cuda
```
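To make the run easier to reproduce, the sketch below builds the same command for every model in the table. It only prints the commands and does not run them. The `build_command` helper is hypothetical, and it assumes each checkpoint can be passed to `pretrained=` under the same name used in the table, as in the command above.

```python
import shlex

# Model names as they appear in the results table.
MODELS = [
    "forge-bio", "forge-che", "forge-eng", "forge-mat", "forge-phy",
    "forge-soc", "forge-s1", "forge-s2", "forge-s3", "forge-s4",
    "forge-m1", "forge-m2", "forge-l",
]


def build_command(model, tasks="sciq"):
    """Build the lm_eval argument list used in this issue for one model."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model},parallelize=True",
        "--tasks", tasks,
        "--device", "cuda",
    ]


commands = [build_command(m) for m in MODELS]
for cmd in commands:
    # shlex.join produces a shell-safe, copy-pasteable command line.
    print(shlex.join(cmd))
```

Each argument list can be handed to `subprocess.run(cmd, check=True)` to actually launch the evaluation.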


