Possibly wrong checkpoints for M2 and L #2

@jglaser

Description

I am trying to reproduce the SciQ results from the SC'23 paper using EleutherAI's LM Evaluation Harness.

These are my results:

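| Model | SciQ | PIQA |
|---|---|---|
| forge-bio | 0.788 | |
| forge-che | 0.821 | |
| forge-eng | 0.793 | |
| forge-mat | 0.777 | |
| forge-phy | 0.761 | |
| forge-soc | 0.82 | |
| forge-s1 | 0.787 | |
| forge-s2 | 0.783 | |
| forge-s3 | 0.805 | |
| forge-s4 | 0.86 | |
| forge-m1 | 0.82 | |
| forge-m2 | **0.574** | **0.5577** |
| forge-l | **0.242** | |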

The highlighted scores (forge-m2 and forge-l) are much lower than the others and than what Table 8 of the paper reports. A quick check of the evaluation logs (data/eval/forge-m2) suggests that the forge-m2 numbers roughly match the m2 checkpoint at iteration 1000, and that the forge-l score probably comes from some very early checkpoint of forge-l.

I downloaded the checkpoints from the links in the README.md. I suspect that the Dropbox versions were somehow mixed up.
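One way to confirm or rule out a mix-up would be to compare file digests of the downloaded checkpoints against the copies the maintainers hold. The following is only a sketch: it assumes GNU coreutils and findutils are installed, and that the checkpoint was unpacked into a local directory called `forge-m2` (the directory name and the `checkpoint_digests` helper are mine, not part of the repository).

```shell
#!/usr/bin/env bash
# Sketch: list SHA-256 digests for every file in a checkpoint directory.
# Assumption: GNU coreutils (sha256sum) and findutils are available.

checkpoint_digests() {
  # Sort the file list so two listings can be diffed line by line.
  find "$1" -type f -print0 | sort -z | xargs -0 -r sha256sum
}

# Example: only runs if the (assumed) local directory exists.
if [ -d forge-m2 ]; then
  checkpoint_digests forge-m2
fi
```

If the digests for forge-m2 and forge-l differ from those of the intended final checkpoints, that would point to the uploaded files rather than to my evaluation setup.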

Command line

 lm_eval --model hf --model_args pretrained=forge-bio,parallelize=True --tasks sciq --device cuda
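
For reference, the same call can be repeated over all checkpoints in the table with a small loop. This is a minimal sketch, not the exact script I ran: it assumes every checkpoint sits in a local directory named after the model, and `DRY_RUN` and `build_cmd` are hypothetical names introduced here. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: evaluate every checkpoint from the table with the same lm_eval call.
# Assumption: each checkpoint sits in a local directory named after the model.
# DRY_RUN (hypothetical switch, default 1) only prints the commands.

MODELS="forge-bio forge-che forge-eng forge-mat forge-phy forge-soc forge-s1 forge-s2 forge-s3 forge-s4 forge-m1 forge-m2 forge-l"
TASKS="sciq"

# Print the lm_eval command line for one model and one task list.
build_cmd() {
  printf 'lm_eval --model hf --model_args pretrained=%s,parallelize=True --tasks %s --device cuda\n' "$1" "$2"
}

for model in $MODELS; do
  cmd=$(build_cmd "$model" "$TASKS")
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$cmd"
  else
    $cmd   # unquoted on purpose: word-split into command and arguments
  fi
done
```

Setting `TASKS="sciq,piqa"` adds PIQA to each run, and `DRY_RUN=0` executes the commands instead of printing them.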
