CrabNet matbench results - possibly neglecting 25% of the training data it could have used #19

@sgbaird

@anthony-wang,

The CrabNet matbench notebook uses train/val/test splits. However, if #15 (comment) is correct and the validation data (i.e. val.csv) doesn't contribute to hyperparameter tuning, then that 25% of the training data is essentially being thrown away, correct?

In other words, the CrabNet results are based on only 75% of the training data that the other matbench models use for training. From what I understand, a train/val/test split in the context of matbench only really makes sense if you're doing hyperparameter optimization in a nested CV scheme, as illustrated in the automatminer documentation:

(Source: https://hackingmaterials.lbl.gov/automatminer/advanced.html)
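To make the nested CV idea concrete, here is a minimal index-only sketch. The fold counts (5 outer, 4 inner), the seed, and the helper names `k_folds` and `nested_cv_splits` are assumptions chosen for illustration, not the matbench or automatminer implementation. It shows that the validation folds are carved out of the outer training fold only, so a model trained on an inner training set sees 75% of the data that a model refit on the full outer training fold would see.

```python
import random


def k_folds(indices, k):
    """Deal indices into k folds of near-equal size."""
    return [indices[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=5, inner_k=4, seed=7):
    """Yield (outer_train, test, inner_splits) index lists for nested CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    outer = k_folds(indices, outer_k)
    for i, test in enumerate(outer):
        # Everything outside the held-out test fold is available for training.
        outer_train = [idx for j, fold in enumerate(outer) if j != i for idx in fold]
        inner_splits = []
        for val in k_folds(outer_train, inner_k):
            # The validation fold comes out of the outer training data only.
            val_set = set(val)
            inner_train = [idx for idx in outer_train if idx not in val_set]
            inner_splits.append((inner_train, val))
        yield outer_train, test, inner_splits


splits = list(nested_cv_splits(100))
outer_train, test, inner_splits = splits[0]
inner_train, val = inner_splits[0]
print(len(inner_train), len(val), len(test))  # 60 20 20
print(len(outer_train))  # 80
```

If no hyperparameters are tuned on the inner splits, nothing is gained from holding out the validation fold, and the final model could be trained on all 80 outer-training samples instead of 60.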

To correct this, I think all that needs to be done is change:

#split_train_val splits the training data into two sets: training and validation
def split_train_val(df):
  df = df.sample(frac = 1.0, random_state = 7)
  val_df = df.sample(frac = 0.25, random_state = 7)
  train_df = df.drop(val_df.index)

  return train_df, val_df

to

#split_train_val returns the full (shuffled) training data plus a 25% sample of it as a placeholder validation set
def split_train_val(df):
  train_df = df.sample(frac = 1.0, random_state = 7)
  val_df = df.sample(frac = 0.25, random_state = 7)

  return train_df, val_df

which deliberately lets data bleed between train_df and val_df (val_df becomes a 25% subset of train_df). In that case val_df is essentially just a dummy dataset so that CrabNet doesn't error out when a val.csv isn't available.
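The effect of the change can be checked on toy data. The sketch below copies the two versions of `split_train_val` under the hypothetical names `split_train_val_original` and `split_train_val_proposed`, and runs them on a made-up 100-row DataFrame (`toy_df`, with invented `formula` and `target` columns) to compare row counts and index overlap.

```python
import pandas as pd


def split_train_val_original(df):
    # Shuffle, sample 25% for validation, and train on the remaining 75%.
    df = df.sample(frac=1.0, random_state=7)
    val_df = df.sample(frac=0.25, random_state=7)
    train_df = df.drop(val_df.index)
    return train_df, val_df


def split_train_val_proposed(df):
    # Train on all rows (shuffled); val_df is a 25% sample of the same rows.
    train_df = df.sample(frac=1.0, random_state=7)
    val_df = df.sample(frac=0.25, random_state=7)
    return train_df, val_df


# Hypothetical toy data standing in for a matbench training fold.
toy_df = pd.DataFrame(
    {"formula": [f"X{i}" for i in range(100)], "target": range(100)}
)

orig_train, orig_val = split_train_val_original(toy_df)
new_train, new_val = split_train_val_proposed(toy_df)

# Rows shared between the training and validation sets.
orig_overlap = orig_train.index.intersection(orig_val.index)
new_overlap = new_train.index.intersection(new_val.index)

print(len(orig_train), len(orig_val), len(orig_overlap))  # 75 25 0
print(len(new_train), len(new_val), len(new_overlap))  # 100 25 25
```

With the proposed version every validation row also appears in the training set, so any metric computed on val_df is no longer an independent estimate and should only be treated as a placeholder.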

Sterling
