@anthony-wang,
The CrabNet matbench notebook uses train/val/test splits. However, if #15 (comment) is correct that the validation data (i.e. val.csv) doesn't contribute to hyperparameter tuning, then that 25% of the training data is essentially getting thrown away, correct?
In other words, the CrabNet results are based on only 75% of the training data that the other matbench models train on. From what I understand, a train/val/test split in the context of matbench only really makes sense if you're doing hyperparameter optimization in a nested CV scheme, as follows:

(Source: https://hackingmaterials.lbl.gov/automatminer/advanced.html)
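As a rough sketch of what a nested scheme looks like in terms of row indices: this is not automatminer's or CrabNet's actual implementation, and the `nested_splits` helper, the 5 outer folds, and the 25% inner validation fraction are all assumptions chosen for illustration.

```python
import random


def nested_splits(n_samples, n_outer=5, val_frac=0.25, seed=7):
    """Yield (train_idx, val_idx, test_idx) for each outer fold."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    # Outer loop: each fold is held out exactly once as the test set.
    folds = [idx[i::n_outer] for i in range(n_outer)]
    for k, test_idx in enumerate(folds):
        rest = [i for j, fold in enumerate(folds) if j != k for i in fold]
        # Inner split: val is carved out of the outer training data, which
        # only pays off if val is used to select hyperparameters.
        n_val = int(len(rest) * val_frac)
        val_idx, train_idx = rest[:n_val], rest[n_val:]
        yield train_idx, val_idx, test_idx


splits = list(nested_splits(100))
for train_idx, val_idx, test_idx in splits:
    print(len(train_idx), len(val_idx), len(test_idx))
```

If val never drives any hyperparameter choice, the inner split buys nothing and the model simply sees fewer training rows per outer fold.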
To correct this, I think all that needs to be done is change:
# split_train_val splits the training data into two sets: training and validation
def split_train_val(df):
    df = df.sample(frac=1.0, random_state=7)
    val_df = df.sample(frac=0.25, random_state=7)
    train_df = df.drop(val_df.index)
    return train_df, val_df
to
# split_train_val splits the training data into two sets: training and validation
def split_train_val(df):
    train_df = df.sample(frac=1.0, random_state=7)
    val_df = df.sample(frac=0.25, random_state=7)
    return train_df, val_df
which means val_df is just a subset of train_df (so there's data leakage between the two, and the validation metrics are no longer meaningful), but val_df ends up being essentially a dummy dataset so that CrabNet doesn't error out when a val.csv isn't available.
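A quick way to check the difference in row counts, assuming pandas and a made-up 100-row frame (the `formula`/`target` columns and their values are dummy data, and the `_original`/`_proposed` function names are only there to tell the two versions apart):

```python
import pandas as pd


def split_train_val_original(df):
    # Current behavior: 25% of the rows are removed from the training set.
    df = df.sample(frac=1.0, random_state=7)
    val_df = df.sample(frac=0.25, random_state=7)
    train_df = df.drop(val_df.index)
    return train_df, val_df


def split_train_val_proposed(df):
    # Proposed behavior: train on every row; val_df is only a placeholder.
    train_df = df.sample(frac=1.0, random_state=7)
    val_df = df.sample(frac=0.25, random_state=7)
    return train_df, val_df


# Dummy data, for illustration only.
df = pd.DataFrame({"formula": [f"X{i}" for i in range(100)],
                   "target": list(range(100))})

train_a, val_a = split_train_val_original(df)
train_b, val_b = split_train_val_proposed(df)

print(len(train_a), len(val_a))  # 75 25
print(len(train_b), len(val_b))  # 100 25
# Every validation row is also a training row in the proposed version.
print(val_b.index.isin(train_b.index).all())  # True
```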
Sterling