CrabNet matbench results - possibly neglecting 25% of the training data it could have used

@anthony-wang,

In the [CrabNet matbench notebook](https://github.com/materialsproject/matbench/blob/main/benchmarks/matbench_v0.1_CrabNet/notebook.ipynb), the data is split into train/val/test sets. However, if https://github.com/anthony-wang/CrabNet/issues/15#issuecomment-925519683 is correct and the validation data (i.e. `val.csv`) doesn't contribute to hyperparameter tuning, then that 25% of the training data is essentially getting thrown away, correct?

In other words, the CrabNet results are based on only 75% of the training data compared to what the other `matbench` models use for training. From what I understand, the train/val/test split in the context of `matbench` only really makes sense if you're doing hyperparameter optimization in a nested CV scheme, as follows:
<img src="https://hackingmaterials.lbl.gov/automatminer/_images/cv_nested.png" width="600">
(Source: https://hackingmaterials.lbl.gov/automatminer/advanced.html)
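
To make the nested CV idea concrete, here is a minimal index-level sketch in plain Python. It is not the `matbench` or automatminer implementation; the helper `k_fold_indices`, the toy dataset size, and the fold counts are all assumptions I made up for illustration. The point is only that validation folds are carved out of the outer training set, and that the final model for each outer fold gets refit on the whole outer training set (train + val).

```python
import random

def k_fold_indices(indices, k, seed=0):
    """Hypothetical helper: shuffle indices and yield (train, held_out) pairs."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out

n_samples = 20  # toy dataset size (assumption)
outer_summary = []
for outer_train, test in k_fold_indices(range(n_samples), k=5):
    # Inner loop: validation folds come from the outer training set only.
    for inner_train, val in k_fold_indices(outer_train, k=4, seed=1):
        # A hyperparameter candidate would be fit on inner_train and scored on val.
        assert not set(val) & set(test)  # test data never leaks into tuning
    # After tuning, the final model is refit on ALL of outer_train (train + val).
    outer_summary.append((len(outer_train), len(test)))

print(outer_summary)
```

In the notebook, by contrast (and assuming the linked comment is right), the 25% validation split is never used for tuning and never folded back into training.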

To correct this, I think all that needs to be done is change:
```python
#split_train_val splits the training data into two sets: training and validation
def split_train_val(df):
  df = df.sample(frac = 1.0, random_state = 7)
  val_df = df.sample(frac = 0.25, random_state = 7)
  train_df = df.drop(val_df.index)

  return train_df, val_df
```
to
```python
#split_train_val returns the full (shuffled) training data as train_df and a 25% sample of it as val_df
def split_train_val(df):
  train_df = df.sample(frac = 1.0, random_state = 7)
  val_df = df.sample(frac = 0.25, random_state = 7)

  return train_df, val_df
```
This change introduces data leakage between `train_df` and `val_df` (every row in `val_df` is also in `train_df`), but `val_df` ends up being essentially just a dummy dataset so that CrabNet doesn't error out when a `val.csv` isn't available.
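
As a quick sanity check on the row counts, here is a small sketch that runs both versions on a toy DataFrame. The names `split_original` and `split_proposed`, the toy data, and the `formula`/`target` column names are my own assumptions for illustration; this is not the actual matbench data or folds.

```python
import pandas as pd

def split_original(df):
    # Current notebook behavior: val_df rows are removed from train_df.
    df = df.sample(frac=1.0, random_state=7)
    val_df = df.sample(frac=0.25, random_state=7)
    train_df = df.drop(val_df.index)
    return train_df, val_df

def split_proposed(df):
    # Proposed behavior: train_df keeps every row; val_df is a 25% sample of it.
    train_df = df.sample(frac=1.0, random_state=7)
    val_df = df.sample(frac=0.25, random_state=7)
    return train_df, val_df

# Toy stand-in for one fold's training data (hypothetical values).
toy = pd.DataFrame({"formula": [f"X{i}" for i in range(100)], "target": range(100)})

train_a, val_a = split_original(toy)
train_b, val_b = split_proposed(toy)

print(len(train_a), len(val_a))  # original: train loses the validation rows
print(len(train_b), len(val_b))  # proposed: train keeps all rows
print(val_b.index.isin(train_b.index).all())  # val is a subset of train
```

With 100 toy rows, the original version should train on 75 rows and the proposed version on all 100, with `val_df` fully contained in `train_df`.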

Sterling
