OpenML Scikit-Learn Workflows

The .csv file contains data for 475,297 machine learning runs, derived directly from https://www.openml.org/.

In the terminology of our HILDA paper, a run corresponds to a single row in the .csv file, and a sequence is the group of runs that share the same user_id and task_id (that is, a groupby on those two columns).
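As a minimal sketch of the run/sequence relationship, the snippet below groups rows by (user_id, task_id) using only the Python standard library. The sample rows and the helper name `group_into_sequences` are hypothetical and exist only for illustration; only the column names come from this README. In practice you would read the actual .csv file instead of the inline sample.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; only the column names are taken from the README.
SAMPLE_CSV = """rid,user_id,task_id,auc,iter
101,7,31,0.81,1
102,7,31,0.84,2
103,9,31,0.79,1
104,7,59,0.90,1
"""


def group_into_sequences(rows):
    """Group run rows by (user_id, task_id) and order each group by iter."""
    sequences = defaultdict(list)
    for row in rows:
        sequences[(row["user_id"], row["task_id"])].append(row)
    for runs in sequences.values():
        # csv.DictReader yields strings, so convert iter before sorting.
        runs.sort(key=lambda r: int(r["iter"]))
    return dict(sequences)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
sequences = group_into_sequences(rows)
for (user_id, task_id), runs in sequences.items():
    print(user_id, task_id, [r["rid"] for r in runs])
```

With pandas, the equivalent grouping would be `df.groupby(["user_id", "task_id"])`.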

The columns in the .csv are as follows:

rid: Run id on OpenML
user_id: User id on OpenML
task_id: Task id on OpenML
auc: Area under ROC curve of the run
dist_from_mean_auc: Relative performance, i.e., the difference between the AUC of the run and the mean AUC of all runs for the same task_id
model: Set of Scikit-Learn classifiers/estimators and model wrappers used in the run
model_params: Set of model hyperparameters represented as (parameter name, parameter value) tuples
ppr: Set of preprocessing operators (note that "set()" means that no preprocessing was used)
ppr_params: Set of preprocessing hyperparameters represented as (parameter name, parameter value) tuples
iter: Iteration of the run in the sequence (starting from 1)
change_type: Type of change from the previous iteration:
- 'S': Starting iteration
- 'M': Model operator change
- 'P': Preprocessing operator change
- 'H': Model hyperparameter change
- 'R': Preprocessing hyperparameter change
- 'C': Combination of model and preprocessing changes (operator or hyperparameter)
- 'N': No change
delta_auc: Change in AUC from the previous iteration
start_time: Start time of the run
time_delta_in_mins: Difference, in minutes, between the start time of the current iteration and the start time of the previous iteration
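To illustrate how the per-sequence columns relate to one another, here is a rough sketch that recomputes iter, change_type, delta_auc, and time_delta_in_mins for one sequence. It is not the code used to build the dataset. Several points are assumptions: runs are ordered by start_time, an operator change takes precedence over a hyperparameter change on the same side, the starting iteration gets None for both deltas, and the set-valued columns have already been parsed into Python sets. The function names and sample values are hypothetical.

```python
from datetime import datetime


def classify_change(prev, cur):
    """Return a change_type code for cur relative to prev (assumed rules)."""
    if prev is None:
        return "S"
    model_op = prev["model"] != cur["model"]
    model_hp = prev["model_params"] != cur["model_params"]
    ppr_op = prev["ppr"] != cur["ppr"]
    ppr_hp = prev["ppr_params"] != cur["ppr_params"]
    # Assumption: any model-side change plus any preprocessing-side change is 'C'.
    if (model_op or model_hp) and (ppr_op or ppr_hp):
        return "C"
    # Assumption: an operator change outranks a hyperparameter change.
    if model_op:
        return "M"
    if ppr_op:
        return "P"
    if model_hp:
        return "H"
    if ppr_hp:
        return "R"
    return "N"


def annotate_sequence(runs):
    """Return copies of runs with the derived per-iteration columns added."""
    ordered = sorted(runs, key=lambda r: r["start_time"])
    annotated = []
    prev = None
    for i, run in enumerate(ordered, start=1):
        out = dict(run)
        out["iter"] = i  # iterations start from 1
        out["change_type"] = classify_change(prev, run)
        if prev is None:
            out["delta_auc"] = None
            out["time_delta_in_mins"] = None
        else:
            out["delta_auc"] = run["auc"] - prev["auc"]
            elapsed = run["start_time"] - prev["start_time"]
            out["time_delta_in_mins"] = elapsed.total_seconds() / 60
        annotated.append(out)
        prev = run
    return annotated


# Hypothetical sequence of three runs by one user on one task.
runs = [
    {"model": {"SVC"}, "model_params": {("C", "1.0")},
     "ppr": set(), "ppr_params": set(),
     "auc": 0.80, "start_time": datetime(2018, 1, 1, 12, 0)},
    {"model": {"SVC"}, "model_params": {("C", "10.0")},
     "ppr": set(), "ppr_params": set(),
     "auc": 0.85, "start_time": datetime(2018, 1, 1, 12, 30)},
    {"model": {"RandomForestClassifier"}, "model_params": {("n_estimators", "100")},
     "ppr": {"StandardScaler"}, "ppr_params": set(),
     "auc": 0.90, "start_time": datetime(2018, 1, 1, 13, 0)},
]

result = annotate_sequence(runs)
print([r["change_type"] for r in result])
```

In this sample, the second run changes only a model hyperparameter, and the third changes both the model and the preprocessing operators, so the codes come out as 'S', 'H', and 'C' under the assumed rules.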
