cannot run varimpact with multicore parallelisation #20

@DS-Rodrigues

Hi Chris,

Multicore parallelisation does not seem to work with varimpact for me, though I may be doing something wrong. varimpact now works with my current library (including learners with different parameters), but without parallelisation a run with just 2 CV folds has taken more than 24 hours and is still going. I have tried to create an example; please see below:

I am using macOS Big Sur, MacBook Air (M1, 2020), 16 GB RAM, 8 cores
R version 4.1.0

# dataset
my_outcome <- runif(300, min=0, max=500)
var_1 <- as.numeric(1:300)
var_2 <- runif(300)
var_3 <- runif(300, min=0, max=1)
var_4 <- as.factor(c(rep("0",30),rep("1",270)))
var_5 <- as.factor(c(rep("0",25),rep("1",275)))
my_dataset <- data.frame(var_1, var_2, var_3, var_4, var_5)
str(my_dataset)

# all libraries I am loading
library(tidyverse)
library(readxl)
library(openxlsx)
library(lubridate)
library(data.table)
library(optiRum)
library(gridExtra)
library(RColorBrewer)
library(anytime)
library(foreign)
library(ggplot2)
library(MASS)
library(Hmisc)
library(reshape2)
library(utils)
library(zoo)
library(lme4)
library(broom)
library(stats)
library(factoextra)
library(cluster)
library(pscl)
library(SuperLearner)
library(quadprog)
library(earth)
library(tmle)
library(xgboost)
library(randomForest)
library(ranger)
library(glmnet)
library(nnet)
library(kernlab)
library(KernelKnn)
library(varimpact)
library(hopach)
library(dbarts)
library(arm)
library(gam)

# Hyperparameter optimisation

# Fit elastic net with 5 different alphas: 0, 0.25, 0.5, 0.75, 1.0.
learners_glmnet = create.Learner("SL.glmnet", detailed_names = TRUE,
                                tune = list(alpha = seq(0, 1, length.out = 5)))

# 5 configurations
learners_glmnet$names

# Random forest via ranger
learners_ranger = create.Learner("SL.ranger", detailed_names = TRUE,
                                 tune = list(mtry = c(1), ntree = c(1000), nodesize=c(1,5,10)),
                                 name_prefix = "rgr")

# 3 configurations
learners_ranger$names

# XGBoost
learners_xgboost = create.Learner("SL.xgboost", detailed_names = TRUE,
                                  tune = list(ntrees = c(500, 1000), max_depth = c(2, 4), shrinkage = c(0.001, 0.01)),
                                  name_prefix = "xgb")

# 8 configurations
learners_xgboost$names

# earth - Multivariate adaptive regression splines
learners_earth = create.Learner("SL.earth", detailed_names = TRUE,
                              tune = list(degree = c(1,2)))

# 2 configurations
learners_earth$names

# svm - support vector machine
learners_svm = create.Learner("SL.ksvm", detailed_names = TRUE,
                              tune = list(kernel = c("rbfdot", "polydot"), C = c(0.01, 0.1, 1, 10, 100)),
                              name_prefix = "svm")

# 10 configurations
learners_svm$names

# SL library:
Q_lib <- c("SL.mean", "SL.glm", "SL.bayesglm", learners_earth$names,
           learners_glmnet$names, "tmle.SL.dbarts2",
           "SL.rpartPrune", learners_ranger$names, learners_svm$names,
           learners_xgboost$names)

g_lib <- c("SL.mean", "SL.glm", "SL.bayesglm", learners_earth$names,
           learners_glmnet$names, "tmle.SL.dbarts2",
           "SL.rpartPrune", learners_ranger$names, learners_svm$names,
           learners_xgboost$names)
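
For a sense of how much work each SuperLearner fit involves, here is a rough count of the learners in the library above. It is a base-R sketch that mirrors the tuning grids with expand.grid, on the assumption that create.Learner builds one learner per grid row; the names grids, n_configs, n_fixed and total_learners are only illustrative.

```r
# Tuning grids copied from the create.Learner() calls above.
grids <- list(
  glmnet  = list(alpha = seq(0, 1, length.out = 5)),
  ranger  = list(mtry = 1, ntree = 1000, nodesize = c(1, 5, 10)),
  xgboost = list(ntrees = c(500, 1000), max_depth = c(2, 4),
                 shrinkage = c(0.001, 0.01)),
  earth   = list(degree = c(1, 2)),
  svm     = list(kernel = c("rbfdot", "polydot"),
                 C = c(0.01, 0.1, 1, 10, 100))
)

# One configuration per row of the full grid (assumed to match create.Learner).
n_configs <- vapply(grids, function(g) nrow(expand.grid(g)), integer(1))

# Learners listed by name only:
# SL.mean, SL.glm, SL.bayesglm, tmle.SL.dbarts2, SL.rpartPrune
n_fixed <- 5L

total_learners <- sum(n_configs) + n_fixed
print(n_configs)
print(total_learners)
```

Every one of these learners is refit within each SuperLearner cross-validation fold, for both Q and g, so the library size alone may explain a large part of the sequential run time.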


library(future)
plan("multisession")

vim <- varimpact(Y = my_outcome, data = my_dataset, Q.library = Q_lib, g.library = g_lib, family = "gaussian",
                 adjust_cutoff = NULL, V = 2)

As a result I get:

Finished pre-processing variables.

Processing results:
- Factor variables: 1 
- Numeric variables: 3 

Estimating variable importance for 1 factors.
Error estimating g using SuperLearner. Defaulting to glm
Error estimating g using SuperLearner. Defaulting to glm
Error estimating g using SuperLearner. Defaulting to glm
Error estimating g using SuperLearner. Defaulting to glm


Estimating variable importance for 3 numerics.
Error estimating g using SuperLearner. Defaulting to glm
Error in training_estimates[[bin_j]] : subscript out of bounds
In addition: Warning messages:
1: glm.fit: algorithm did not converge 
2: glm.fit: fitted probabilities numerically 0 or 1 occurred 
3: glm.fit: algorithm did not converge 
4: glm.fit: fitted probabilities numerically 0 or 1 occurred 
5: glm.fit: algorithm did not converge 
6: glm.fit: fitted probabilities numerically 0 or 1 occurred 
7: glm.fit: algorithm did not converge 
8: glm.fit: fitted probabilities numerically 0 or 1 occurred 
9: `funs()` was deprecated in dplyr 0.8.0.
Please use a list of either functions or lambdas: 

  # Simple named list: 
  list(mean = mean, median = median)

  # Auto named with `tibble::lst()`: 
  tibble::lst(mean, median)

  # Using lambdas
  list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated. 

I closed my R session, started a new one, and obtained the same result with a "snow"-style cluster:

library(RhpcBLASctl)
library(future)

cl = parallel::makeCluster(get_num_cores())
plan("cluster", workers = cl)

vim <- varimpact(Y = my_outcome, data = my_dataset, Q.library = Q_lib, g.library = g_lib, family = "gaussian",
                 adjust_cutoff = NULL, V = 2)

Any ideas on what might be happening?

I think this is related to plan("multisession"). If I run plan("multicore"), I do not get those error messages, but I am not sure whether it is doing anything. Also, if I run plan("multiprocess"), I get the following message:

Warning messages:
1: Strategy 'multiprocess' is deprecated in future (>= 1.20.0). Instead, explicitly specify either 'multisession' or 'multicore'. In the current R session, 'multiprocess' equals 'multisession'. 
2: In supportsMulticoreAndRStudio(...) :
  [ONE-TIME WARNING] Forked processing ('multicore') is not supported when running R from RStudio because it is considered unstable. For more details, how to control forked processing or not, and how to silence this warning in future R sessions, see ?parallelly::supportsMulticore
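
One way to check whether a plan is really using separate worker processes is to compare process IDs. The snippet below is a minimal sketch that is independent of varimpact, assuming only the future and parallelly packages; the worker count of 2 and the variable names are arbitrary choices for illustration.

```r
library(future)

# FALSE means forked ("multicore") processing is unavailable in this
# session, e.g. when R is running inside RStudio.
multicore_ok <- parallelly::supportsMulticore()
print(multicore_ok)

# Start two background R sessions and ask each future for its process ID.
plan(multisession, workers = 2)
main_pid <- Sys.getpid()
futs <- lapply(1:2, function(i) future(Sys.getpid()))
worker_pids <- unique(vapply(futs, value, integer(1)))

print(nbrOfWorkers())  # number of workers in the current plan
print(main_pid)
print(worker_pids)     # should differ from main_pid under multisession

# Return to sequential processing and shut the workers down.
plan(sequential)
```

If the same check under plan("multicore") reports the main process ID, or shows a single worker, the futures are most likely being evaluated sequentially, which would fit the RStudio warning above.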

This whole problem might be related to the future package. Is there a way to pass parallel = "multicore" as an argument to varimpact, as we can for CV.SuperLearner? That form of parallelisation seems to work fine. With that in mind, I replaced SuperLearner::SuperLearner with SuperLearner::mcSuperLearner at tmle_estimate_q.R line 118 and at tmle_estimate_g.R line 78, and then ran these two R scripts on my computer after loading varimpact. I did not get any error message, but I am not sure whether it is working. Any advice?

Once again, thanks very much for your input on this!
