
Parent job in slurm exits before child jobs | nested slurm jobs #315

@ag1805x

I am trying to set up a small test project that uses batchtools on Slurm. The problem is that the parent job exits from Slurm before all of its child jobs have completed. How can I solve this issue?

The main R script that submits the jobs and the associated configuration files are as follows:

run_batchtools_job.R

library(batchtools)

reg <- makeRegistry(file.dir =  "slurm_registry", seed = 5081, conf.file = "Scripts/batch_tools_test/.batchtools.conf.R")

my_fun <- function(x) {
  Sys.sleep(x)  
  return(x^2)
}

ids <- batchMap(fun = my_fun, x = 100:150, reg = reg)
done <- submitJobs(ids = ids, reg = reg, resources = list(partition = "small", walltime = 86400, memory = 1024, ntasks = 1))
waitForJobs(ids = ids, reg = reg) 
getStatus(ids = ids, reg = reg)    

final_res <- reduceResultsList(ids = ids, reg = reg)
print(class(final_res))

.batchtools.conf.R

cluster.functions <- makeClusterFunctionsSlurm(template = "Scripts/batch_tools_test/slurm_config.tmpl", 
                                               array.jobs = TRUE, 
                                               scheduler.latency = 60,
                                               fs.latency = 30)
max.concurrent.jobs <- 5

slurm_config.tmpl

#!/bin/bash
#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --ntasks=<%= resources$ntasks %>
#SBATCH --mem=<%= resources$memory %>MB
#SBATCH --partition=<%= resources$partition %>

module load  r/4.3.3

Rscript -e 'batchtools::doJobCollection("<%= uri %>")'

I submit the run_batchtools_job.R script to Slurm using the following sbatch script.

run_batchtools.sh

#!/bin/bash
#SBATCH --job-name=batchtools_test
#SBATCH --output=batchtools_test.log
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --mem=2G
#SBATCH --partition=small

# Load R
module load  r/4.3.3

# Run your R script
Rscript Scripts/batch_tools_test/run_batchtools_job.R

I observed that the batchtools_test job exits before all the child jobs spawned by submitJobs have finished. As a result, final_res is empty.

When I checked getErrorMessages, several jobs were listed as 'not terminated'. However, when I manually inspected the logs and the results in the registry directory, every job had completed as expected.
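
For anyone trying to reproduce the check described above, a minimal sketch of how the registry could be inspected after the parent job has exited is shown here. It assumes batchtools is installed and that the registry and configuration file sit at the paths used in the post; `summarise_registry` is a hypothetical helper name, not part of batchtools.

```r
# Sketch only: assumes the registry directory and conf file from the post.
# `summarise_registry` is a hypothetical helper, not a batchtools function.
summarise_registry <- function(file.dir = "slurm_registry",
                               conf.file = "Scripts/batch_tools_test/.batchtools.conf.R") {
  # Load read-only so the inspection cannot modify the registry on disk
  reg <- batchtools::loadRegistry(file.dir = file.dir, conf.file = conf.file,
                                  writeable = FALSE)

  all_ids <- batchtools::findJobs(reg = reg)  # every job defined in the registry

  list(
    # Counts of jobs per state (submitted, started, done, error, ...)
    status = batchtools::getStatus(reg = reg),
    # Jobs without a recorded result
    not_done = batchtools::findNotDone(reg = reg),
    # Jobs that were submitted but are no longer known to the scheduler
    expired = batchtools::findExpired(reg = reg),
    # missing.as.error = TRUE imputes "[not terminated]" for unfinished jobs
    errors = batchtools::getErrorMessages(ids = all_ids, missing.as.error = TRUE,
                                          reg = reg)
  )
}

# Only run the inspection when the registry from the post is present
if (dir.exists("slurm_registry")) {
  print(summarise_registry())
}
```

Comparing this output with what the parent job logged may help show whether the jobs were still unfinished when the parent stopped waiting, or whether their status had simply not been picked up yet.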

How can I overcome this issue?
