I would like to run a job array via Slurm on an HPC cluster, intersecting individual circle shapefiles with a large shapefile of Census blocks and saving each resulting intersection as its own shapefile. I will then combine these individual shapefiles into one large one on my own machine. This is a way to avoid the parallelization problems I describe in an earlier question: mapply error on list from sf (simple features) object in R
However, when running the job array, I receive the following error:
sbatch: error: Batch job submission failed: Invalid job array specification
Here is a link to the R script, .sh file, and CSV of filenames I am using on my HPC cluster: https://github.com/msghankinson/slurm_job_array.
The R code relies on 3 files:
- "buffer" - these are the circle polygons. I've split a large shapefile of 3,086 circles into 3,086 individual shapefiles of 1 circle each (saved in /lustre/ in the "lihtc_bites" folder). The goal for the R script is to intersect 1 circle with the Census blocks in each run of the script, then save that intersection as a shapefile. I will then combine these 3,086 intersection shapefiles into one dataframe on my own laptop. For the reprex, I only include 2 of the 3,086 shapefiles.
- "lihtc" - this is a shapefile that I use as an index in my R function. There are 3 versions of this shapefile. Each circle shapefile matches one of these "lihtc" shapefiles. For the reprex, I only include the one shapefile which matches my 2 circle shapefiles.
- "blocks" - these are the 710,000 Census blocks. This file remains the same for each run of the R script, regardless of which circle is being used in the intersection. For the reprex, I only include a shapefile of the 7,386 blocks in San Francisco County.
I've run the R code on specific, individual buffer and lihtc shapefiles and the function works. So my main focus is the .sh file launching the job array ("lihtc_array_example.sh"). Here, I am trying to run my R script on each "buffer" shapefile, using the task ID and "master_example.csv" (also in the reprex) to define which files are loaded into R. Each row of master_example.csv contains the buffer filename and the lihtc filename I need. These filenames need to be passed to the R script and used to load the correct files for each intersection. E.g., Task 1 loads the files listed in row 1 of master_example.csv. The code I found tries to pull these names in the .sh file via:
shp_filename=$( echo "$line_N" | cut -d "," -f 2 )
lihtc_filename=$( echo "$line_N" | cut -d "," -f 3 )
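For illustration, here is how those cut calls behave on a hypothetical CSV row (the real rows live in master_example.csv; the filenames below are made up):
# Hypothetical row: task index, buffer filename, lihtc filename
line_N="1,buffer_1.shp,lihtc_1.shp"
shp_filename=$( echo "$line_N" | cut -d "," -f 2 )   # -> buffer_1.shp
lihtc_filename=$( echo "$line_N" | cut -d "," -f 3 ) # -> lihtc_1.shp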
While I understand that it is difficult to run the reprex, are there any clear breakdowns in the pipeline between the .sh file, the CSV of names, and the R script? I am happy to provide any additional information that may be helpful.
Full .sh file, for ease of access:
#!/bin/bash
#SBATCH -t 2:00:00
#SBATCH -p defq
#SBATCH -N 1
#SBATCH -o jobArrayScript_%A_%a.out
#SBATCH -e jobArrayScript_%A_%a.err
#SBATCH -a 1-308600
line_N=$( awk "NR==$SLURM_ARRAY_TASK_ID" master_example.csv ) # NR means row-# in Awk
shp_filename=$( echo "$line_N" | cut -d "," -f 2 )
lihtc_filename=$( echo "$line_N" | cut -d "," -f 3 )
module load R/4.1.1
module load libudunits2/2.2.28
module load gdal/3.5.0
module load proj/6.3.0
module load geos/3.10.3
Rscript slurm_job_array.R $shp_filename $lihtc_filename
For reference:
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.0.9 ggmap_3.0.0 ggplot2_3.3.6 sf_1.0-7
loaded via a namespace (and not attached):
[1] xfun_0.28 tidyselect_1.1.2 purrr_0.3.4 lattice_0.20-45 colorspace_2.0-3 vctrs_0.4.1 generics_0.1.2
[8] htmltools_0.5.2 s2_1.0.7 utf8_1.2.2 rlang_1.0.2 e1071_1.7-9 pillar_1.7.0 glue_1.6.2
[15] withr_2.5.0 DBI_1.1.1 sp_1.4-6 wk_0.5.0 jpeg_0.1-9 lifecycle_1.0.1 plyr_1.8.7
[22] stringr_1.4.0 munsell_0.5.0 gtable_0.3.0 RgoogleMaps_1.4.5.3 evaluate_0.15 knitr_1.36 fastmap_1.1.0
[29] curl_4.3.2 class_7.3-19 fansi_1.0.3 highr_0.9 Rcpp_1.0.8.3 KernSmooth_2.23-20 scales_1.2.0
[36] classInt_0.4-3 farver_2.1.0 rjson_0.2.20 png_0.1-7 digest_0.6.29 stringi_1.7.6 grid_4.1.0
[43] cli_3.3.0 tools_4.1.0 bitops_1.0-7 magrittr_2.0.3 proxy_0.4-26 tibble_3.1.7 crayon_1.5.1
[50] tidyr_1.2.0 pkgconfig_2.0.3 ellipsis_0.3.2 assertthat_0.2.1 rmarkdown_2.11 httr_1.4.2 rstudioapi_0.13
[57] R6_2.5.1 units_0.7-2 compiler_4.1.0
CodePudding user response:
3 problems identified and now solved:
1. Max array size refers to the entire array, i.e. the highest task index the array specification may contain; the throttle (the % suffix on -a) just sets how many jobs get scheduled at one time. So I needed to break my 3,086-task job into 4 separate batches. This can be done in the .sh file as #SBATCH -a 1-999 for job 1, #SBATCH -a 1000-1999 for job 2, and so on.
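As a sketch of what those batch submissions look like (assuming the cluster's MaxArraySize is around 1,000, and using the fact that a command-line --array overrides the #SBATCH -a directive inside the script):
# Check the cluster's cap on the largest array index
scontrol show config | grep -i MaxArraySize

# Submit the 3,086 tasks in four batches that each stay under the cap
sbatch --array=1-999     lihtc_array_example.sh
sbatch --array=1000-1999 lihtc_array_example.sh
sbatch --array=2000-2999 lihtc_array_example.sh
sbatch --array=3000-3086 lihtc_array_example.sh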
2. The R script needs to catch the arguments from the command line. The script now begins:
args = commandArgs(trailingOnly = TRUE)
shp_filename <- args[1]
lihtc_filename <- args[2]
3. The submission file was sending the arguments with quotation marks around them, which was preventing paste0 from creating usable file names. Neither noquote() nor print(x, quote = FALSE) was able to remove these quotes, since both only change how a string is printed, not the string itself. However, gsub('"', '', x) worked.
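A shell-side alternative I didn't end up using: the quotes could also be stripped in the .sh file before the names ever reach R, by piping the cut output through tr:
shp_filename=$( echo "$line_N" | cut -d "," -f 2 | tr -d '"' )
lihtc_filename=$( echo "$line_N" | cut -d "," -f 3 | tr -d '"' )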
An inelegant/lazy parallelization on my part, but it works. Case closed.