I have a Nextflow script with a channel for paired file inputs. I am trying to extract a substring from the file paths to use as part of the shell call. I would like to use Groovy's regex matching for this, but because the match is based on an input value, I am having trouble executing it. An alternative would be to perform the regex in bash as part of the process shell call, but I am interested in figuring out how to manipulate inputs within a process anyway, as I feel it would be useful for other things too. How can I run intermediate Groovy code on the process inputs prior to the shell call?
process alignment {
    input:
    val files

    output:
    stdout

    def matcher = "${files[1][0]}" =~ /.+\/bcl2fastq_out\/([^\/]+)\/.+/
    // the line above is the culprit: if I hardcode the string, it works
    def project = matcher.findAll()[0][1]
    """
    echo ${project}
    """
}
workflow {
    files = Channel
        .fromFilePairs("${params.out_dir}/**{_R1,_R2}_00?.fastq.gz", checkIfExists: true, size: 2)
    alignment(files)
}
When I execute this, I get the error:
No such variable: files
An example input string would look like extractions/test/bcl2fastq_out/project1/example_L001_R1_001.fastq.gz, where I'm trying to extract the project1 substring.
Answer:
I figured it out. Instead of jumping straight into the shell script with the triple quotes, you can start the process execution block with "script:" and then run Groovy code on the process inputs before the script string:
process alignment {
    input:
    val files

    output:
    stdout

    script:
    // Groovy code placed after "script:" can use the process inputs
    test = (files[1][0] =~ '.+/test/([^/]+)/.+').findAll()[0][1]
    """
    echo $test
    """
}
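The pattern itself can be checked outside Nextflow first. The following is a minimal plain-Groovy sketch, assuming the example path and the bcl2fastq_out pattern from the question; the guard against a non-matching path is an addition for illustration, not part of the original answer.

```groovy
// Example path taken from the question
def path = 'extractions/test/bcl2fastq_out/project1/example_L001_R1_001.fastq.gz'

// Capture the directory name that directly follows bcl2fastq_out/
def matcher = path =~ /.+\/bcl2fastq_out\/([^\/]+)\/.+/

// Return null for a non-matching path instead of indexing into an empty result
def project = matcher.matches() ? matcher.group(1) : null

println project
```

Inside a process, the same expression would go after "script:", with the input value in place of the hardcoded path.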
Answer:
As you've already discovered, you can declare variables in the script block. Note that these are global (within the process scope) unless you define them using the def keyword. If you don't need these elsewhere in your process definition, like in your example, a local variable (using def) is usually preferable. If, however, you need to access these in your output declaration, for example, then they will need to be global.
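As a rough sketch of that difference (the names out_name and greeting and the .txt output are hypothetical, chosen only for illustration), a process could look something like this:

```groovy
process alignment {
    input:
    val sample

    output:
    path "${out_name}"

    script:
    // No def: out_name is visible across the process, so output: can use it
    out_name = "${sample}.txt"
    // def: greeting is local to the script block
    def greeting = "processing ${sample}"
    """
    echo "${greeting}" > ${out_name}
    """
}
```

Declaring out_name with def here would be expected to break the output declaration, since the variable would no longer be visible outside the script block.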
Note that the fromFilePairs factory method emits a tuple, where the first element is a group key and the second element is a list of files. The problem with just using val to declare the inputs is that the files in the second element will not be localized to the working directory when your script is run. To fix this, you can just change your input definition to something like:
input:
    tuple val(sample), path(fastq_files)
The problem with this approach is that we're unable to extract the parent directory name from the localized filenames, so you will need to pass it in somehow. Usually, you could just get the parent name from the first file in the tuple, using:
params.input_dir = './path/to/files'
params.pattern = '**_R{1,2}_00?.fastq.gz'

process alignment {
    debug true

    input:
    tuple val(sample), val(project), path(fastq_files)

    """
    echo "${sample}: ${project}: ${fastq_files}"
    """
}

workflow {
    Channel
        .fromFilePairs( "${params.input_dir}/${params.pattern}" )
        .map { sample, reads ->
            def project = reads[0].parent.name
            tuple( sample, project, reads )
        }
        .set { reads }

    alignment( reads )
}
But since the glob pattern has an additional wildcard, i.e. _00?, you may not necessarily get the results you expect. For example:
$ mkdir -p path/to/files/project{1,2,3}
$ touch path/to/files/project1/sample1_R{1,2}_00{1,2,3,4}.fastq.gz
$ touch path/to/files/project2/sample2_R{1,2}_00{1,2,3,4}.fastq.gz
$ touch path/to/files/project3/sample3_R{1,2}_00{1,2,3,4}.fastq.gz
$ nextflow run main.nf
N E X T F L O W ~ version 22.04.0
Launching `main.nf` [determined_roentgen] DSL2 - revision: f80ab33ac8
executor > local (12)
[a8/9235cc] process > alignment (12) [100%] 12 of 12 ✔
sample2: project2: sample2_R1_001.fastq.gz sample2_R1_004.fastq.gz
sample1: project1: sample1_R1_003.fastq.gz sample1_R2_001.fastq.gz
sample1: project1: sample1_R1_004.fastq.gz sample1_R2_003.fastq.gz
sample3: project3: sample3_R1_001.fastq.gz sample3_R2_001.fastq.gz
sample1: project1: sample1_R1_001.fastq.gz sample1_R2_004.fastq.gz
sample1: project1: sample1_R1_002.fastq.gz sample1_R2_002.fastq.gz
sample2: project2: sample2_R1_002.fastq.gz sample2_R2_002.fastq.gz
sample2: project2: sample2_R2_001.fastq.gz sample2_R2_004.fastq.gz
sample2: project2: sample2_R1_003.fastq.gz sample2_R2_003.fastq.gz
sample3: project3: sample3_R2_002.fastq.gz sample3_R2_004.fastq.gz
sample3: project3: sample3_R1_003.fastq.gz sample3_R1_004.fastq.gz
sample3: project3: sample3_R1_002.fastq.gz sample3_R2_003.fastq.gz
Fortunately, we can supply a custom file pair grouping strategy using a closure. This uses the readPrefix helper function:
workflow {
    Channel
        .fromFilePairs( "${params.input_dir}/${params.pattern}" ) { file ->
            // Build a grouping key of the form "<project>/<sample>_<lane suffix>"
            def prefix = Channel.readPrefix(file, params.pattern)
            def suffix = file.simpleName.tokenize('_').last()
            "${file.parent.name}/${prefix}_${suffix}"
        }
        .map { key, reads ->
            def (project, sample) = key.tokenize('/')
            tuple( sample, project, reads )
        }
        .set { reads }

    alignment( reads )
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.04.0
Launching `main.nf` [loving_cantor] DSL2 - revision: 5a76ac712f
executor > local (12)
[f4/74edbc] process > alignment (12) [100%] 12 of 12 ✔
sample1_003: project1: sample1_R1_003.fastq.gz sample1_R2_003.fastq.gz
sample2_002: project2: sample2_R1_002.fastq.gz sample2_R2_002.fastq.gz
sample1_002: project1: sample1_R1_002.fastq.gz sample1_R2_002.fastq.gz
sample2_003: project2: sample2_R1_003.fastq.gz sample2_R2_003.fastq.gz
sample2_004: project2: sample2_R1_004.fastq.gz sample2_R2_004.fastq.gz
sample2_001: project2: sample2_R1_001.fastq.gz sample2_R2_001.fastq.gz
sample1_001: project1: sample1_R1_001.fastq.gz sample1_R2_001.fastq.gz
sample1_004: project1: sample1_R1_004.fastq.gz sample1_R2_004.fastq.gz
sample3_001: project3: sample3_R1_001.fastq.gz sample3_R2_001.fastq.gz
sample3_004: project3: sample3_R1_004.fastq.gz sample3_R2_004.fastq.gz
sample3_002: project3: sample3_R1_002.fastq.gz sample3_R2_002.fastq.gz
sample3_003: project3: sample3_R1_003.fastq.gz sample3_R2_003.fastq.gz