I am trying to run the program Unicycler on multiple sets of fastq.gz files. Each assembly requires a set of three fastq.gz files that share a common ID, and each set is located in its own subdirectory whose name contains the same ID.
For example, the three files QC_141696.fastq.gz, QC_141696_1.fastq.gz and QC_141696_2.fastq.gz are required to run the assembly and are located in the subdirectory assem_141696. I have 10 more sets of three files organised in the same way; all 11 subdirectories are named assem_(ID) and are located in the parent directory Assemblies:
Sequencing/Assemblies/assem_(IDset1)/QC_(IDset1).fastq.gz
Sequencing/Assemblies/assem_(IDset1)/QC_(IDset1)_1.fastq.gz
Sequencing/Assemblies/assem_(IDset1)/QC_(IDset1)_2.fastq.gz
An example of the command I am trying to run, outside of a loop, is:
unicycler --short1 QC_141696_1.fastq.gz --short2 QC_141696_2.fastq.gz --long QC_141696.fastq.gz --out QC_141696_hybrid --threads 16
I want to loop through each of the assem_(IDset*) subdirectories and run Unicycler using the three files located within it. The output directory should be created in the relevant assem_(IDset*) subdirectory.
This is the code that I have so far:
for file in Assemblies/assem*/*_1.fastq.gz;
do base=$(basename ${file} _1.fastq.gz)
echo "running unicycler hybrid assembly on ${base}"
unicycler --short1 ${base}_1.fastq.gz --short2 ${base}_2.fastq.gz --long ${base}.fastq.gz --out ${base}_hybridassem --threads 16
echo "unicycler assembly on ${base} finished"
done
I am running the code from within the Sequencing directory, but I get:
Error: could not find home/user/scratch/Sequencing/QC_181651_1.fastq.gz
So it seems that my code is not looping through the intended directories. Annoyingly it works fine when testing it with echo.
Any help would be greatly appreciated!
CodePudding user response:
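Your loop does visit the intended directories, but basename strips the directory part from each match, so unicycler looks for the files in the current directory (Sequencing) rather than in the subdirectory. Because the code runs from the Sequencing directory, it needs to build the paths to the input and output files for each of the assem_(IDset*) subdirectories. You can use the bash parameter expansion dir=${file%\/*} to extract the directory in your loop. (Note also that the base variable was renamed to id):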
#!/bin/bash
for file in Assemblies/assem*/*_1.fastq.gz ; do
id=$(basename "${file}" _1.fastq.gz)
dir=${file%\/*}
echo "running unicycler hybrid assembly on ${id}"
unicycler --short1 "${dir}/${id}_1.fastq.gz" --short2 "${dir}/${id}_2.fastq.gz" --long "${dir}/${id}.fastq.gz" --out "${dir}/${id}_hybridassem" --threads 16
echo "unicycler assembly on ${id} finished"
done
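
To see what the two expansions produce without launching an assembly, here is a minimal sketch that applies them to a single path. The sample path is only an assumption based on the directory layout described in the question; nothing here calls unicycler.

```shell
#!/bin/bash
# Hypothetical sample path, following the layout described in the question.
file="Assemblies/assem_141696/QC_141696_1.fastq.gz"

# Remove the shortest trailing "/..." match, leaving the directory part.
dir=${file%/*}

# Remove the directory part and the "_1.fastq.gz" suffix, leaving the ID.
id=$(basename "${file}" _1.fastq.gz)

echo "dir=${dir}"
echo "id=${id}"
echo "long reads: ${dir}/${id}.fastq.gz"
```

This also explains why testing with echo looked fine: echo prints whatever string it is given, whereas unicycler has to open the files, and a bare ${id}_1.fastq.gz only resolves if the file sits in the current working directory.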