Bash - running a loop on multiple subdirectories and running a command on files within the subdirect

Time:10-29

I am trying to run the programme Unicycler on multiple sets of fastq.gz files. Each assembly requires a set of three fastq.gz files that share a common ID, and each set is located in its own subdirectory whose name contains the same ID.

For example, the three files QC_141696.fastq.gz, QC_141696_1.fastq.gz and QC_141696_2.fastq.gz are required to run the assembly and are located in the subdirectory assem_141696. I have 10 more sets of 3 files organised in the same way; all 11 subdirectories are named assem_(ID) and are located in the parent directory Assemblies.

Sequencing/Assemblies/assem_(IDset1)/QC_(IDset1).fastq.gz
Sequencing/Assemblies/assem_(IDset1)/QC_(IDset1)_1.fastq.gz
Sequencing/Assemblies/assem_(IDset1)/QC_(IDset1)_2.fastq.gz

An example of the command I am trying to run, outside a loop, is:

unicycler --short1 QC_141696_1.fastq.gz --short2 QC_141696_2.fastq.gz --long QC_141696.fastq.gz --out QC_141696_hybrid --threads 16

I want to loop through each of the assem_(IDset*) subdirectories and run Unicycler using the three files located within it. The output directory should be created in the relevant assem_(IDset*) subdirectory.

This is the code that I have so far:

for file in Assemblies/assem*/*_1.fastq.gz;
    do base=$(basename ${file} _1.fastq.gz)
     echo "running unicycler hybrid assembly on ${base}"
     unicycler --short1 ${base}_1.fastq.gz --short2 ${base}_2.fastq.gz --long ${base}.fastq.gz --out ${base}_hybridassem --threads 16
     echo "unicycler assembly on ${base} finished"
    
done

I am running the code from within the Sequencing directory.

But I get:

Error: could not find home/user/scratch/Sequencing/QC_181651_1.fastq.gz 

So it seems that my code is not looping through the intended directories. Annoyingly it works fine when testing it with echo.

Any help would be greatly appreciated!

CodePudding user response:

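The glob finds the files, but basename strips the directory part, so Unicycler is handed bare file names and looks for them in the directory you run the loop from (Sequencing), which is the path shown in the error. Your code will need to build the full paths to the input and output files for each of the assem_(IDset*) subdirectories. You can use bash parameter expansion (the % operator, which removes the shortest matching suffix) to extract the directory from ${file} in your loop. (Note also that the base variable was renamed to id):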

#!/bin/bash

for file in Assemblies/assem*/*_1.fastq.gz ; do
    # Sample ID: file name without the directory and the _1.fastq.gz suffix
    id=$(basename "${file}" _1.fastq.gz)
    # Directory holding the reads: drop the last / and everything after it
    dir=${file%/*}
    echo "running unicycler hybrid assembly on ${id}"
    unicycler --short1 "${dir}/${id}_1.fastq.gz" --short2 "${dir}/${id}_2.fastq.gz" --long "${dir}/${id}.fastq.gz" --out "${dir}/${id}_hybridassem" --threads 16
    echo "unicycler assembly on ${id} finished"
done
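
Before starting 11 long assemblies, it may help to check the paths with a dry run. The following is only a sketch: print_unicycler_cmd is a hypothetical helper name (not part of Unicycler) that prints the command for one read-1 file instead of running it, using the same flags as the loop above and assuming the Assemblies/assem_(ID)/QC_(ID)_1.fastq.gz layout from the question.

```shell
#!/bin/bash

# Hypothetical helper: print (do not run) the unicycler command
# that would be built for one *_1.fastq.gz path.
print_unicycler_cmd() {
    local file="$1"
    local dir="${file%/*}"           # drop the last / and what follows -> directory
    local name="${file##*/}"         # drop everything up to the last / -> file name
    local id="${name%_1.fastq.gz}"   # drop the read-1 suffix -> sample ID
    printf 'unicycler --short1 %s --short2 %s --long %s --out %s --threads 16\n' \
        "${dir}/${id}_1.fastq.gz" "${dir}/${id}_2.fastq.gz" \
        "${dir}/${id}.fastq.gz" "${dir}/${id}_hybridassem"
}

# Run from the Sequencing directory; nothing is executed, only printed.
for file in Assemblies/assem*/*_1.fastq.gz; do
    # If the glob matches nothing, bash leaves the pattern as-is; skip it.
    [ -e "$file" ] || continue
    print_unicycler_cmd "$file"
done
```

If every printed path points inside the right assem_(ID) subdirectory, the same expansions can be used in the real loop.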