How to loop over multiple folders to concatenate FastQ files?

Time:02-26

I have received multiple fastq.gz files from Illumina sequencing for 100 samples. The fastq.gz files for each sample are in a separate folder named after the sample ID, and each sample has multiple (8-16) R1.fastq.gz and R2.fastq.gz files. I used the following command to concatenate all the R1.fastq.gz and R2.fastq.gz files into a single R1.fastq.gz and a single R2.fastq.gz (shown here for R1):

cat V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz > sample_R1.fq.gz

The sequencing file names follow the pattern shown in the command above. For each sample, the name starts with V followed by a number, then L followed by another number, and then another string of digits before the _1 or _2. These numbers change from sample to sample. My question is: how can I write a loop that goes over all the folders at once, takes the differing file numbering into account, and concatenates the multiple fq.gz files in each folder into a single R1 and a single R2 file? I would rather not go into each sample folder and concatenate the files one by one.

Please give some helpful tips. Thank you.
The folder structure is the following:

/data/Sample_1/....._525_1_fq.gz    /....._525_2_fq.gz    /....._526_1_fq.gz        /....._526_2_fq.gz
/data/Sample_2/....._580_1_fq.gz    /....._580_2_fq.gz    /....._589_1_fq.gz        /....._589_2_fq.gz
/data/Sample_3/....._690_1_fq.gz    /....._690_2_fq.gz    /....._645_1_fq.gz        /....._645_2_fq.gz

Below I have attached a screenshot of the folder structure.

[Screenshot: folder structure]

CodePudding user response:

Based on the provided file structure, would you please try:

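#!/bin/bash

# Loop over every sample directory under Raw2/ whose name starts with C.
for d in Raw2/C*/; do
(
    cd "$d" || exit                     # leave the subshell (skip this directory) if cd fails
    id=${d%/}; id=${id##*/}             # extract ID from the directory name
    cat V*_1.fq.gz > "${id}_R1.fq.gz"   # merge all _1 (R1) files
    cat V*_2.fq.gz > "${id}_R2.fq.gz"   # merge all _2 (R2) files
)
done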
  • The syntax for d in Raw2/C*/ loops over the subdirectories of Raw2 whose names start with C.
  • The parentheses run the inner commands in a subshell, so there is no need to cd back afterwards (at the cost of a small amount of extra execution time).
  • The variable id holds the sample ID extracted from the directory name.
  • cat V*_1.fq.gz, for example, expands to V350028825_L04_581_1.fq.gz V350028825_L04_582_1.fq.gz V350028825_L04_583_1.fq.gz ... according to the files in the directory, and these are concatenated into ${id}_R1.fq.gz. The same applies to ${id}_R2.fq.gz.
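If some sample folders might be empty or incomplete, a slightly more defensive variant of the same loop could look like the sketch below. It is only an illustration under assumptions that are not in the original post: the function name concat_sample, the output directory merged, and the policy of skipping a folder when the numbers of _1 and _2 files differ are all hypothetical choices. The Raw2/C*/ pattern is taken from the answer above and would need adjusting to the real layout (for example /data/Sample_*/). The sketch relies on two properties: glob results are sorted, so the _1 and _2 files are concatenated in the same order, and gzip files joined with cat still decompress as one stream.

```shell
#!/bin/bash
# Sketch: merge per-sample paired-end chunks with basic sanity checks.
# Hypothetical names: concat_sample, the "merged" output directory.

# concat_sample SAMPLE_DIR OUT_DIR
# Writes OUT_DIR/<id>_R1.fq.gz and OUT_DIR/<id>_R2.fq.gz for one sample folder.
# The body runs in a subshell, so the nullglob setting does not leak out.
concat_sample() (
    shopt -s nullglob                # unmatched globs expand to nothing
    dir=${1%/}                       # strip a trailing slash, if any
    out=$2
    id=${dir##*/}                    # sample ID = last path component
    r1=("$dir"/V*_1.fq.gz)           # sorted list of _1 files
    r2=("$dir"/V*_2.fq.gz)           # sorted list of _2 files

    # Skip folders with no reads or an unequal number of _1 and _2 files.
    if (( ${#r1[@]} == 0 || ${#r1[@]} != ${#r2[@]} )); then
        echo "skipping $id: ${#r1[@]} _1 files vs ${#r2[@]} _2 files" >&2
        return 1
    fi

    mkdir -p "$out" || return 1
    cat "${r1[@]}" > "$out/${id}_R1.fq.gz" &&
    cat "${r2[@]}" > "$out/${id}_R2.fq.gz"
)

# Driver loop, using the directory pattern from the answer above.
for d in Raw2/C*/; do
    [ -d "$d" ] || continue          # the pattern matched nothing
    concat_sample "$d" merged
done
```

Writing the merged files to a separate directory keeps the original folders untouched, so the script can be re-run without the outputs ever being picked up as inputs. A quick check such as comparing the line counts of gzip -dc merged/<id>_R1.fq.gz and gzip -dc merged/<id>_R2.fq.gz can confirm that the two files still contain the same number of reads.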