I have received multiple fastq.gz files from Illumina sequencing for 100 samples. The files for each sample are in a separate folder named after the sample ID, and each sample has multiple (8-16) R1.fastq.gz and R2.fastq.gz files. So I used the following command to concatenate all the R1.fastq.gz and R2.fastq.gz files into a single R1.fastq.gz and a single R2.fastq.gz:
cat V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz > sample_R1.fq.gz
The sequencing file names follow the structure shown in the command above. For each sample, the string starting with V has a different number, followed by L with a different number, and then another string of digits before the _1 or _2. These numbers change from sample to sample.
My question is: how can I create a loop that goes over all the folders at once, takes the varying file names into account, and concatenates the multiple fq.gz files of each sample into a single R1 and a single R2 file?
Surely, I cannot just concatenate one by one by going into each sample folder.
Please give some helpful tips. Thank you.
The folder structure is the following:
/data/Sample_1/....._525_1_fq.gz /....._525_2_fq.gz /....._526_1_fq.gz /....._526_2_fq.gz
/data/Sample_2/....._580_1_fq.gz /....._580_2_fq.gz /....._589_1_fq.gz /....._589_2_fq.gz
/data/Sample_3/....._690_1_fq.gz /....._690_2_fq.gz /....._645_1_fq.gz /....._645_2_fq.gz
Below I have attached a screenshot of the folder structure.
CodePudding user response:
Based on the provided file structure, would you please try:
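#!/bin/bash
for d in Raw2/C*/; do
    (
        cd "$d" || exit                    # skip this directory if cd fails
        id=${d%/}; id=${id##*/}            # extract ID from the directory name
        cat V*_1.fq.gz > "${id}_R1.fq.gz"  # merge all forward-read files
        cat V*_2.fq.gz > "${id}_R2.fq.gz"  # merge all reverse-read files
    )
done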
- The syntax for d in Raw2/C*/ loops over the subdirectories whose names start with C.
- The parentheses run the inner commands in a subshell, so we don't have to care about returning from cd "$d" (at the expense of a small amount of extra execution time).
- The variable id is assigned the ID extracted from the directory name.
- cat V*_1.fq.gz, for example, is expanded to V350028825_L04_581_1.fq.gz V350028825_L04_582_1.fq.gz V350028825_L04_583_1.fq.gz ... according to the files in the directory, and these are concatenated into ${id}_R1.fq.gz. The same applies to ${id}_R2.fq.gz.
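If you would rather keep the merged files out of the sample folders and skip samples whose R1 and R2 file counts differ, the same idea can be wrapped in a function. This is only a sketch under assumptions: the function name merge_lanes, the separate output directory, and the *_1.fq.gz / *_2.fq.gz patterns are illustrative choices, not something given in the question. The output directory should sit outside the input root, otherwise it would be treated as one more sample folder.

```bash
#!/bin/bash
# Sketch: merge the per-lane files of every sample folder under a root directory.
# merge_lanes, its arguments and the output layout are hypothetical names.
merge_lanes() {
    local in_root=$1 out_dir=$2
    local d id n1 n2 first_r1 first_r2
    mkdir -p "$out_dir" || return 1
    for d in "$in_root"/*/; do
        [ -d "$d" ] || continue                 # the glob matched nothing
        id=${d%/}; id=${id##*/}                 # sample ID from the directory name
        # Count the matches of each pattern; an unmatched glob stays literal,
        # so also check that the first "match" really exists.
        set -- "$d"*_1.fq.gz; n1=$#; first_r1=$1
        set -- "$d"*_2.fq.gz; n2=$#; first_r2=$1
        if [ ! -e "$first_r1" ] || [ ! -e "$first_r2" ] || [ "$n1" -ne "$n2" ]; then
            echo "skipping $id: missing or unpaired R1/R2 files" >&2
            continue
        fi
        # Globs expand in sorted order, so R1 and R2 are merged in the same lane order.
        cat "$d"*_1.fq.gz > "$out_dir/${id}_R1.fq.gz"
        cat "$d"*_2.fq.gz > "$out_dir/${id}_R2.fq.gz"
    done
}
```

With the layout from the question it could be called as merge_lanes /data /path/to/merged, where the second path is a placeholder. Concatenated gzip files are still valid gzip, so you can check each merged file afterwards with gzip -t.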