I am currently working on RNA-Seq data, and I have a directory containing the forward and reverse sequences of a number of samples. I want to run tools such as SortMeRNA, but to do this I need the file names of both the forward and the reverse file, as the data is paired-end.
My directory looks like this:
data/expression/samples/K1-01_sortmerna_trimmomatic_1.fq.gz
data/expression/samples/K1-01_sortmerna_trimmomatic_2.fq.gz
data/expression/samples/K1-02_sortmerna_trimmomatic_1.fq.gz
data/expression/samples/K1-02_sortmerna_trimmomatic_2.fq.gz
data/expression/samples/K1-03_sortmerna_trimmomatic_1.fq.gz
data/expression/samples/K1-03_sortmerna_trimmomatic_2.fq.gz
data/expression/samples/K1-04_sortmerna_trimmomatic_1.fq.gz
data/expression/samples/K1-04_sortmerna_trimmomatic_2.fq.gz
...
data/expression/samples/K1-20_sortmerna_trimmomatic_1.fq.gz
data/expression/samples/K1-20_sortmerna_trimmomatic_2.fq.gz
What I want to do is to select the files in pairs and assign them to a variable that I can then pass on to the selected software without having to make a variable for each file.
I want the code to work by giving me $FWD and $REV as the file names K1-01_sortmerna_trimmomatic_1.fq.gz and K1-01_sortmerna_trimmomatic_2.fq.gz.
On the next iteration over the directory, it should give $FWD and $REV as K1-02_sortmerna_trimmomatic_1.fq.gz and K1-02_sortmerna_trimmomatic_2.fq.gz, respectively.
I have written this code, which is probably not a very efficient way of dealing with the problem (and which does not work).
DATA_LOCATION=data/expression/samples/
cd $DATA_LOCATION
files=(*.fq.gz)
total=${#files[@]}
idx=0
FWD_DONE=false
REV_DONE=false
for file in "${files[@]:idx}"; do
    if [ !$FWD_DONE ]; then
        idx=$(( idx + 1 ))
        FWD=$(basename $file)[$idx]
        echo $FWD
        FWD_DONE=true
        REV_DONE=false
    fi
    if [ !$REV_DONE ] && [ $FWD_DONE ]; then
        idx=$(( idx + 1 ))
        REV=$(basename $file)[$idx]
        echo $REV
        REV_DONE=true
        FWD_DONE=false
    fi
    echo index $idx
done
Unfortunately, this makes the $FWD and $REV variables the same on each pass. My guess is that it has something to do with the for statement not picking up the index increment inside the loop. I am very new to shell scripting, and I have yet to find any other sources that help.
Any help with this would be greatly appreciated! I am more than willing to trash my own code if it means that the whole process can become simpler.
Answer:
Once you have cd'ed to the directory containing the data files, does this code do what you want?
for fwd in *_1.fq.gz; do
    rev=${fwd%_1.fq.gz}_2.fq.gz
    # CODE THAT USES "$fwd" AND "$rev" GOES HERE
done
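If you would rather not cd first, or want to guard against a sample whose reverse file is missing, the same idea can be wrapped in a function. This is only a sketch: the function name for_each_pair is made up for illustration, and the printf line is a placeholder for whatever command you actually want to run on each pair.

```bash
#!/usr/bin/env bash
# Sketch: derive each reverse file name from its forward file name.
# for_each_pair is a hypothetical helper name, not part of any tool.

# for_each_pair DIR: prints "forward<TAB>reverse" for every complete pair in DIR.
for_each_pair() {
    local dir=$1 fwd rev
    for fwd in "$dir"/*_1.fq.gz; do
        # If nothing matches, the glob pattern is left as a literal string; skip it.
        [ -e "$fwd" ] || continue
        # Strip the "_1.fq.gz" suffix and append "_2.fq.gz" to get the mate.
        rev=${fwd%_1.fq.gz}_2.fq.gz
        if [ ! -e "$rev" ]; then
            printf 'warning: no reverse file for %s\n' "$fwd" >&2
            continue
        fi
        # Placeholder: replace this with the real command that uses both files.
        printf '%s\t%s\n' "$fwd" "$rev"
    done
}
```

Calling for_each_pair data/expression/samples would then visit each sample once, with both file names available in the loop body.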
- See Correct Bash and shell script variable capitalization for an explanation of why I replaced FWD and REV with fwd and rev.
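As for why the original loop sets both variables to the same file: with a single argument, [ only checks whether that argument is a non-empty string. [ !$FWD_DONE ] therefore tests the word "!false" (or "!true"), and [ $FWD_DONE ] tests "false" (or "true"); all of these are non-empty, so both if blocks run on every iteration with the same $file. In addition, "${files[@]:idx}" is expanded once before the loop starts, so changing idx inside the loop has no effect on which files are visited. The small demonstration below illustrates the first point; the function name demo_flag_tests is made up for this example.

```bash
#!/usr/bin/env bash
# Demonstration: a one-argument [ ... ] succeeds for any non-empty string.
# demo_flag_tests is a hypothetical name used only for this illustration.

demo_flag_tests() {
    local flag=false
    # The word tested is "!false", a non-empty string, so this branch is taken.
    if [ !$flag ]; then echo "negated: taken"; fi
    # The word tested is "false", also non-empty, so this branch is taken too.
    if [ $flag ]; then echo "plain: taken"; fi
    # Comparing against the literal string gives the intended behaviour.
    if [ "$flag" = true ]; then
        echo "compared: taken"
    else
        echo "compared: skipped"
    fi
}

demo_flag_tests
```

Only the explicit string comparison distinguishes true from false here, which is one reason the suffix-based loop above is simpler: it needs no flags at all.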