How to select two files from a directory, and always every other file?

Time:05-07

I am currently working on RNA-Seq data, and I have a directory containing the forward and reverse sequences of a number of samples. I want to run tools such as SortMeRNA, but to do this I need the file names of both the forward and the reverse sequence, as the data is paired-end.

My directory looks like this:

data/expression/samples/K1-01_sortmerna_trimmomatic_1.fq.gz
data/expression/samples/K1-01_sortmerna_trimmomatic_2.fq.gz
data/expression/samples/K1-02_sortmerna_trimmomatic_1.fq.gz
data/expression/samples/K1-02_sortmerna_trimmomatic_2.fq.gz
data/expression/samples/K1-03_sortmerna_trimmomatic_1.fq.gz
data/expression/samples/K1-03_sortmerna_trimmomatic_2.fq.gz
data/expression/samples/K1-04_sortmerna_trimmomatic_1.fq.gz
data/expression/samples/K1-04_sortmerna_trimmomatic_2.fq.gz
...
data/expression/samples/K1-20_sortmerna_trimmomatic_1.fq.gz
data/expression/samples/K1-20_sortmerna_trimmomatic_2.fq.gz

What I want to do is to select the files in pairs and assign them to a variable that I can then pass on to the selected software without having to make a variable for each file.

I want the code to work by giving me $FWD and $REV as the file names K1-01_sortmerna_trimmomatic_1.fq.gz and K1-01_sortmerna_trimmomatic_2.fq.gz.

On the next iteration over the directory, it should give $FWD and $REV as K1-02_sortmerna_trimmomatic_1.fq.gz and K1-02_sortmerna_trimmomatic_2.fq.gz respectively.

I have written this code, which is probably not a very efficient way of dealing with this problem (and it does not work).

DATA_LOCATION=data/expression/samples/
cd $DATA_LOCATION
files=(*.fq.gz)
total=${#files[@]}
idx=0

FWD_DONE=false
REV_DONE=false

for file in "${files[@]:idx}"; do

    if [ !$FWD_DONE ]; then
        idx=$(( idx + 1 ))
        FWD=$(basename $file)[$idx]
        echo $FWD
        FWD_DONE=true
        REV_DONE=false
    fi

    if [ !$REV_DONE ] && [ $FWD_DONE ]; then
        idx=$(( idx + 1 ))
        REV=$(basename $file)[$idx]
        echo $REV
        REV_DONE=true
        FWD_DONE=false
    fi

    echo index $idx
done

Unfortunately, this makes the $FWD and $REV variables the same on each pass. My guess is that it has something to do with the for statement not picking up the index increment made inside the loop. I am very new to shell scripting, and I have yet to find any other sources that help.

Any help with this would be greatly appreciated! I am more than willing to trash my own code if it means that the whole process can become simpler.

Answer:

Once you have changed (cd) into the directory containing the data files, does this code do what you want?

for fwd in *_1.fq.gz; do
    rev=${fwd%_1.fq.gz}_2.fq.gz
    # CODE THAT USES "$fwd" AND "$rev" GOES HERE
done
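
To make the idea above concrete, here is a minimal sketch that runs the same loop against stand-in data. The temporary directory, the empty sample files, and the `pairs` variable are assumptions added for illustration only; they are not part of the original question or answer. The sketch also guards against two cases the short loop does not handle: a glob that matches nothing, and a forward file with no matching reverse file.

```shell
# Stand-in data: empty files in a temporary directory (hypothetical, for illustration only).
demo_dir=$(mktemp -d)
for sample in K1-01 K1-02; do
    touch "$demo_dir/${sample}_sortmerna_trimmomatic_1.fq.gz" \
          "$demo_dir/${sample}_sortmerna_trimmomatic_2.fq.gz"
done
touch "$demo_dir/K1-03_sortmerna_trimmomatic_1.fq.gz"   # forward read without a mate

pairs=""                                   # collects the pairs so they can be inspected afterwards
for fwd in "$demo_dir"/*_1.fq.gz; do
    [ -e "$fwd" ] || continue              # glob matched nothing: skip the literal pattern
    rev=${fwd%_1.fq.gz}_2.fq.gz            # swap the "_1.fq.gz" suffix for "_2.fq.gz"
    if [ ! -f "$rev" ]; then
        echo "skipping ${fwd##*/}: no reverse file" >&2
        continue
    fi
    FWD=${fwd##*/}                         # file name without the directory part
    REV=${rev##*/}
    echo "$FWD $REV"                       # replace with the real tool invocation
    pairs="$pairs$FWD,$REV;"
done
rm -rf "$demo_dir"
```

Looping over only the forward files and deriving the reverse name by suffix substitution avoids index bookkeeping altogether. As for why the original attempt gives the same name twice: `[ !$FWD_DONE ]` tests whether the string `!false` (or `!true`) is non-empty, which is always the case, so both `if` blocks run on every iteration with the same `$file`. The list in the `for` statement is also expanded only once, before the loop starts, so changing `idx` inside the loop has no effect on it.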