Concatenating many text files based on intermediate directory names

I would like to concatenate the contents of a set of files into a single new file. Each file can be identified by either a shared file name or a shared folder name. I see many examples of performing this task when the files are in the same directory, but not when they are spread across subdirectories.

Input

my_project
|-- housecat
|   |-- 1234
|   |   `-- 1234_contigs.fasta
|   `-- 1290
|       `-- 1290_contigs.fasta
|-- jaguar
|   |-- 1234
|   |   `-- 1234_contigs.fasta
|   |-- 4567
|   |   `-- 4567_contigs.fasta
|   `-- 9876
|       `-- 9876_contigs.fasta
`-- puma
    |-- 0987
    |   `-- 0987_contigs.fasta
    |-- 1029
    |   `-- 1029_contigs.fasta
    |-- 1234
    |   `-- 1234_contigs.fasta
    `-- 4567
        `-- 4567_contigs.fasta

An example of creating one of the output files:

mkdir -p concats/1234
cat puma/1234/1234_contigs.fasta jaguar/1234/1234_contigs.fasta housecat/1234/1234_contigs.fasta >> concats/1234/1234_concat.fasta
more concats/1234/1234_concat.fasta

minimally reproducible contents of housecat 1234
minimally reproducible contents of jaguar 1234
minimally reproducible contents of puma 1234

I would like this action to be performed for each of these groups of files, even if there is only one such file (e.g. 1029 and 1290). I see that I could copy each of the files into a single directory and concatenate from there, but I would like to avoid that. Is this possible, or should I just continue along the path of renaming the files, placing them into the same folder, and combining them there?

DESIRED OUTPUT (not showing the contents of all subdirectories)

my_project
|-- concats
|   |-- 0987/0987_concat.fasta
|   |-- 1029/1029_concat.fasta
|   |-- 1234/1234_concat.fasta
|   |-- 1290/1290_concat.fasta
|   `-- 4567/4567_concat.fasta
|-- jaguar
|-- housecat
`-- puma

What I have so far:

# pattern match needed because many non-target files are present;
# this can move into the script proper later
FILENAME=$(find . -print | grep -E '[0-9]{3,4}_contigs\.fasta')

for i in $FILENAME; do
  FILE=$(basename "$i" | sed 's/_contigs//g')   # e.g. 1234.fasta
  DIR=concats/${FILE%.*}                        # e.g. concats/1234
  ORGANISM=$(echo "$i" | cut -d/ -f 2)          # e.g. puma
  mkdir -p -- "$DIR"
  cp "$i" "${DIR}/${ORGANISM}_${FILE}"          # rename the files here
done

for d in concats/*/ ; do
    LOCI=$(echo "$d" | cut -d/ -f 2)
    echo "$d"* > "${d}${LOCI}_concat.fasta"   # placeholder: writes the file names, not their contents
done

I was wondering if, before I run the second loop, I could use a cat-like command to combine these files. Or do I need to move them to the destination and then combine them? Mostly I am just curious whether I can avoid copying the files.

CodePudding user response:

Unless I'm missing something in the requirement, there should be no need to first copy/rename the files before concatenating them into their final destination (and then deleting the copied/renamed files).

Finding the list of *_contigs.fasta files and sorting by the (numeric) subdirectory and then by the parent directory:

$ cd my_project
$ find . -type f -name "*_contigs.fasta" | sort -t'/' -k3,3n -k2,2
./puma/0987/0987_contigs.fasta
./puma/1029/1029_contigs.fasta
./housecat/1234/1234_contigs.fasta
./jaguar/1234/1234_contigs.fasta
./puma/1234/1234_contigs.fasta
./housecat/1290/1290_contigs.fasta
./jaguar/4567/4567_contigs.fasta
./puma/4567/4567_contigs.fasta
./jaguar/9876/9876_contigs.fasta
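
For reference, with / as the delimiter the fields of each path number as follows, which is why the numeric subdirectory is sort key 3 and the parent directory is sort key 2:

$ echo './puma/0987/0987_contigs.fasta' | awk -F'/' '{ for (i = 1; i <= NF; i++) print i, $i }'
1 .
2 puma
3 0987
4 0987_contigs.fasta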

One bash idea:

cd my_project
'rm' -rf concats
mkdir concats

while read -r fname
do
    # pull the numeric subdirectory out of ./<organism>/<num>/<file>
    IFS='/' read -r _ _ num _ <<< "${fname}"

    newdir="concats/${num}"
    newfile="${newdir}/${num}_concat.fasta"

    [[ ! -d "${newdir}" ]] && mkdir -p "${newdir}"

    cat "${fname}" >> "${newfile}"

done < <(find . -type f -name "*_contigs.fasta" | sort -t'/' -k3,3n -k2,2)

The result:

$ find concats -type f -name "*.fasta"
concats/0987/0987_concat.fasta
concats/1029/1029_concat.fasta
concats/1234/1234_concat.fasta
concats/1290/1290_concat.fasta
concats/4567/4567_concat.fasta
concats/9876/9876_concat.fasta
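
A quick spot check on the 1234 group, which has three source files (the contents shown assume the minimal sample data from the question, concatenated in sorted parent-directory order):

$ cat concats/1234/1234_concat.fasta
minimally reproducible contents of housecat 1234
minimally reproducible contents of jaguar 1234
minimally reproducible contents of puma 1234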

For a large number of files this bash solution (above) may be slow, due to a) the number of iterations through the loop and b) the number of times each *_concat.fasta file has to be opened/closed (one open/close per pass through the loop).

One awk idea that should be faster:

cd my_project
'rm' -rf concats
mkdir concats

mapfile -t flist < <(find . -type f -name "*_contigs.fasta" | sort -t'/' -k3,3n -k2,2)

awk '
FNR==1 { split(FILENAME,a,"/")          # FILENAME looks like ./<organism>/<num>/<file>
         newnum=a[3]

         if (newnum != oldnum) {        # new group: close the previous output file
            close(outfile)
            system("mkdir -p concats/" newnum )
         }

         oldnum=newnum
         outfile="concats/" newnum "/" newnum "_concat.fasta"
       }
       { print > outfile }              # handle stays open across files of the same group
' "${flist[@]}"

The result:

$ find concats -type f -name "*.fasta"
concats/0987/0987_concat.fasta
concats/1029/1029_concat.fasta
concats/1234/1234_concat.fasta
concats/1290/1290_concat.fasta
concats/4567/4567_concat.fasta
concats/9876/9876_concat.fasta
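
A side note on why the close(outfile) call matters: awk keeps a file opened with > open for the rest of the run, so consecutive input files in the same group stream into one already-open descriptor, and closing each finished group keeps the number of simultaneously open files bounded. A minimal illustration (open_once.txt is a hypothetical name, not from the answer):

$ seq 4 | awk '{ print > "open_once.txt" }'   # ">" truncates on first open, then the handle is reused
$ wc -l open_once.txt
4 open_once.txt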

CodePudding user response:

A solution in plain bash would be:

cd /path/to/my_project || exit
for src in */*/*_contigs.fasta; do
    IFS=/ read -r _ dir file <<< "$src"
    mkdir -p "concats/$dir"
    cat "$src" >> "concats/$dir/${file/contigs/concat}"
done
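
One caveat with the loop above, an edge case the answer does not cover: if the glob matches nothing, bash passes the literal pattern */*/*_contigs.fasta into the loop body. Setting nullglob first makes the loop a no-op instead; a minimal hardened sketch:

shopt -s nullglob    # unmatched globs expand to nothing rather than to themselves
cd /path/to/my_project || exit
for src in */*/*_contigs.fasta; do
    IFS=/ read -r _ dir file <<< "$src"
    mkdir -p "concats/$dir"
    cat "$src" >> "concats/$dir/${file/contigs/concat}"
done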