I would like to simply concatenate the contents of a set of files into a single new file. Each file can be identified by a shared file name or folder name. I see many examples of performing this task when the files are in the same directory, but not when they are spread across subdirectories.
Input
my_project
-- housecat
| -- 1234
| | -- 1234_contigs.fasta
| -- 1290
| | -- 1290_contigs.fasta
-- jaguar
| -- 1234
| | -- 1234_contigs.fasta
| -- 4567
| | -- 4567_contigs.fasta
| -- 9876
| | -- 9876_contigs.fasta
-- puma
| -- 0987
| | -- 0987_contigs.fasta
| -- 1029
| | -- 1029_contigs.fasta
| -- 1234
| | -- 1234_contigs.fasta
| -- 4567
| | -- 4567_contigs.fasta
An example of producing one of the output files:
mkdir -p concats/1234
cat puma/1234/1234_contigs.fasta jaguar/1234/1234_contigs.fasta housecat/1234/1234_contigs.fasta >> concats/1234/1234_concat.fasta
more concats/1234/1234_concat.fasta
minimally reproducible contents of housecat 1234
minimally reproducible contents of jaguar 1234
minimally reproducible contents of puma 1234
I would like this action to be performed for each of these types of files, even if there is only one such file (e.g. 1029 & 1290.fasta). I see that I can copy each of the files into a directory and concatenate from there, but I would like to avoid that. Is this possible, or should I just continue along the path of renaming the files, placing them into the same folder, and combining them there?
DESIRED OUTPUT (not showing contents of all subdirectories)
my_project
-- concats
| -- 0987/0987_concat.fasta
| -- 1029/1029_concat.fasta
| -- 1234/1234_concat.fasta
| -- 4567/4567_concat.fasta
| -- 1290/1290_concat.fasta
-- jaguar
-- housecat
-- puma
What I have so far:
FILENAME=$(find . -print | grep -E '[0-9]{3,4}_contigs\.fasta') # filter needed because many non-target files are present; I can move this into the script later and don't want too much focus on it.
for i in $FILENAME; do
    FILE=$(basename "$i" | sed 's/_contigs//g')
    DIR=concats/${FILE%.*}
    ORGANISM=$(echo "$i" | cut -d/ -f 2)
    mkdir -p -- "$DIR"
    cp "$i" "${DIR}/${ORGANISM}_${FILE}" # rename the files here
done
for d in concats/*/ ; do
    LOCI=$(echo "$d" | cut -d/ -f 2)
    echo "$d"* > "${d}${LOCI}_concat.fasta"
done
I was wondering if, before I run the second loop, I could use a cat-like command to combine these files. Or do I need to move them to the destination and then combine them? Mostly I'm just curious whether I can avoid copying the files.
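If it helps, here is a minimal sketch of the second loop using cat instead of echo, run against a throwaway tree whose copies mimic what the first loop produces (the sample file names and contents below are hypothetical):

```shell
# Throwaway tree mimicking the first loop's output (hypothetical data).
cd "$(mktemp -d)"
mkdir -p concats/1234
printf '>puma\nACGT\n'   > concats/1234/puma_1234.fasta
printf '>jaguar\nGGCC\n' > concats/1234/jaguar_1234.fasta

# Second loop with cat: the files' *contents* (not their names) go into
# the concat file; the *_"$LOCI".fasta pattern avoids re-reading the
# concat file itself on a re-run.
for d in concats/*/ ; do
    LOCI=$(basename "$d")
    cat "$d"*_"${LOCI}".fasta > "${d}${LOCI}_concat.fasta"
done
```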
CodePudding user response:
Unless I'm missing something in the requirement, there should be no need to first copy/rename the files before concatenating them into their final destination (and then deleting the copied/renamed files).
Finding the list of *_contigs.fasta files and sorting by the (numeric) subdirectory and then by the parent directory:
$ cd my_project
$ find . -type f -name "*_contigs.fasta" | sort -t'/' -k4,4n -k1,1
./puma/0987/0987_contigs.fasta
./puma/1029/1029_contigs.fasta
./housecat/1234/1234_contigs.fasta
./jaguar/1234/1234_contigs.fasta
./puma/1234/1234_contigs.fasta
./housecat/1290/1290_contigs.fasta
./jaguar/4567/4567_contigs.fasta
./puma/4567/4567_contigs.fasta
./jaguar/9876/9876_contigs.fasta
One bash idea:
cd my_project
'rm' -rf concats
mkdir concats
while read -r fname
do
IFS='/' read -r _ _ num _ <<< "${fname}"
newdir="concats/${num}"
newfile="${newdir}/${num}_concat.fasta"
[[ ! -d "${newdir}" ]] && mkdir -p "${newdir}"
cat "${fname}" >> "${newfile}"
done < <(find . -type f -name "*_contigs.fasta" | sort -t'/' -k4,4n -k1,1)
The result:
$ find concats -type f -name "*.fasta"
concats/0987/0987_concat.fasta
concats/1029/1029_concat.fasta
concats/1234/1234_concat.fasta
concats/1290/1290_concat.fasta
concats/4567/4567_concat.fasta
concats/9876/9876_concat.fasta
For a large number of files this bash solution (above) may be slow, simply due to a) the number of iterations through the loop and b) the number of times each *_concat.fasta file has to be opened/closed (one such file is opened/closed on each pass through the loop).
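One pure-bash way to reduce that open/close cost (a sketch, not part of the original answer): since the sorted find output groups files by their numeric directory, each concat file can be opened once per group via exec on a spare file descriptor instead of being re-opened for every append. Demonstrated here on a throwaway tree with hypothetical sample data:

```shell
# Throwaway my_project-style tree (hypothetical data).
cd "$(mktemp -d)"
mkdir -p puma/1234 jaguar/1234
printf '>p\nAAAA\n' > puma/1234/1234_contigs.fasta
printf '>j\nTTTT\n' > jaguar/1234/1234_contigs.fasta

mkdir concats
oldnum=
while read -r fname; do
    IFS='/' read -r _ _ num _ <<< "$fname"
    if [[ "$num" != "$oldnum" ]]; then
        mkdir -p "concats/$num"
        exec 3> "concats/$num/${num}_concat.fasta"   # open once per group
        oldnum=$num
    fi
    cat "$fname" >&3                                 # append via the open fd
done < <(find . -type f -name "*_contigs.fasta" | sort -t'/' -k4,4n -k1,1)
exec 3>&-                                            # close the last group's file
```

This still spawns one cat per input file, but each *_concat.fasta is opened exactly once.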
One awk idea that should be faster:
cd my_project
'rm' -rf concats
mkdir concats
mapfile -t flist < <(find . -type f -name "*_contigs.fasta" | sort -t'/' -k4,4n -k1,1)
awk '
FNR==1 { split(FILENAME,a,"/")
newnum=a[3]
if (newnum != oldnum) {
close(outfile)
system("mkdir -p concats/" newnum )
}
oldnum=newnum
outfile="concats/" newnum "/" newnum "_concat.fasta"
}
{ print > outfile }
' "${flist[@]}"
The result:
$ find concats -type f -name "*.fasta"
concats/0987/0987_concat.fasta
concats/1029/1029_concat.fasta
concats/1234/1234_concat.fasta
concats/1290/1290_concat.fasta
concats/4567/4567_concat.fasta
concats/9876/9876_concat.fasta
CodePudding user response:
A solution in plain bash would be:
cd /path/to/my_project || exit
for src in */*/*_contigs.fasta; do
IFS=/ read -r _ dir file <<< "$src"
mkdir -p "concats/$dir"
cat "$src" >> "concats/$dir/${file/contigs/concat}"
done
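A quick sanity check of the loop above on a throwaway tree (hypothetical sample data mirroring the my_project layout):

```shell
# Build a small throwaway my_project-style tree (hypothetical data).
cd "$(mktemp -d)"
mkdir -p puma/1234 jaguar/1234 housecat/1290
printf '>p\nAC\n' > puma/1234/1234_contigs.fasta
printf '>j\nGT\n' > jaguar/1234/1234_contigs.fasta
printf '>h\nCC\n' > housecat/1290/1290_contigs.fasta

# The answer's loop, unchanged: the glob walks organism/num/file,
# and ${file/contigs/concat} renames 1234_contigs.fasta -> 1234_concat.fasta.
for src in */*/*_contigs.fasta; do
    IFS=/ read -r _ dir file <<< "$src"
    mkdir -p "concats/$dir"
    cat "$src" >> "concats/$dir/${file/contigs/concat}"
done
```

Note the loop appends (>>), so it assumes concats/ does not already contain results from a previous run.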