Hello I have 50 tables in tsv format all with the same column names in the same order:
e.g.
cat sample1.tsv | head -4
name | coverage | ID | bases | reads | length |
---|---|---|---|---|---|
vir1 | 0.535 | 3rf | 1252 | 53 | 11424 |
vir2 | 0.124 | 2ds | 7534 | 152 | 63221 |
vir3 | 0.643 | 6tf | 3341 | 73 | 21142 |
I want to elaborate a table from the "reads" column (5th column) from the 50 tables. The name column have the same values and same order along the 50 tables
Desired output:
cat reads_table.tsv | head -4
names | sample1 | sample2 | sample3 | sample4 | sample5 | sample50 |
---|---|---|---|---|---|---|
vir1 | 53 | 742 | 42 | 242 | 42 | 342 |
vir2 | 152 | 212 | 512 | 21 | 74 | 41 |
vir3 | 73 | 13 | 172 | 42 | 142 | 123 |
I was thinking on doing this by saving the reads column (the 5th column in all tables) to an array and using paste
bash function to paste the columns and save them to a new empty file called "reads_table.tsv" but I don't know how to do this on bash.
This is what I tried in a first instance:
for i in *.tsv
do
reads=$(awk '{print $5}' $i)
sed -i 's/$/\t$reads/' $i >> reads_table.tsv
done
CodePudding user response:
Created some input files to match OP's expected output:
$ head sample*.tsv
==> sample1.tsv <==
name coverage ID bases reads length
vir1 0.535 3rf 1252 53 11424
vir2 0.124 2ds 7534 152 63221
vir3 0.643 6tf 3341 73 21142
==> sample2.tsv <==
name coverage ID bases reads length
vir1 0.535 3rf 1252 742 11424
vir2 0.124 2ds 7534 212 63221
vir3 0.643 6tf 3341 13 21142
==> sample3.tsv <==
name coverage ID bases reads length
vir1 0.535 3rf 1252 42 11424
vir2 0.124 2ds 7534 512 63221
vir3 0.643 6tf 3341 172 21142
==> sample4.tsv <==
name coverage ID bases reads length
vir1 0.535 3rf 1252 242 11424
vir2 0.124 2ds 7534 21 63221
vir3 0.643 6tf 3341 42 21142
==> sample5.tsv <==
name coverage ID bases reads length
vir1 0.535 3rf 1252 42 11424
vir2 0.124 2ds 7534 74 63221
vir3 0.643 6tf 3341 142 21142
==> sample50.tsv <==
name coverage ID bases reads length
vir1 0.535 3rf 1252 342 11424
vir2 0.124 2ds 7534 41 63221
vir3 0.643 6tf 3341 123 21142
One awk
idea:
awk '
BEGIN { FS=OFS="\t" }
FNR==NR { lines[FNR]=$1 } # save 1st column ("name") of 1st file
FNR==1 { split(FILENAME,a,".") # 1st row of each file split FILENAME
lines[FNR]=lines[FNR] OFS a[1] # save FILENAME (sans ".tsv")
next
}
{ lines[FNR]=lines[FNR] OFS $5 } # rest of rows in file append te 5th column to our output lines
END { for (i=1;i<=FNR;i ) # loop through rows and ...
print lines[i] # print the associated line to stdout
}
' $(find . -name "sample*.tsv" -printf "%f\n" | sort -V ) > reads_table.tsv
NOTES:
- the
find/sort
is required to insure the files are fed toawk
inV
ersion sort order (eg,sample3.tsv
comes beforesample21.tsv
) - the
printf %f\n
removes the leading.\
from the filename (otherwise we could remove in theawk
script) - the
-V
option tellssort
to run aV
ersion sort
This generates:
name sample1 sample2 sample3 sample4 sample5 sample50
vir1 53 742 42 242 42 342
vir2 152 212 512 21 74 41
vir3 73 13 172 42 142 123
CodePudding user response:
This is not the most efficient method but it should get the job done
#!/bin/bash
reads_tab=reads_table.tsv
header="names"
for file in *.tsv
do
# Do not read from target file (if it exists)
[[ $file = "$reads_tab" ]] && continue
# Construct the "names" column at once, on the first iteration
[[ -f names.temp ]] || tail -n 2 "$file" | cut -f1 > names.temp
# Build header row from filenames
header =$'\t'"${file%.tsv}"
# Cut the fifth field and write to temporary file for each tsv file
tail -n 2 "$file" | cut -f5 > "$file".temp
done
{ printf '%s\n' "$header"; paste names.temp ./*.tsv.temp; } > "$reads_tab"
rm names.temp ./*.tsv.temp
An awk
solution may be more efficient, but consume more memory:
awk '
BEGIN { FS="\t"; ncol=1; cell["1,1"]="names" }
FNR==1 { cell["1," ncol]=FILENAME; next }
FNR==NR { cell[NR",1"]=$1 }
{ cell[FNR","ncol]=$5 }
END { for (row=1; row<=FNR; row)
for (col=1; col<=ncol; col)
printf "%s%s", cell[row","col], (col==ncol ? "\n" : "\t")
}
' *.tsv