how to make a table from the columns of other tables in bash?-CodePudding

Hello I have 50 tables in tsv format all with the same column names in the same order:

e.g.

cat sample1.tsv | head -4

name	coverage	ID	bases	reads	length
vir1	0.535	3rf	1252	53	11424
vir2	0.124	2ds	7534	152	63221
vir3	0.643	6tf	3341	73	21142

I want to elaborate a table from the "reads" column (5th column) from the 50 tables. The name column have the same values and same order along the 50 tables

Desired output:

cat reads_table.tsv | head -4

names	sample1	sample2	sample3	sample4	sample5	sample50
vir1	53	742	42	242	42	342
vir2	152	212	512	21	74	41
vir3	73	13	172	42	142	123

I was thinking on doing this by saving the reads column (the 5th column in all tables) to an array and using paste bash function to paste the columns and save them to a new empty file called "reads_table.tsv" but I don't know how to do this on bash.

This is what I tried in a first instance:

for i in *.tsv 
do 
reads=$(awk '{print $5}' $i)
sed -i 's/$/\t$reads/' $i >> reads_table.tsv
done

CodePudding user response：

Created some input files to match OP's expected output:

$ head sample*.tsv
==> sample1.tsv <==
name    coverage        ID      bases   reads   length
vir1    0.535   3rf     1252    53      11424
vir2    0.124   2ds     7534    152     63221
vir3    0.643   6tf     3341    73      21142

==> sample2.tsv <==
name    coverage        ID      bases   reads   length
vir1    0.535   3rf     1252    742     11424
vir2    0.124   2ds     7534    212     63221
vir3    0.643   6tf     3341    13      21142

==> sample3.tsv <==
name    coverage        ID      bases   reads   length
vir1    0.535   3rf     1252    42      11424
vir2    0.124   2ds     7534    512     63221
vir3    0.643   6tf     3341    172     21142

==> sample4.tsv <==
name    coverage        ID      bases   reads   length
vir1    0.535   3rf     1252    242     11424
vir2    0.124   2ds     7534    21      63221
vir3    0.643   6tf     3341    42      21142

==> sample5.tsv <==
name    coverage        ID      bases   reads   length
vir1    0.535   3rf     1252    42      11424
vir2    0.124   2ds     7534    74      63221
vir3    0.643   6tf     3341    142     21142

==> sample50.tsv <==
name    coverage        ID      bases   reads   length
vir1    0.535   3rf     1252    342     11424
vir2    0.124   2ds     7534    41      63221
vir3    0.643   6tf     3341    123     21142

One awk idea:

awk '
BEGIN   { FS=OFS="\t" }
FNR==NR { lines[FNR]=$1 }                        # save 1st column ("name") of 1st file
FNR==1  { split(FILENAME,a,".")                  # 1st row of each file split FILENAME
          lines[FNR]=lines[FNR] OFS a[1]         # save FILENAME (sans ".tsv")
          next
        }
        { lines[FNR]=lines[FNR] OFS $5 }         # rest of rows in file append te 5th column to our output lines
END     { for (i=1;i<=FNR;i  )                   # loop through rows and ...
              print lines[i]                     # print the associated line to stdout
        }
' $(find . -name "sample*.tsv" -printf "%f\n" | sort -V ) > reads_table.tsv

NOTES:

the find/sort is required to insure the files are fed to awk in Version sort order (eg, sample3.tsv comes before sample21.tsv)
the printf %f\n removes the leading .\ from the filename (otherwise we could remove in the awk script)
the -V option tells sort to run a Version sort

This generates:

name    sample1 sample2 sample3 sample4 sample5 sample50
vir1    53      742     42      242     42      342
vir2    152     212     512     21      74      41
vir3    73      13      172     42      142     123

CodePudding user response：

This is not the most efficient method but it should get the job done

#!/bin/bash

reads_tab=reads_table.tsv

header="names"
for file in *.tsv
do
    # Do not read from target file (if it exists)
    [[ $file = "$reads_tab" ]] && continue
    # Construct the "names" column at once, on the first iteration
    [[ -f names.temp ]] || tail -n 2 "$file" | cut -f1 > names.temp
    # Build header row from filenames
    header =$'\t'"${file%.tsv}"
    # Cut the fifth field and write to temporary file for each tsv file
    tail -n 2 "$file" | cut -f5 > "$file".temp
done
{ printf '%s\n' "$header"; paste names.temp ./*.tsv.temp; } > "$reads_tab"
rm names.temp ./*.tsv.temp

An awk solution may be more efficient, but consume more memory:

awk '
    BEGIN   { FS="\t"; ncol=1; cell["1,1"]="names" }
    FNR==1  { cell["1,"  ncol]=FILENAME; next }
    FNR==NR { cell[NR",1"]=$1 }
            { cell[FNR","ncol]=$5 }
    END { for (row=1; row<=FNR;   row)
              for (col=1; col<=ncol;   col)
                  printf "%s%s", cell[row","col], (col==ncol ? "\n" : "\t")
    }
' *.tsv