Home > Software engineering >  how to make a table from the columns of other tables in bash?
how to make a table from the columns of other tables in bash?

Time:11-01

Hello I have 50 tables in tsv format all with the same column names in the same order:

e.g.

cat sample1.tsv | head -4 
name coverage ID bases reads length
vir1 0.535 3rf 1252 53 11424
vir2 0.124 2ds 7534 152 63221
vir3 0.643 6tf 3341 73 21142

I want to elaborate a table from the "reads" column (5th column) from the 50 tables. The name column have the same values and same order along the 50 tables

Desired output:

cat reads_table.tsv | head -4 
names sample1 sample2 sample3 sample4 sample5 sample50
vir1 53 742 42 242 42 342
vir2 152 212 512 21 74 41
vir3 73 13 172 42 142 123

I was thinking on doing this by saving the reads column (the 5th column in all tables) to an array and using paste bash function to paste the columns and save them to a new empty file called "reads_table.tsv" but I don't know how to do this on bash.

This is what I tried in a first instance:

for i in *.tsv 
do 
reads=$(awk '{print $5}' $i)
sed -i 's/$/\t$reads/' $i >> reads_table.tsv
done 

CodePudding user response:

Created some input files to match OP's expected output:

$ head sample*.tsv
==> sample1.tsv <==
name    coverage        ID      bases   reads   length
vir1    0.535   3rf     1252    53      11424
vir2    0.124   2ds     7534    152     63221
vir3    0.643   6tf     3341    73      21142

==> sample2.tsv <==
name    coverage        ID      bases   reads   length
vir1    0.535   3rf     1252    742     11424
vir2    0.124   2ds     7534    212     63221
vir3    0.643   6tf     3341    13      21142

==> sample3.tsv <==
name    coverage        ID      bases   reads   length
vir1    0.535   3rf     1252    42      11424
vir2    0.124   2ds     7534    512     63221
vir3    0.643   6tf     3341    172     21142

==> sample4.tsv <==
name    coverage        ID      bases   reads   length
vir1    0.535   3rf     1252    242     11424
vir2    0.124   2ds     7534    21      63221
vir3    0.643   6tf     3341    42      21142

==> sample5.tsv <==
name    coverage        ID      bases   reads   length
vir1    0.535   3rf     1252    42      11424
vir2    0.124   2ds     7534    74      63221
vir3    0.643   6tf     3341    142     21142

==> sample50.tsv <==
name    coverage        ID      bases   reads   length
vir1    0.535   3rf     1252    342     11424
vir2    0.124   2ds     7534    41      63221
vir3    0.643   6tf     3341    123     21142

One awk idea:

awk '
BEGIN   { FS=OFS="\t" }
FNR==NR { lines[FNR]=$1 }                        # save 1st column ("name") of 1st file
FNR==1  { split(FILENAME,a,".")                  # 1st row of each file split FILENAME
          lines[FNR]=lines[FNR] OFS a[1]         # save FILENAME (sans ".tsv")
          next
        }
        { lines[FNR]=lines[FNR] OFS $5 }         # rest of rows in file append te 5th column to our output lines
END     { for (i=1;i<=FNR;i  )                   # loop through rows and ...
              print lines[i]                     # print the associated line to stdout
        }
' $(find . -name "sample*.tsv" -printf "%f\n" | sort -V ) > reads_table.tsv

NOTES:

  • the find/sort is required to insure the files are fed to awk in Version sort order (eg, sample3.tsv comes before sample21.tsv)
  • the printf %f\n removes the leading .\ from the filename (otherwise we could remove in the awk script)
  • the -V option tells sort to run a Version sort

This generates:

name    sample1 sample2 sample3 sample4 sample5 sample50
vir1    53      742     42      242     42      342
vir2    152     212     512     21      74      41
vir3    73      13      172     42      142     123

CodePudding user response:

This is not the most efficient method but it should get the job done

#!/bin/bash

reads_tab=reads_table.tsv

header="names"
for file in *.tsv
do
    # Do not read from target file (if it exists)
    [[ $file = "$reads_tab" ]] && continue
    # Construct the "names" column at once, on the first iteration
    [[ -f names.temp ]] || tail -n 2 "$file" | cut -f1 > names.temp
    # Build header row from filenames
    header =$'\t'"${file%.tsv}"
    # Cut the fifth field and write to temporary file for each tsv file
    tail -n 2 "$file" | cut -f5 > "$file".temp
done
{ printf '%s\n' "$header"; paste names.temp ./*.tsv.temp; } > "$reads_tab"
rm names.temp ./*.tsv.temp

An awk solution may be more efficient, but consume more memory:

awk '
    BEGIN   { FS="\t"; ncol=1; cell["1,1"]="names" }
    FNR==1  { cell["1,"  ncol]=FILENAME; next }
    FNR==NR { cell[NR",1"]=$1 }
            { cell[FNR","ncol]=$5 }
    END { for (row=1; row<=FNR;   row)
              for (col=1; col<=ncol;   col)
                  printf "%s%s", cell[row","col], (col==ncol ? "\n" : "\t")
    }
' *.tsv
  • Related