Home > Software engineering >  Awk in do-while loop, read files with multiple columns
Awk in do-while loop, read files with multiple columns

Time:01-01

I'm writing a script which reads variables from the columns of one file, pattern matches with a column of another file, and prints whole lines from this other file to an output file. I think I've come close to a solution, but only column names in the file outputs are produced (NR==1 is written, nothing else).

while read -r value1 value2
do
        echo "$value2"
        awk -F "\t" 'NR==1; $2 == "$value1" { print $0 }' data.txt > "${value2}".out
done < "var_table.txt" 

Here's an example head of my data:

data.txt

V1  V2  V3  V4
1   dat-131.2   2071.49 3.11E-302
2   dat-219.1   744.48  3.11E-302
3   dat-120.2   13.29   3.11E-302
4   dat-577.1   1.71    3.11E-302
5   dat-535.2   16.07   3.11E-302
6   dat-535.2   12.99   3.11E-302
7   dat-535.2   2147.31 3.11E-302
8   dat-535.2   61.21   3.11E-302
…
    

var_table.txt

dat-501.16  VarA
dat-151.5   VarB
dat-518.9   VarC
dat-535.2   VarD
dat-49.0    VarE

The expected output would be to create a tsv file called VarD.out ("${value2}.out), populated with a header (NR==1), plus whole rows from data.txt corresponding to the value "dat-535.2" in column 2 (print $0 if $2 == "$value1").

VarD.out _|>

V1  V2  V3  V4
5   dat-535.2   16.07   3.11E-302
6   dat-535.2   12.99   3.11E-302
7   dat-535.2   2147.31 3.11E-302
8   dat-535.2   61.21   3.11E-302

I hope this makes sense and is somewhat replicable. I've looked at other answers but am not able to understand how to incorporate multiple columns of data. My guess is the problem is variable assignment because of multiple files, as $2 and $0 correspond to data from data.txt, and value1 and value2 refer to columns of var_table.txt? Any help is welcome.

Thanks

CodePudding user response:

Fixing OP's code to properly pass a bash variable to awk:

while read -r value1 value2
do
        echo "$value2"
        awk -F "\t" -v value="$value1" 'NR==1 || $2==value' data.txt > "${value2}".out
done < "var_table.txt"

Where:

  • -v value="$value1" - assigns the value of the bash variable value1 to the awk varaible value
  • NR==1 || $2==value - if first record of input file, or 2nd field equals the awk variable value (aka bash variable value1) then write to output file

This generates:

$ head Var*out
==> VarA.out <==
V1      V2      V3      V4

==> VarB.out <==
V1      V2      V3      V4

==> VarC.out <==
V1      V2      V3      V4

==> VarD.out <==
V1      V2      V3      V4
5       dat-535.2       16.07   3.11E-302
6       dat-535.2       12.99   3.11E-302
7       dat-535.2       2147.31 3.11E-302
8       dat-535.2       61.21   3.11E-302

==> VarE.out <==
V1      V2      V3      V4

A more efficient method where we pull the loop into awk:

awk '
BEGIN      { FS=OFS="\t" }
FNR==NR    { vars[$1]=$2; next }                 # 1st file: save contents to array vars[] using 1st field as the index
FNR==1     { for (i in vars)                     # 2nd file: write header row to each of our output files (ie, for each index from the vars[] array)
                 print $0 > (vars[i] ".out")
             next
           }
$2 in vars { print $0 > (vars[$2] ".out") }      # 2nd file: if 2nd field is an index in array vars[] then print the current line to the associated output file
' var_table.txt data.txt

NOTES:

  • this approach will keep open a separate output file descriptor for each row in var_table.txt
  • for a sufficiently 'large' number of rows this could cause some versions of awk to abort due to running out of file descriptors
  • there are a few ways to address this issue (running out of file descriptors) but will require more coding and a better understanding of the input data, eg, will all rows in data.txt (with the same value in the 2nd field) always reside on consecutive lines in data.txt? or could they be spread throughout the file?

This also generates:

==> VarA.out <==
V1      V2      V3      V4

==> VarB.out <==
V1      V2      V3      V4

==> VarC.out <==
V1      V2      V3      V4

==> VarD.out <==
V1      V2      V3      V4
5       dat-535.2       16.07   3.11E-302
6       dat-535.2       12.99   3.11E-302
7       dat-535.2       2147.31 3.11E-302
8       dat-535.2       61.21   3.11E-302

==> VarE.out <==
V1      V2      V3      V4
  • Related