Home > Enterprise >  Divide largest value by second largest value
Divide largest value by second largest value

Time:09-23

I am having a file in the following format. Column one has ~20,000 uniq entry and column 2 has ~120,000 different entry and column 3 has count associated with column 2. For a single entry in column 1 there can be multiple entry in column 2. For each unique entry in column 1, I am trying to get ratio of maximum value to second maximum value of column 3.

F1.txt

S1      S2      C1
A       A1      1
A       AA      10
A       A6      5
A       A0      4
B       BB      12
B       BC      11
B       B1      19
B       B9      4

Expected Output

S1  S2  C1  
B   B1  19  1.58333
A   AA  10  2

I can do in steps like bellow. But is there a smart way of doing in in one script?

awk 'NR==1; NR > 1 {print $0 | "sort -k3 -n -r "}' F1.txt | awk '!seen[$1] ' >del1.txt

awk 'FNR==NR{a[$2]=1; next}FNR==1{print $0;}!a[$2]' del1.txt F1.txt | awk 'NR==1; NR > 1 {print $0 | "sort -k3 -n -r"}' | awk '!seen[$1] ' >del2.txt

awk 'FNR==NR{a[$1]=$3; next}FNR==1{print $0"\t";"RT"}FNR>1 a[$1]{print $0"\t"$3/a[$1]}' del2.txt del1.txt

CodePudding user response:

#!/usr/bin/awk -f

NR == 1 { print $1, $2, $3; next }
{ data[$1][$3] = $2 }
END {
  for (key in data) {
    asorti(data[key], s, "@ind_num_desc")
    print key, data[key][s[1]], s[1], s[1] / s[2]
  }
}

This^^^ assumes an arbitrary permutation of the lines (and requires gawk (which is pretty common) or another implementation with native multi-dimensional “arrays”).

If you can make more assumptions about the input — e.g. that it is always grouped by the first column —, then you can make it more memory-efficient and get rid of multi-dimensional arrays (by not delaying the evaluation until END and instead calculating it in a per-line block each time the first column’s value changes (and then one last time in END).)

To get a different handling of equal numeric values (e.g. to report the “subkey” (column 2) of the first (instead of last) encountered occurrence of a value), you could add if (!($3 in data[$1])) ... or the like.

CodePudding user response:

Whenever you find yourself creating a pipeline containing awk, there is a very good chance that what you are trying to do can be done in a single call to awk much more efficiently.

A non-GNU awk approach that presumes all field1 'A' records are together and all 'B' records are together (as you show in your sample data) could be:

awk '
    FNR==1 { print; next }      ## 1st line, output heading
    $1 != n  {                  ## 1st field changed
        if (n) {                ## if n set, output result of last block
            printf "%s\t%s\n", rec, max / nextmax
        }
        rec = $0                ## initialize vars for next block
        n = $1
        max = $3
        nextmax = 1
        next                    ## skip to next record
    }
    {
        if ($3 > max) {         ## check if 3rd field > max
            rec = $0            ## save record
            nextmax = max       ## update nextmax
            max = $3            ## update max
        }
        else if ($3 > nextmax) {    ## if 3rd field > nextmax
            nextmax = $3            ## update nextmax
        }
    }   ## output final block results
    END { printf "%s\t%s\n", rec, max / nextmax }
' file

Example Use/Output

With your data in the file file, you would have:

$ awk '
>     FNR==1 { print; next }      ## 1st line, output heading
>     $1 != n  {                  ## 1st field changed
>         if (n) {                ## if n set, output result of last block
>             printf "%s\t%s\n", rec, max / nextmax
>         }
>         rec = $0                ## initialize vars for next block
>         n = $1
>         max = $3
>         nextmax = 1
>         next                    ## skip to next record
>     }
>     {
>         if ($3 > max) {         ## check if 3rd field > max
>             rec = $0            ## save record
>             nextmax = max       ## update nextmax
>             max = $3            ## update max
>         }
>         else if ($3 > nextmax) {    ## if 3rd field > nextmax
>             nextmax = $3            ## update nextmax
>         }
>     }   ## output final block results
>     END { printf "%s\t%s\n", rec, max / nextmax }
> ' file
S1      S2      C1
A       AA      10      2
B       B1      19      1.58333
  • Related