Efficient top 10 sort of multiple columns

Looking for an efficient way to get the top 3 (expandable) of multiple numeric fields from a large data set on a Linux server. This is a follow-on to https://stackoverflow.com/a/70117001/8823709, where the best suggestion for

"I have an awk array that aggregates bytes up and downloaded. I can sort the output by either bytes down or up and pipe that to head for the top talkers; is it possible to output two sorts using different keys?"

was:

zgrep '^1' 20211014T00*.gz |
awk '
    NR > 1 {
        key = $1 " " $2
        bytesDown[key] += $3
        bytesUp[key]   += $4
    }
    END {
        cmd = "sort -rn | head -3"
        for ( key in bytesDown ) {
            print bytesDown[key], bytesUp[key], key | cmd
        }
        close(cmd)

        cmd = "sort -rnk2 | head -3"
        for ( key in bytesDown ) {
            print bytesDown[key], bytesUp[key], key | cmd
        }
        close(cmd)
    }
'

However, as the data set can range from thousands to many millions of rows, rather than reading the entire set into an array, sorting, and discarding the vast majority, is it practical to maintain an array of just the top 10s as the data is read in? Absolute speed is less of an issue than memory consumption, which is a relatively limited resource on the server.

For example, given the following sample input:

ip1     fqdn101 101     10
ip2     fqdn102 102     11
ip3     fqdn103 103     12
ip4     fqdn104 104     13
ip1     fqdn101 105     14
ip1     fqdn102 106     15
ip1     fqdn103 107     16
ip1     fqdn104 108     17
ip2     fqdn103 109     16
ip2     fqdn104 110     17

That should output

ip1 fqdn101 206 24
ip2 fqdn104 110 17
ip2 fqdn103 109 16

and

ip1 fqdn101 206 24
ip2 fqdn104 110 17
ip1 fqdn104 108 17

Open to options other than awk - though that would be my default starting point - as long as they are available on the corporate Linux server build I'm given...
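For reference, here is a sketch of the "maintain a top-N array while reading" idea from the question. Note it doesn't reduce the aggregation memory (the per-key totals still have to live in one array), but it does replace the external `sort | head` pipes with a fixed-size insertion in the `END` block. The array names and the `insert` helper are my own illustration, not from any answer, and it assumes byte counts are non-negative so that empty slots (which compare as 0) sort low:

```shell
# Sketch: aggregate per key as usual, then keep only a top-N list per
# metric by inserting into small sorted arrays instead of piping to sort.
out=$(awk -v N=3 '
    { key = $1 " " $2; down[key] += $3; up[key] += $4 }

    # Insert value v / key k into the sorted top-N arrays val[1..N], lbl[1..N].
    # Empty slots are uninitialized and compare as 0, so this assumes v >= 0.
    function insert(val, lbl, v, k,    i, j) {
        for (i = 1; i <= N; i++) {
            if (v > val[i]) {
                for (j = N; j > i; j--) { val[j] = val[j-1]; lbl[j] = lbl[j-1] }
                val[i] = v; lbl[i] = k
                break
            }
        }
    }

    END {
        for (key in down) {
            insert(dv, dk, down[key], key)
            insert(uv, uk, up[key],   key)
        }
        for (i = 1; i <= N; i++) print dk[i], down[dk[i]], up[dk[i]]
        print ""
        for (i = 1; i <= N; i++) print uk[i], down[uk[i]], up[uk[i]]
    }
' <<'EOF'
ip1     fqdn101 101     10
ip2     fqdn102 102     11
ip3     fqdn103 103     12
ip4     fqdn104 104     13
ip1     fqdn101 105     14
ip1     fqdn102 106     15
ip1     fqdn103 107     16
ip1     fqdn104 108     17
ip2     fqdn103 109     16
ip2     fqdn104 110     17
EOF
)
printf '%s\n' "$out"
```

One caveat: because awk's `for (key in ...)` iteration order is unspecified, rows with tied values (the two 17s in the bytes-up sort of the sample data) can come out in either order.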

CodePudding user response:

Since you said "Absolute speed is less of an issue than memory consumption", here's how to do what you want quickly and using minimal memory:

$ cat tst.sh
#!/usr/bin/env bash

tmp=$(mktemp) || exit 1
trap 'rm -f "$tmp"; exit' 0

sort -k1,1 -k2,2 "${@:--}" |
awk '
    { key = $1 " " $2 }
    key != prev {
        if ( NR>1 ) {
            print prev, tot3, tot4
        }
        tot3 = tot4 = 0
        prev = key
    }
    {
        tot3 += $3
        tot4 += $4
    }
    END {
        print key, tot3, tot4
    }
' > "$tmp"

sort -nrk3,3 "$tmp" | head -3
printf '\n'
sort -nrk4,4 "$tmp" | head -3

$ ./tst.sh file
ip1 fqdn101 206 24
ip2 fqdn104 110 17
ip2 fqdn103 109 16

ip1 fqdn101 206 24
ip2 fqdn104 110 17
ip1 fqdn104 108 17

The only tool in the above that has to handle the whole of the input at once is sort, and it's designed to use temp files, demand paging, etc. to handle huge files, so it doesn't need to be able to store the whole of the input in memory to work. The temp file used above will be much smaller than your original input file, as it only stores the totals for each key pair.
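If even sort's default in-memory buffer is a concern, GNU coreutils sort accepts a `-S` size cap (and `-T dir` to choose where spill files go); beyond the cap it merges from temp files rather than growing resident memory. These are GNU extensions, so check the sort on your corporate build. A trivial demonstration:

```shell
# GNU sort: cap the in-memory buffer at ~1 MiB; inputs larger than that
# are spilled to temp files and merge-sorted instead of held in memory
out=$(printf '3\n1\n2\n' | sort -S 1M -rn)
printf '%s\n' "$out"
```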

If you don't want to use a temp file, then you could do the following instead, which will take a bit longer (maybe 1.5 times?) to run:

$ cat tst.sh
#!/usr/bin/env bash

do1tot() {
    local totPos="$1"
    shift

    sort -k1,1 -k2,2 "${@:--}" |
    awk '
        { key = $1 " " $2 }
        key != prev {
            if ( NR>1 ) {
                print prev, tot3, tot4
            }
            tot3 = tot4 = 0
            prev = key
        }
        {
            tot3 += $3
            tot4 += $4
        }
        END {
            print key, tot3, tot4
        }
    ' |
    sort -nrk"$totPos,$totPos" |
    head -3
}

do1tot 3 "${@:--}"
printf "\n"
do1tot 4 "${@:--}"

$ ./tst.sh file
ip1 fqdn101 206 24
ip2 fqdn104 110 17
ip2 fqdn103 109 16

ip1 fqdn101 206 24
ip2 fqdn104 110 17
ip1 fqdn104 108 17