I'm looking for an efficient way to get the top 3 (expandable to top N) rows by each of several numeric fields from a large data set on a Linux server. This is a follow-on to https://stackoverflow.com/a/70117001/8823709, where the best suggestion for
"I have an awk array that aggregates bytes up and downloaded. I can sort the output by either bytes down or up and pipe that to head for the top talkers; is it possible to output two sorts using different keys?"
was:
zgrep '^1' 20211014T00*.gz |
awk '
NR > 1 {
    key = $1 " " $2
    bytesDown[key] += $3
    bytesUp[key] += $4
}
END {
    cmd = "sort -rn | head -3"
    for ( key in bytesDown ) {
        print bytesDown[key], bytesUp[key], key | cmd
    }
    close(cmd)
    cmd = "sort -rnk2 | head -3"
    for ( key in bytesDown ) {
        print bytesDown[key], bytesUp[key], key | cmd
    }
    close(cmd)
}
'
However, the data set can range from thousands to many millions of rows. Rather than read the entire set into an array, sort it, and discard the vast majority, is it practical to maintain an array of just the current top 10 as the data is read in? Absolute speed is less of an issue than memory consumption, which is a relatively limited resource on the server.
For example, given the following sample input:
ip1 fqdn101 101 10
ip2 fqdn102 102 11
ip3 fqdn103 103 12
ip4 fqdn104 104 13
ip1 fqdn101 105 14
ip1 fqdn102 106 15
ip1 fqdn103 107 16
ip1 fqdn104 108 17
ip2 fqdn103 109 16
ip2 fqdn104 110 17
That should output
ip1 fqdn101 206 24
ip2 fqdn104 110 17
ip2 fqdn103 109 16
and
ip1 fqdn101 206 24
ip2 fqdn104 110 17
ip1 fqdn104 108 17
Open to options other than awk - though that would be my default starting point - as long as they are available on the corporate Linux server build I'm given...
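To sketch what I mean by a running top N (a rough illustration only; the n and col parameters and the sample data are mine): if the input were already reduced to one aggregated row per key, a single awk pass could keep just the current top n rows in a small sorted array. The catch I can't see past is that the per-key aggregation itself seems to need every key in memory.

```shell
# Rough sketch: keep only the current top-n rows while streaming, ranking
# on field col. Assumes one row per key, i.e. aggregation already done.
# The sample rows are the aggregated totals from the example above.
top3=$(
printf '%s\n' \
    'ip1 fqdn101 206 24' 'ip1 fqdn102 106 15' 'ip1 fqdn103 107 16' \
    'ip1 fqdn104 108 17' 'ip2 fqdn102 102 11' 'ip2 fqdn103 109 16' \
    'ip2 fqdn104 110 17' 'ip3 fqdn103 103 12' 'ip4 fqdn104 104 13' |
awk -v n=3 -v col=3 '
{
    # vals[1..cnt] holds the kept values sorted descending; rows[] the lines
    for (i = 1; i <= cnt && ($col + 0) <= vals[i]; i++) ;
    if (i <= n) {
        # shift lower entries down (dropping the last if already full)
        for (j = (cnt < n ? cnt + 1 : n); j > i; j--) {
            vals[j] = vals[j-1]
            rows[j] = rows[j-1]
        }
        vals[i] = $col + 0
        rows[i] = $0
        if (cnt < n) cnt++
    }
}
END { for (i = 1; i <= cnt; i++) print rows[i] }
'
)
printf '%s\n' "$top3"
```

Memory here is O(n) regardless of input size, which is the attraction; whether the aggregation step can be folded in is exactly my question.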
Answer:
Since you said "Absolute speed is less of an issue than memory consumption", here's how to do what you want quickly and using minimal memory:
$ cat tst.sh
#!/usr/bin/env bash
tmp=$(mktemp) || exit 1
trap 'rm -f "$tmp"; exit' 0
sort -k1,1 -k2,2 "${@:--}" |
awk '
{ key = $1 " " $2 }
key != prev {
    if ( NR > 1 ) {
        print prev, tot3, tot4
    }
    tot3 = tot4 = 0
    prev = key
}
{
    tot3 += $3
    tot4 += $4
}
END {
    print key, tot3, tot4
}
' > "$tmp"
sort -nrk3,3 "$tmp" | head -3
printf '\n'
sort -nrk4,4 "$tmp" | head -3
$ ./tst.sh file
ip1 fqdn101 206 24
ip2 fqdn104 110 17
ip2 fqdn103 109 16

ip1 fqdn101 206 24
ip2 fqdn104 110 17
ip1 fqdn104 108 17
The only tool in the above that has to handle the whole of the input at once is sort, and sort is designed to use demand paging, temporary spill files, etc. to handle huge inputs, so it doesn't need to hold the whole input in memory to work. The temp file used above will also be much smaller than your original input, as it only stores one line of totals per key pair.
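If you need to bound that even further, GNU sort's buffer can be capped explicitly with -S, and -T picks where its spill files go (both GNU options; the 1M size and /tmp below are arbitrary choices for illustration):

```shell
# Demo: sort 100k descending integers with the in-memory buffer capped at
# 1 MiB; sort spills sorted runs to files under -T and merges them.
smallest=$(seq 100000 -1 1 | sort -S 1M -T /tmp -n | head -3)
printf '%s\n' "$smallest"
```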
If you don't want to use a temp file, you could do this instead; it will take a bit longer to run (maybe 1.5 times?) since it sorts and aggregates the full input twice:
$ cat tst.sh
#!/usr/bin/env bash
do1tot() {
local totPos="$1"
shift
sort -k1,1 -k2,2 "${@:--}" |
awk '
{ key = $1 " " $2 }
key != prev {
    if ( NR > 1 ) {
        print prev, tot3, tot4
    }
    tot3 = tot4 = 0
    prev = key
}
{
    tot3 += $3
    tot4 += $4
}
END {
    print key, tot3, tot4
}
' |
sort -nrk"$totPos,$totPos" |
head -3
}
do1tot 3 "${@:--}"
printf "\n"
do1tot 4 "${@:--}"
$ ./tst.sh file
ip1 fqdn101 206 24
ip2 fqdn104 110 17
ip2 fqdn103 109 16

ip1 fqdn101 206 24
ip2 fqdn104 110 17
ip1 fqdn104 108 17
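If your shell is bash, you could also avoid both the temp file and the second full pass with process substitution, fanning the one aggregated stream out to both sorts. This is just a sketch: top_down.txt and top_up.txt are illustrative names, and the >( ) branch completes asynchronously, so the crude sleep (or a proper synchronization) is needed before its file is read.

```shell
#!/usr/bin/env bash
# Sketch: one sort+aggregate pass, fanned out via tee to both rankings.
# Requires bash (process substitution); output file names are illustrative.
printf '%s\n' \
    'ip1 fqdn101 101 10' 'ip2 fqdn102 102 11' 'ip3 fqdn103 103 12' \
    'ip4 fqdn104 104 13' 'ip1 fqdn101 105 14' 'ip1 fqdn102 106 15' \
    'ip1 fqdn103 107 16' 'ip1 fqdn104 108 17' 'ip2 fqdn103 109 16' \
    'ip2 fqdn104 110 17' |
sort -k1,1 -k2,2 |
awk '
{ key = $1 " " $2 }
key != prev {
    if ( NR > 1 ) {
        print prev, tot3, tot4
    }
    tot3 = tot4 = 0
    prev = key
}
{
    tot3 += $3
    tot4 += $4
}
END {
    print key, tot3, tot4
}
' |
tee >(sort -nrk3,3 | head -3 > top_down.txt) |
sort -nrk4,4 | head -3 > top_up.txt
sleep 1   # crude: give the asynchronous >( ) branch time to finish writing
cat top_down.txt top_up.txt
```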