I have a 7 GB CSV file in which I need to find the first n values that occur most frequently in the 5th column. The format of my file is as follows:
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","389","65311"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","88","49194"
"instance id","src IP","destination ip","12489","49194"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","63812","389"
Using the following command
awk -F , '{ print $0 " is " ++count[$5] " times" }' example.csv
I am able to get the following output
"instance id","src IP","destination ip","63812","389" is 1 times
"instance id","src IP","destination ip","389","65311" is 1 times
"instance id","src IP","destination ip","63812","389" is 2 times
"instance id","src IP","destination ip","88","49194" is 1 times
"instance id","src IP","destination ip","12489","49194" is 2 times
"instance id","src IP","destination ip","63812","389" is 3 times
"instance id","src IP","destination ip","63812","389" is 4 times
but I am not able to work out how to get the top 50 rows whose 5th-column value is duplicated the most.
For example, if n=2, my output should look like the following:
"instance id","src IP","destination ip","63812","389" is 4 times
"instance id","src IP","destination ip","12489","49194" is 2 times
CodePudding user response:
If I were you, I would alter the output into something like:
1 times : "instance id","src IP","destination ip","63812","389" is 1 times
1 times : "instance id","src IP","destination ip","389","65311" is 1 times
2 times : "instance id","src IP","destination ip","63812","389" is 2 times
1 times : "instance id","src IP","destination ip","88","49194" is 1 times
2 times : "instance id","src IP","destination ip","12489","49194" is 2 times
3 times : "instance id","src IP","destination ip","63812","389" is 3 times
4 times : "instance id","src IP","destination ip","63812","389" is 4 times
And then add | sort -n
after your command.
(I tried using sort -kx -n
but the combination of double quotes, commas and spaces messes up the -k
switch of the sort
command.)
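Concretely, that could look like the following (a sketch built on the asker's per-line running count; the sample rows and file name are taken from the question):

```shell
# Sample rows in the question's format (hypothetical values).
cat > example.csv <<'EOF'
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","389","65311"
"instance id","src IP","destination ip","63812","389"
EOF

# Prefix every output line with its running count so that sort -n
# can order the lines numerically without parsing the quoted CSV fields.
awk -F , '{ n = ++count[$5]; print n " times : " $0 " is " n " times" }' example.csv |
sort -n
```

The most frequent value ends up on the last line of the output.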
CodePudding user response:
Using any sort, awk, and head:
$ sort -t, -k5,5 file |
awk '
BEGIN { FS=OFS="," }
$5 != p5 { if (NR>1) print p0, cnt; p5=$5; p0=$0; cnt=0 }
{ cnt++ }
END { print p0, cnt }' |
sort -t, -k6,6rn |
head -2
"instance id","src IP","destination ip","63812","389",4
"instance id","src IP","destination ip","12489","49194",2
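For reference, here is the same pipeline run end-to-end against the question's sample rows (which line of a group survives as the representative can vary with the sort implementation, since only the 5th field is used as the key):

```shell
# Write the question's sample rows to example.csv (hypothetical data).
cat > example.csv <<'EOF'
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","389","65311"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","88","49194"
"instance id","src IP","destination ip","12489","49194"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","63812","389"
EOF

# Group equal 5th-column values together, count each group, emit one line
# per group with its count appended, then sort by that count (field 6)
# numerically in reverse and keep the top 2.
sort -t, -k5,5 example.csv |
awk '
    BEGIN { FS=OFS="," }
    $5 != p5 { if (NR>1) print p0, cnt; p5=$5; p0=$0; cnt=0 }
    { cnt++ }
    END { print p0, cnt }' |
sort -t, -k6,6rn |
head -2
```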
CodePudding user response:
This method uses GNU awk extensions:
Return n most frequent values in column p:
awk 'BEGIN{n=50; p=5; PROCINFO["sorted_in"]="@val_num_desc"} {a[$p]++}
END { for(i in a) { if (!n--) { break }; print i } }' file
Return last record of n most frequent values in column p:
awk 'BEGIN{n=50; p=5; PROCINFO["sorted_in"]="@val_num_desc"} {a[$p]++; b[$p]=$0}
END { for(i in a) { if (!n--) { break }; print b[i] } }' file
Adapting the above to the expected output of the OP:
awk 'BEGIN{n=50; p=5; FS=","; PROCINFO["sorted_in"]="@val_num_desc"}
{a[$p]++; b[$p]=$0}
END { for(i in a) { if (!n--) { break }; print b[i],"is",a[i],"times" } }' file