Find the n largest occurrences of duplicate values in a column


I have a 7 GB CSV file from which I need to find the n values that occur most frequently in the 5th column. The format of my file is:

"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","389","65311"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","88","49194"
"instance id","src IP","destination ip","12489","49194"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","63812","389"

Using the following command

awk -F, '{ count[$5]++; print $0 " is " count[$5] " times" }' example.csv

I am able to produce the following output:

"instance id","src IP","destination ip","63812","389" is 1 times
"instance id","src IP","destination ip","389","65311" is 1 times
"instance id","src IP","destination ip","63812","389" is 2 times
"instance id","src IP","destination ip","88","49194" is 1 times
"instance id","src IP","destination ip","12489","49194" is 2 times
"instance id","src IP","destination ip","63812","389" is 3 times
"instance id","src IP","destination ip","63812","389" is 4 times

but I am not able to work out how to get the top 50 rows, i.e. the rows whose 5th-column value is duplicated most often.

For example, if n=2, my output should look like the following:

"instance id","src IP","destination ip","63812","389" is 4 times
"instance id","src IP","destination ip","12489","49194" is 2 times

CodePudding user response:

If I were you, I would alter the output into something like:

1 times : "instance id","src IP","destination ip","63812","389" is 1 times
1 times : "instance id","src IP","destination ip","389","65311" is 1 times
2 times : "instance id","src IP","destination ip","63812","389" is 2 times
1 times : "instance id","src IP","destination ip","88","49169" is 1 times
2 times : "instance id","src IP","destination ip","12489","49194" is 2 times
3 times : "instance id","src IP","destination ip","63812","389" is 3 times
4 times : "instance id","src IP","destination ip","63812","389" is 4 times

And then add | sort -rn after your command so the largest counts come first (plain sort -n would list them last).

(I had been trying sort -kx -n, but the combination of double quotes, commas and spaces messes up the -k switch of the sort command.)
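
Putting it together, a sketch of the full pipeline along these lines, assuming the counting command from the question (moving the count to the front, the dedup step and the final head are additions beyond the OP's code):

awk -F, '{ print ++count[$5] " times : " $0 }' example.csv |
    sort -rn |
    awk -F, '!seen[$5]++' |
    head -2

sort -rn puts the largest running counts first; since the prefix contains no comma, the key is still the 5th comma-separated field, so the second awk keeps only the first (highest-count) line per key. On the sample data this prints the two lines the OP expects, count first.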

CodePudding user response:

Using any sort, awk, and head:

$ sort -t, -k5,5 file |
    awk '
        BEGIN { FS=OFS="," }
        # new key: print the previous record together with its count
        $5 != p5 { if (NR>1) print p0, cnt; p5=$5; p0=$0; cnt=0 }
        # count every line of the current run of identical keys
        { cnt++ }
        # do not forget the final run
        END { print p0, cnt }' |
    sort -t, -k6,6rn |
    head -2
"instance id","src IP","destination ip","63812","389",4
"instance id","src IP","destination ip","12489","49194",2

CodePudding user response:

This method uses GNU awk extensions:

Return the n most frequent values in column p:

awk 'BEGIN{n=50; p=5; PROCINFO["sorted_in"]="@val_num_desc"} {a[$p]++}
     END { for(i in a) { if (!n--) { break }; print i } }' file

Return the last record seen for each of the n most frequent values in column p:

awk 'BEGIN{n=50; p=5; PROCINFO["sorted_in"]="@val_num_desc"} {a[$p]++; b[$p]=$0}
     END { for(i in a) { if (!n--) { break }; print b[i] } }' file
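
Because b[$p]=$0 is overwritten on every occurrence of a key, the record that survives for each key is the last one seen. Note that these two generic versions rely on awk's default whitespace field splitting; the OP's comma-separated file needs FS=",", which the adapted version below sets.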

Adapting the above to the OP's expected output:

awk 'BEGIN{n=50; p=5; FS=","; PROCINFO["sorted_in"]="@val_num_desc"}
     {a[$p]++; b[$p]=$0}
     END { for(i in a) { if (!n--) { break }; print b[i],"is",a[i],"times" } }' file
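
With n=2 and the sample input, this prints exactly the two lines the OP expects. The GNU-specific PROCINFO["sorted_in"]="@val_num_desc" makes for (i in a) visit keys in descending numeric order of their values (the counts), so decrementing n inside the loop stops it after the n most frequent keys; keys with equal counts may come back in any order.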