Home > Blockchain >  Sort according to two columns and extract top two based on last column
Sort according to two columns and extract top two based on last column

Time:05-07

I have a file with three columns. I would like to extract rows with top two values in column 3 for each unique value in column 2.

cat file.list
run1/xx2/x2c1.txt 21 -190
run1/xx2/x2c2.txt 19 -180
run1/xx2/x2c3.txt 18 -179
run1/xx2/x2c4.txt 19 -162
run1/xx2/x2c5.txt 21 -172
run2/xx2/x2c1.txt 21 -162
run2/xx2/x2c2.txt 18 -192
run2/xx2/x2c3.txt 19 -191
run2/xx2/x2c4.txt 19 -184
run2/xx2/x2c5.txt 21 -179
run3/xx2/x2c1.txt 19 -162
run3/xx2/x2c2.txt 19 -192
run3/xx2/x2c3.txt 21 -191
run3/xx2/x2c4.txt 18 -184
run3/xx2/x2c5.txt 19 -179

expected output

run2/xx2/x2c2.txt 18 -192
run3/xx2/x2c4.txt 18 -184
run3/xx2/x2c2.txt 19 -192
run2/xx2/x2c3.txt 19 -191
run3/xx2/x2c3.txt 21 -191
run1/xx2/x2c1.txt 21 -190

I feel like some combination of sort, uniq and awk might accomplish but I can't properly execute it. I can sort by columns

sort -nk2 -nk3 file.list 

which gives me an output sorted by -k2 and -k3 as follows,

run2/xx2/x2c2.txt 18 -192
run3/xx2/x2c4.txt 18 -184
run1/xx2/x2c3.txt 18 -179
run3/xx2/x2c2.txt 19 -192
run2/xx2/x2c3.txt 19 -191
run2/xx2/x2c4.txt 19 -184
run1/xx2/x2c2.txt 19 -180
run3/xx2/x2c5.txt 19 -179
run1/xx2/x2c4.txt 19 -162
run3/xx2/x2c1.txt 19 -162
run3/xx2/x2c3.txt 21 -191
run1/xx2/x2c1.txt 21 -190
run2/xx2/x2c5.txt 21 -179
run1/xx2/x2c5.txt 21 -172
run2/xx2/x2c1.txt 21 -162

but then I get stuck on how to extract only the rows with best two scores in the last column for 18, 19 and 20.

I would really appreciate any bash solutions.

CodePudding user response:

Piping the current sort results to awk:

$ sort -nk2 -nk3 file.list | awk 'a[$2]   < 2'
run2/xx2/x2c2.txt 18 -192
run3/xx2/x2c4.txt 18 -184
run3/xx2/x2c2.txt 19 -192
run2/xx2/x2c3.txt 19 -191
run3/xx2/x2c3.txt 21 -191
run1/xx2/x2c1.txt 21 -190

Where:

  • field #2 ($2) is used as the index for array a[]
  • if the value stored in the array is less than 2 then print the current input line
  • then increment the counter ( )
  • 1st time we see a[18] the count is 0, we print the line, and increment the count by 1
  • 2nd time we see a[18] the count is 1, we print the line, and increment the count by 1
  • 3rd (to nth) time we see a[18] the count is greater than or equal to 2, we do not print the line, and increment the count

An alternative where we increment the count first:

$ sort -nk2 -nk3 file.list | awk '  a[$2] <= 2'
run2/xx2/x2c2.txt 18 -192
run3/xx2/x2c4.txt 18 -184
run3/xx2/x2c2.txt 19 -192
run2/xx2/x2c3.txt 19 -191
run3/xx2/x2c3.txt 21 -191
run1/xx2/x2c1.txt 21 -190
  • Related