I have a large file that looks like this:
PB.1060.1_1_1000 Chr1 484 817 20733209
PB.1060.1_1_1000 Chr1 1 293 20733996
PB.1060.1_1_1000 Chr1 287 485 20733577
PB.1060.1_2_1001 Chr1 483 816 20733209
PB.1060.1_2_1001 Chr1 286 484 20733577
I need to sort on the 3rd column while keeping the existing order of the first column. The result should be:
PB.1060.1_1_1000 Chr1 1 293 20733996
PB.1060.1_1_1000 Chr1 287 485 20733577
PB.1060.1_1_1000 Chr1 484 817 20733209
PB.1060.1_2_1001 Chr1 286 484 20733577
PB.1060.1_2_1001 Chr1 483 816 20733209
I tried sort -k1,1 -k3,3n, but on the full file the first column comes out in this order:
PB.1060.1_1000_1999
PB.1060.1_1000_1999
PB.1060.1_100_1099
PB.1060.1_100_1099
PB.1060.1_100_1099
PB.1060.1_100_1099
PB.1060.1_100_1099
PB.1060.1_1001_2000
PB.1060.1_1001_2000
PB.1060.1_1002_2001 ...
It should instead keep the order of the original file:
PB.1060.1_1_1000
PB.1060.1_1_1000
PB.1060.1_1_1000
PB.1060.1_1_1000
PB.1060.1_2_1001
PB.1060.1_2_1001
PB.1060.1_3_1002
PB.1060.1_4_1003
PB.1060.1_4_1003 ...
But I can't get it to work. Any help?
CodePudding user response:
Try sort -k1,1 -k3,3n input:
$ cat input
PB.1060.1_1_1000 Chr1 484 817 20733209
PB.1060.1_1_1000 Chr1 1 293 20733996
PB.1060.1_1_1000 Chr1 287 485 20733577
PB.1060.1_2_1001 Chr1 483 816 20733209
PB.1060.1_2_1001 Chr1 286 484 20733577
$ sort -k1,1 -k3,3n input
PB.1060.1_1_1000 Chr1 1 293 20733996
PB.1060.1_1_1000 Chr1 287 485 20733577
PB.1060.1_1_1000 Chr1 484 817 20733209
PB.1060.1_2_1001 Chr1 286 484 20733577
PB.1060.1_2_1001 Chr1 483 816 20733209
CodePudding user response:
Assumptions/Understandings:
- 1st column has already been sorted based on a 'V'ersion sort
- we need to maintain the ordering of the 1st column, then sort duplicates by the 3rd column
Adding a few rows to our sample data:
$ cat input.dat
PB.1060.1_1_1000 Chr1 484 817 20733209
PB.1060.1_1_1000 Chr1 1 293 20733996
PB.1060.1_1_1000 Chr1 287 485 20733577
PB.1060.1_2_1001 Chr1 483 816 20733209
PB.1060.1_2_1001 Chr1 286 484 20733577
PB.1060.1_100_1099 Chr1 905 423 20733234
PB.1060.1_100_1099 Chr1 1020 523 20734234
PB.1060.1_1000_1999 Chr1 3422 223 20731234
PB.1060.1_1000_1999 Chr1 200 323 20732234
PB.1060.1_1001_2000 Chr1 900 623 20735234
One sort idea:
sort -k1,1V -k3,3n input.dat
Where:
- apply a 'V'ersion sort to the 1st column
- sort the 3rd column as a 'n'umber
This generates:
PB.1060.1_1_1000 Chr1 1 293 20733996
PB.1060.1_1_1000 Chr1 287 485 20733577
PB.1060.1_1_1000 Chr1 484 817 20733209
PB.1060.1_2_1001 Chr1 286 484 20733577
PB.1060.1_2_1001 Chr1 483 816 20733209
PB.1060.1_100_1099 Chr1 905 423 20733234
PB.1060.1_100_1099 Chr1 1020 523 20734234
PB.1060.1_1000_1999 Chr1 200 323 20732234
PB.1060.1_1000_1999 Chr1 3422 223 20731234
PB.1060.1_1001_2000 Chr1 900 623 20735234
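If the 1st column turns out not to be in version-sort order (or your sort has no -V flag), a possible fallback is to number each 1st-column value by its first appearance, sort on that number and then on the 3rd column, and strip the helper fields afterwards. This is only a sketch under a few assumptions: the function name sort_within_groups is made up for illustration, fields are separated by whitespace, and rows sharing a 1st-column value should end up adjacent in the output.

```shell
# Hypothetical helper: sort rows by column 3 inside each column-1 group,
# keeping the groups in the order they first appear in the file.
sort_within_groups() {
    tab=$(printf '\t')
    awk -v OFS='\t' '
        !($1 in seen) { seen[$1] = ++n }   # number each new column-1 value
        { print seen[$1], $3, $0 }         # prepend group number and sort key
    ' "$1" |
    sort -t "$tab" -k1,1n -k2,2n |         # group number first, then column 3
    cut -f3-                               # drop the two helper fields
}
```

It would be called as sort_within_groups input.dat. Because the group number comes from the file itself, the result does not depend on how sort would order the 1st column on its own.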