Sort 3rd column when others are already sorted

Time:02-23

I have a large file like this:

PB.1060.1_1_1000        Chr1      484     817     20733209
PB.1060.1_1_1000        Chr1      1       293     20733996
PB.1060.1_1_1000        Chr1      287     485     20733577
PB.1060.1_2_1001        Chr1      483     816     20733209
PB.1060.1_2_1001        Chr1      286     484     20733577

I need to sort on the 3rd column within each group while keeping the existing order of the 1st column. The expected result is:

PB.1060.1_1_1000        Chr1      1       293     20733996
PB.1060.1_1_1000        Chr1      287     485     20733577
PB.1060.1_1_1000        Chr1      484     817     20733209
PB.1060.1_2_1001        Chr1      286     484     20733577
PB.1060.1_2_1001        Chr1      483     816     20733209

I tried sort -k1,1 -k3,3n, but on the full file the first column comes out reordered like this:

PB.1060.1_1000_1999
PB.1060.1_1000_1999
PB.1060.1_100_1099
PB.1060.1_100_1099
PB.1060.1_100_1099
PB.1060.1_100_1099
PB.1060.1_100_1099
PB.1060.1_1001_2000
PB.1060.1_1001_2000
PB.1060.1_1002_2001 ...

It should keep the order of the original file:

PB.1060.1_1_1000
PB.1060.1_1_1000
PB.1060.1_1_1000
PB.1060.1_1_1000
PB.1060.1_2_1001
PB.1060.1_2_1001
PB.1060.1_3_1002
PB.1060.1_4_1003
PB.1060.1_4_1003 ...

I can't get it to work. Any help?

CodePudding user response:

Try sort -k1,1 -k3,3n input:

$ cat input
PB.1060.1_1_1000        Chr1      484     817     20733209
PB.1060.1_1_1000        Chr1      1       293     20733996
PB.1060.1_1_1000        Chr1      287     485     20733577
PB.1060.1_2_1001        Chr1      483     816     20733209
PB.1060.1_2_1001        Chr1      286     484     20733577
$ sort -k1,1 -k3,3n input
PB.1060.1_1_1000        Chr1      1       293     20733996
PB.1060.1_1_1000        Chr1      287     485     20733577
PB.1060.1_1_1000        Chr1      484     817     20733209
PB.1060.1_2_1001        Chr1      286     484     20733577
PB.1060.1_2_1001        Chr1      483     816     20733209
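
On this five-row sample a plain string comparison of the first column happens to agree with the original order. With the longer keys from the question it no longer does. The sketch below illustrates the difference on three of those keys; it assumes GNU sort (-V is not part of POSIX) and pins LC_ALL=C so the plain comparison is byte-wise and reproducible. The variable names are illustrative only.

```shell
# Three first-column keys from the question, in their original file order.
keys=$(printf '%s\n' PB.1060.1_2_1001 PB.1060.1_100_1099 PB.1060.1_1000_1999)

# Plain comparison: characters are compared one by one, so "1000_..." lands
# before "100_..." ('0' sorts before '_') and both land before "2_...".
plain=$(printf '%s\n' "$keys" | LC_ALL=C sort)

# Version sort: runs of digits are compared as numbers (2 < 100 < 1000),
# which reproduces the original order.
version=$(printf '%s\n' "$keys" | LC_ALL=C sort -V)

printf 'plain:\n%s\n\nversion:\n%s\n' "$plain" "$version"
```

In other locales the plain order can differ again, which is one more reason not to rely on it for keys with embedded numbers.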

CodePudding user response:

Assumptions/Understandings:

  • 1st column has already been sorted based on a 'V'ersion sort
  • we need to maintain the ordering of the 1st column, then sort duplicates by the 3rd column
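
The first assumption can be checked before relying on it. A minimal sketch, assuming GNU sort and whitespace-separated columns; the function name first_column_is_version_sorted is hypothetical:

```shell
# Succeeds (exit status 0) only if the 1st column of the given file is
# already in version order. Only the key column is passed to sort, so the
# order of the other columns does not affect the check.
first_column_is_version_sorted() {
  # -C: check silently whether the input is sorted; -V: version comparison
  awk '{ print $1 }' "$1" | sort -C -V
}
```

Usage would look like `first_column_is_version_sorted input.dat && echo "already in version order"`.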

Adding a few rows to our sample data:

$ cat input.dat
PB.1060.1_1_1000        Chr1      484     817     20733209
PB.1060.1_1_1000        Chr1      1       293     20733996
PB.1060.1_1_1000        Chr1      287     485     20733577
PB.1060.1_2_1001        Chr1      483     816     20733209
PB.1060.1_2_1001        Chr1      286     484     20733577
PB.1060.1_100_1099      Chr1      905     423     20733234
PB.1060.1_100_1099      Chr1      1020    523     20734234
PB.1060.1_1000_1999     Chr1      3422    223     20731234
PB.1060.1_1000_1999     Chr1      200     323     20732234
PB.1060.1_1001_2000     Chr1      900     623     20735234

One sort idea:

sort -k1,1V -k3,3n input.dat

Where:

  • apply a 'V'ersion sort to the 1st column
  • sort the 3rd column as a 'n'umber

This generates:

PB.1060.1_1_1000        Chr1      1       293     20733996
PB.1060.1_1_1000        Chr1      287     485     20733577
PB.1060.1_1_1000        Chr1      484     817     20733209
PB.1060.1_2_1001        Chr1      286     484     20733577
PB.1060.1_2_1001        Chr1      483     816     20733209
PB.1060.1_100_1099      Chr1      905     423     20733234
PB.1060.1_100_1099      Chr1      1020    523     20734234
PB.1060.1_1000_1999     Chr1      200     323     20732234
PB.1060.1_1000_1999     Chr1      3422    223     20731234
PB.1060.1_1001_2000     Chr1      900     623     20735234
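
If the first column is not guaranteed to be in version order, or sort -V is not available, one alternative is a decorate-sort-undecorate pipeline that keeps groups in the order they first appear in the file, whatever that order is. This is a sketch rather than a tested solution for the full data set: it assumes whitespace-separated columns, and the function name sort_within_groups is hypothetical.

```shell
# Sort rows by the 3rd column within each 1st-column group, keeping the
# groups in the order in which they first appear in the input.
sort_within_groups() {
  tab=$(printf '\t')
  awk '{
         # number each distinct 1st-column value by order of first appearance
         if (!($1 in seen)) seen[$1] = ++n
         # decorate: prepend the group number and the 3rd column, tab-separated
         print seen[$1] "\t" $3 "\t" $0
       }' "$@" |
    sort -t "$tab" -k1,1n -k2,2n |   # group number first, then 3rd column, both numeric
    cut -f3-                         # undecorate: drop the two helper fields
}
```

It can be called as `sort_within_groups input.dat`, or fed from a pipe. awk keeps one array entry per distinct first-column value, so memory use grows with the number of groups rather than the number of rows.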