Home > Enterprise >  alphanumeric or "Version sort" in awk array traversal
alphanumeric or "Version sort" in awk array traversal

Time:09-22

I work with bioinformatic files that have a combination of characters and numbers in a row like below

EDIT updated Example

chr1    1    100
chr10   1    500
chr2    33   52
chr5    11   66
chr22   99   1052
chr11   444  2141
chr2    555  1200
chr7    300   444
chr7    44   222
chr21   24   6023
chr16   224  5521
chr3    3    200
chrX    6234    79593
chr1    5291    5500
chrY    204  310

which I want to sort like this

chr1    1    100
chr1    5291    5500
chr2    33   52
chr2    555  1200
chr3    3    200
chr5    11   66
chr7    44   222
chr7    300   444
chr10   1    500
chr11   444  2141
chr16   224  5521
chr21   24   6023
chr22   99   1052
chrX    6234    79593
chrY    204  310

I am using controlled array transversal to sort these as I scan them with for loops, but the established sorting methods are either numerical, string, or type sorting. I would like something exactly like the GNU sort -V sorting. In the link below I see that I can build a custom function for sorting, but I am not sure how to do this for alphanumeric values.

I had the idea to do something like this and sort the letter values and numeric parts separately and the figure out how to combine them, but I'm not sure exactly how to do that.

echo "d23FgE55" | awk '{alph=$1;num=$1;gsub("[0-9] ","-",alph);gsub("[A-z] ","-",num);print alph,num}'
d-FgE- -23-55

EDIT

I apologize, one of the answers reminded me of some of the more complicated examples I run into sometimes that made me ask about something closer to sort -V behavior.

Unsorted example

chr1_KI270708v1_random  9240458 9393655
chrUn_KI270589v1        5789405 6182867
chr19_GL383576v1_alt    4363702 4753141
chr8_KI270820v1_alt     1008865 1426444
chrUn_GL000220v1        7612088 7825236
chrUn_KI270591v1        9457975 9812609
chr19_GL949747v2_alt    1578276 2033118
chr13_KI270841v1_alt    8680841 9033557
chr17_KI270859v1_alt    3996864 4344945
chr20_GL383577v2_alt    3002112 3480396
chrUn_KI270322v1        1563181 1629375
chrUn_KI270419v1        5482364 5893900
chrUn_KI270310v1        4229845 4626802
chr17_KI270907v1_alt    1735306 2201566
chr9    2052238 2476827
chr1_KI270713v1_random  7260088 7358876
chr19_KI270891v1_alt    7485890 7719006
chr19_KI270917v1_alt    7816269 7864474
chrUn_KI270378v1        8848225 9158581
chr19_KI270933v1_alt    1600444 2095219
chr3_KI270937v1_alt     9902343 10113942
chr12_KI270904v1_alt    6924313 7067502
chrUn_GL000214v1        2468418 2728693
chr19_KI270884v1_alt    2760167 3027068
chrUn_KI270582v1        6390491 6398266
chr5_KI270897v1_alt     324409  548890
chr1_GL383520v2_alt     3718510 3948906
chr19_KI270888v1_alt    3809944 3824324
chr11_JH159137v1_alt    6090599 6480896
chr1_KI270762v1_alt     4196600 4684821
chr11_KI270831v1_alt    2112401 2490081
chr5_GL339449v2_alt     7220557 7718111
chr19_KI270889v1_alt    8709969 8823256
chr19_KI270931v1_alt    6356216 6811953
chr5_GL383530v1_alt     109389  128439
chr11_KI270927v1_alt    2119470 2348459
chr17_KI270730v1_random 404268  854145
chrUn_KI270387v1        7430161 7648856
chr10   2656835 2873499
chr19_GL383573v1_alt    7863497 7896279
chrUn_KI270741v1        1292450 1371680
chrUn_KI270335v1        7360266 7748998
chr9_GL383539v1_alt     3394859 3499461
chrX    4490524 4892623
chrUn_KI270311v1        963681  1069745
chr11_JH159136v1_alt    8171978 8319851
chr17_JH159148v1_alt    3708868 3984631
chrUn_KI270544v1        7025954 7392905
chr19_KI270888v1_alt    9683785 10166473
chrUn_KI270521v1        6924036 7001552
chr1_KI270710v1_random  8843336 9304602
chr19_GL949746v1_alt    8572018 8832793
chrUn_KI270322v1        9392841 9512920
chrUn_KI270366v1        3332191 3576201
chr13_KI270841v1_alt    9033828 9276044
chr19_GL949748v2_alt    3321575 3743545
chr16   3704062 4122526
chr17_GL383563v3_alt    7476487 7845527
chr8_KI270810v1_alt     8499569 8873953
chr22_KI270732v1_random 9187626 9418154
chr20_GL383577v2_alt    3031797 3421627
chr9_GL383541v1_alt     3771485 3927905
chr9_KI270720v1_random  8948742 9157627
chr16_KI270854v1_alt    7278604 7297845
chr1_KI270763v1_alt     1518275 1527443
chrUn_KI270517v1        3377374 3454859
chr15_KI270850v1_alt    6358541 6822565
chr22_KB663609v1_alt    4944645 4971886
chr7_KI270806v1_alt     9032201 9471652
chrUn_KI270438v1        8523562 8944980
chr17_KI270730v1_random 3544067 3796807
chr18   1856815 2144546
chr20_KI270869v1_alt    2269342 2353172
chr5_GL383530v1_alt     2139701 2285854
chr8_KI270819v1_alt     7048265 7503415
chr17_JH159148v1_alt    7040113 7042904
chr5_KI270793v1_alt     9363008 9816819
chr19_KI270931v1_alt    1794178 2143519
chr3_KI270778v1_alt     2228100 2549359
chr19_KI270932v1_alt    8320855 8486835
chr12_GL877876v1_alt    3736839 3820171
chr12_GL877876v1_alt    2805577 2974710
chr4_KI270789v1_alt     3309756 3669565
chr19_KI270917v1_alt    9443280 9678387
chr11_KI270831v1_alt    8603751 9028904
chrUn_KI270387v1        5189812 5439563
chrUn_KI270507v1        599827  666674
chr1_KI270711v1_random  6111532 6446935
chr2_KI270773v1_alt     8604675 8922311
chr16_KI270856v1_alt    5855373 6089898
chr3    3097856 3436204
chr13_KI270840v1_alt    3127654 3295739
chr15_KI270849v1_alt    8948327 9336376
chr18_KI270863v1_alt    3984764 4166006
chr19_KI270920v1_alt    5554949 5919294
chr12   6624798 7106478
chr4_GL383528v1_alt     2099770 2280753
chr2_KI270769v1_alt     8706329 9147304
chr8_KI270812v1_alt     6146462 6388606

Sorted Output

chr1_GL383520v2_alt     3718510 3948906
chr1_KI270708v1_random  9240458 9393655
chr1_KI270710v1_random  8843336 9304602
chr1_KI270711v1_random  6111532 6446935
chr1_KI270713v1_random  7260088 7358876
chr1_KI270762v1_alt     4196600 4684821
chr1_KI270763v1_alt     1518275 1527443
chr2_KI270769v1_alt     8706329 9147304
chr2_KI270773v1_alt     8604675 8922311
chr3    3097856 3436204
chr3_KI270778v1_alt     2228100 2549359
chr3_KI270937v1_alt     9902343 10113942
chr4_GL383528v1_alt     2099770 2280753
chr4_KI270789v1_alt     3309756 3669565
chr5_GL339449v2_alt     7220557 7718111
chr5_GL383530v1_alt     109389  128439
chr5_GL383530v1_alt     2139701 2285854
chr5_KI270793v1_alt     9363008 9816819
chr5_KI270897v1_alt     324409  548890
chr7_KI270806v1_alt     9032201 9471652
chr8_KI270810v1_alt     8499569 8873953
chr8_KI270812v1_alt     6146462 6388606
chr8_KI270819v1_alt     7048265 7503415
chr8_KI270820v1_alt     1008865 1426444
chr9    2052238 2476827
chr9_GL383539v1_alt     3394859 3499461
chr9_GL383541v1_alt     3771485 3927905
chr9_KI270720v1_random  8948742 9157627
chr10   2656835 2873499
chr11_JH159136v1_alt    8171978 8319851
chr11_JH159137v1_alt    6090599 6480896
chr11_KI270831v1_alt    2112401 2490081
chr11_KI270831v1_alt    8603751 9028904
chr11_KI270927v1_alt    2119470 2348459
chr12   6624798 7106478
chr12_GL877876v1_alt    2805577 2974710
chr12_GL877876v1_alt    3736839 3820171
chr12_KI270904v1_alt    6924313 7067502
chr13_KI270840v1_alt    3127654 3295739
chr13_KI270841v1_alt    8680841 9033557
chr13_KI270841v1_alt    9033828 9276044
chr15_KI270849v1_alt    8948327 9336376
chr15_KI270850v1_alt    6358541 6822565
chr16   3704062 4122526
chr16_KI270854v1_alt    7278604 7297845
chr16_KI270856v1_alt    5855373 6089898
chr17_GL383563v3_alt    7476487 7845527
chr17_JH159148v1_alt    3708868 3984631
chr17_JH159148v1_alt    7040113 7042904
chr17_KI270730v1_random 404268  854145
chr17_KI270730v1_random 3544067 3796807
chr17_KI270859v1_alt    3996864 4344945
chr17_KI270907v1_alt    1735306 2201566
chr18   1856815 2144546
chr18_KI270863v1_alt    3984764 4166006
chr19_GL383573v1_alt    7863497 7896279
chr19_GL383576v1_alt    4363702 4753141
chr19_GL949746v1_alt    8572018 8832793
chr19_GL949747v2_alt    1578276 2033118
chr19_GL949748v2_alt    3321575 3743545
chr19_KI270884v1_alt    2760167 3027068
chr19_KI270888v1_alt    3809944 3824324
chr19_KI270888v1_alt    9683785 10166473
chr19_KI270889v1_alt    8709969 8823256
chr19_KI270891v1_alt    7485890 7719006
chr19_KI270917v1_alt    7816269 7864474
chr19_KI270917v1_alt    9443280 9678387
chr19_KI270920v1_alt    5554949 5919294
chr19_KI270931v1_alt    1794178 2143519
chr19_KI270931v1_alt    6356216 6811953
chr19_KI270932v1_alt    8320855 8486835
chr19_KI270933v1_alt    1600444 2095219
chr20_GL383577v2_alt    3002112 3480396
chr20_GL383577v2_alt    3031797 3421627
chr20_KI270869v1_alt    2269342 2353172
chr22_KB663609v1_alt    4944645 4971886
chr22_KI270732v1_random 9187626 9418154
chrUn_GL000214v1        2468418 2728693
chrUn_GL000220v1        7612088 7825236
chrUn_KI270310v1        4229845 4626802
chrUn_KI270311v1        963681  1069745
chrUn_KI270322v1        1563181 1629375
chrUn_KI270322v1        9392841 9512920
chrUn_KI270335v1        7360266 7748998
chrUn_KI270366v1        3332191 3576201
chrUn_KI270378v1        8848225 9158581
chrUn_KI270387v1        5189812 5439563
chrUn_KI270387v1        7430161 7648856
chrUn_KI270419v1        5482364 5893900
chrUn_KI270438v1        8523562 8944980
chrUn_KI270507v1        599827  666674
chrUn_KI270517v1        3377374 3454859
chrUn_KI270521v1        6924036 7001552
chrUn_KI270544v1        7025954 7392905
chrUn_KI270582v1        6390491 6398266
chrUn_KI270589v1        5789405 6182867
chrUn_KI270591v1        9457975 9812609
chrUn_KI270741v1        1292450 1371680
chrX    4490524 4892623

https://www.gnu.org/software/gawk/manual/html_node/Controlling-Array-Traversal.html

CodePudding user response:

A gnu-awk approach that splits input into numerical and string arrays and then use 2 different sorting types:

awk '{
   v = $1
   gsub(/[a-z] /, "", v)
   if (v 0 == v)
      narr[v * 10^8   $2] = $0
   else
      sarr[v,$2] = $0
}
END {
   PROCINFO["sorted_in"]="@ind_num_asc"
   for (i in narr) print narr[i]
   PROCINFO["sorted_in"]="@ind_str_asc"
   for (i in sarr) print sarr[i]
}' file

chr1    1    100
chr1    5291    5500
chr2    33   52
chr2    555  1200
chr3    3    200
chr5    11   66
chr7    44   222
chr7    300   444
chr10   1    500
chr11   444  2141
chr16   224  5521
chr21   24   6023
chr22   99   1052
chrX    6234    79593
chrY    204  310

CodePudding user response:

This might be all you need if you don't mind letters sorting before numbers in a numeric sort (or can find a locale to use where that isn't the case):

$ cat tst.awk
{ vals[gensub(/chr/,"",1,$1)][$2][$3] }
END {
    OFS = "\t"

    PROCINFO["sorted_in"] = "@ind_num_asc"
    for (chr in vals) {
        for (beg in vals[chr]) {
            for (end in vals[chr][beg]) {
                print "chr"chr, beg, end
            }
        }
    }
}

$ awk -f tst.awk file
chrX    6234    79593
chrY    204     310
chr1    1       100
chr1    5291    5500
chr2    33      52
chr2    555     1200
chr3    3       200
chr5    11      66
chr7    44      222
chr7    300     444
chr10   1       500
chr11   444     2141
chr16   224     5521
chr21   24      6023
chr22   99      1052

The above uses GNU awk for arrays of arrays and sorted_in.

CodePudding user response:

Assumptions:

  • first field will always start with at least one letter and be followed by 0 or more numbers (ie, [:alpha:] [:digit:]*)
  • letters and digits will never be intermixed (eg, d23FgE55 won't occur)
  • first field is made up only of letters and numbers (ie, [:alnum:])
  • entire input file fits into ~60% of available memory (otherwise loading entire file into an awk array may risk an 'out of memory' error)

Adding a few more data points to the input:

$ cat dat.raw
chr1    1    100
chr10   1    500
chr2    33   52
bec9   3   17
chr5    11   66
chr22   99   1052
chr11   444  2141
chr2    555  1200
chr7    300   444
defA   3    15
def7   13    15
def7   3    15
chr7    44   222
chr21   24   6023
chr16   224  5521
chr3    3    200
chrX    6234    79593
chr1    5291    5500
chrY    204  310

Using GNU awk for array of arrays and PROCINFO["sorted_in"]:

awk '
     { keyL=gensub(/([[:digit:]])/,"","g",$1)    # strip numbers from 1st field
       keyN=gensub(/([[:alpha:]])/,"","g",$1)    # strip letters from 1st field

       if ( keyN == "" ) keyN=999999999          # if no numbers in 1st field use a really big number to insure these rows go to the end of the list

       arr[keyL][keyN][$2]=$0                    # arr[] index = 1st field letters / 1st field numbers / 2nd field
     }

END  { PROCINFO["sorted_in"]="@ind_str_asc"      # sort 1st index as string
       for (i in arr) {
           PROCINFO["sorted_in"]="@ind_num_asc"  # sort 2nd/3rd indices as numbers
           for (j in arr[i])
               for (k in arr[i][j])
                   print arr[i][j][k]
       }
     }
' dat.raw

This generates:

bec9   3   17
chr1    1    100
chr1    5291    5500
chr2    33   52
chr2    555  1200
chr3    3    200
chr5    11   66
chr7    44   222
chr7    300   444
chr10   1    500
chr11   444  2141
chr16   224  5521
chr21   24   6023
chr22   99   1052
chrX    6234    79593
chrY    204  310
def7   3    15
def7   13    15
defA   3    15

CodePudding user response:

The version sorting rules of GNU utilities are not trivial to fully implement.

Here is a ruby. It works on THIS example but does not fully implement the GNU -V method. It might be easier to customize than a GNU awk.

ruby -e 'ar=$<.read.split(/\r?\n/); 
puts ar.sort_by { |v| [v[/^[a-zA-Z] /]]   v.scan(/\d /).
    map{|e| e.to_i} }.
    join("\n")' file
chr1    1    100
chr1    5291    5500
chr2    33   52
chr2    555  1200
chr3    3    200
chr5    11   66
chr7    44   222
chr7    300   444
chr10   1    500
chr11   444  2141
chr16   224  5521
chr21   24   6023
chr22   99   1052
chrX    6234    79593
chrY    204  310
  • Related