How to sort data based on the value of a column for part (multiple lines) of a file?

Time:04-01

My data in the file file1 look like

3
0
2 0.5
1 0.8
3 0.2
3
1
2 0.1
3 0.8
1 0.4
3
2
1 0.8
2 0.4
3 0.3

Each block has the same number of rows (here it is 2 + 3 = 5). In each block, the first two lines are headers; the next 3 rows have two columns, where the first column is the label, a number from 1 to 3. I want to sort the rows in each block based on the value of the first column (excluding the first two rows). So the expected result is

3 
0
1 0.8
2 0.5
3 0.2
3
1
1 0.4
2 0.1
3 0.8
3
2
1 0.8
2 0.4
3 0.3

I tried sort -k 1 -n file1, but that sorts the file as a whole and gives me the wrong result:

0
1
2
3
3
3
2 0.1
3 0.2
3 0.3
1 0.4
2 0.4
2 0.5
1 0.8
1 0.8
3 0.8

This is not the expected result.

How to sort each block separately is still a problem for me. I think awk might be able to do this. Please give some suggestions.

CodePudding user response:

Apply the DSU (Decorate/Sort/Undecorate) idiom using any awk, sort, and cut:

$ awk -v OFS='\t' '
    NF<pNF || NR==1 { blockNr++ }
    { print blockNr, NF, NR, (NF>1 ? $1 : NR), $0; pNF=NF }
' file |
sort -n -k1,1 -k2,2 -k4,4 -k3,3 |
cut -f5-
3
0
1 0.8
2 0.5
3 0.2
3
1
1 0.4
2 0.1
3 0.8
3
2
1 0.8
2 0.4
3 0.3

To understand what that's doing, just look at the first 2 steps:

$ awk -v OFS='\t' 'NF<pNF || NR==1{ blockNr++ } { print blockNr, NF, NR, (NF>1 ? $1 : NR), $0; pNF=NF }' file
1       1       1       1       3
1       1       2       2       0
1       2       3       2       2 0.5
1       2       4       1       1 0.8
1       2       5       3       3 0.2
2       1       6       6       3
2       1       7       7       1
2       2       8       2       2 0.1
2       2       9       3       3 0.8
2       2       10      1       1 0.4
3       1       11      11      3
3       1       12      12      2
3       2       13      1       1 0.8
3       2       14      2       2 0.4
3       2       15      3       3 0.3

$ awk -v OFS='\t' 'NF<pNF || NR==1{ blockNr++ } { print blockNr, NF, NR, (NF>1 ? $1 : NR), $0; pNF=NF }' file |
    sort -n -k1,1 -k2,2 -k4,4 -k3,3
1       1       1       1       3
1       1       2       2       0
1       2       4       1       1 0.8
1       2       3       2       2 0.5
1       2       5       3       3 0.2
2       1       6       6       3
2       1       7       7       1
2       2       10      1       1 0.4
2       2       8       2       2 0.1
2       2       9       3       3 0.8
3       1       11      11      3
3       1       12      12      2
3       2       13      1       1 0.8
3       2       14      2       2 0.4
3       2       15      3       3 0.3

Notice that the awk command is just creating the key values that sort needs: block number, field count, line number, and $1. So awk Decorates the input, sort Sorts it, and cut Undecorates it by removing the decoration columns that the awk script added.
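Putting the three stages together, here is a self-contained sketch of the pipeline above that you can run as-is (the /tmp file names are my choice for the demo, not part of the answer):

```shell
# Recreate the sample input from the question.
cat > /tmp/file1 <<'EOF'
3
0
2 0.5
1 0.8
3 0.2
3
1
2 0.1
3 0.8
1 0.4
3
2
1 0.8
2 0.4
3 0.3
EOF

# Decorate: prefix each line with block number, field count,
# line number, and the sort key ($1 for data rows, NR for headers).
awk -v OFS='\t' '
    NF<pNF || NR==1 { blockNr++ }
    { print blockNr, NF, NR, (NF>1 ? $1 : NR), $0; pNF=NF }
' /tmp/file1 |
# Sort: by block, headers before data, then by key, then by line number.
sort -n -k1,1 -k2,2 -k4,4 -k3,3 |
# Undecorate: drop the four helper columns.
cut -f5- > /tmp/file1.sorted

cat /tmp/file1.sorted
```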

CodePudding user response:

You can use asort() and arrays in GNU awk (gawk):

awk 'NF==1 && a[1]{
        n=asort(a);
        for(k=1; k<=n; k++){print a[k]};
        delete a; i=1
    }NF==1{print}
    NF==2{a[i]=$0; ++i}
    END{n=asort(a); for(k=1; k<=n; k++){print a[k]}}
' file1

you get

3
0
1 0.8
2 0.5
3 0.2
3
1
1 0.4
2 0.1
3 0.8
3
2
1 0.8
2 0.4
3 0.3
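Note that asort() is a gawk extension, so the script above needs GNU awk. As a portable alternative (my sketch, not part of either answer), plain POSIX awk can buffer each block's data rows and sort them itself, e.g. with a small insertion sort keyed on the numeric value of the first column:

```shell
# Recreate the sample input from the question.
cat > /tmp/file1 <<'EOF'
3
0
2 0.5
1 0.8
3 0.2
3
1
2 0.1
3 0.8
1 0.4
3
2
1 0.8
2 0.4
3 0.3
EOF

awk '
    function dump_block(   i, j, tmp) {
        # Insertion sort on the numeric value of column 1
        # (in awk, "2 0.5"+0 evaluates to 2).
        for (i = 2; i <= n; i++) {
            tmp = buf[i]
            for (j = i - 1; j >= 1 && buf[j]+0 > tmp+0; j--)
                buf[j+1] = buf[j]
            buf[j+1] = tmp
        }
        for (i = 1; i <= n; i++) print buf[i]
        n = 0
    }
    NF==1 { dump_block(); print; next }  # header: flush buffered block first
    { buf[++n] = $0 }                    # data row: buffer it
    END { dump_block() }                 # flush the last block
' /tmp/file1 > /tmp/file1.sorted2

cat /tmp/file1.sorted2
```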