Sorting a file using fields with specific value-CodePudding

Recently, I had to sort several files according to records' ID; the catch was that there can be several types of records, and in each of those the field I had to use for sorting is on a different position. The fields, however, are easily identifiable thanks to key=value structure. To show a simple sample of the general structure:

fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3

I came up with a pipeline as follows, which did the job:

awk -F'[|=]' '{for(i=1; i<=NF; i  ) {if($i ~ "id") {i  ; print $i"?"$0} }}' tester.txt | sort -n | awk -F'?' '{print $2}'

In other words the algorithm is as follows:

Split the record by both field and key-value separators (| and =)
Iterate through the elements and search for the id key
Print the next element (value of id key), a separator, and the whole line
Sort numerically
Remove prepended identifier to preserve records' structure

Processing the sample gives the output:

fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3

Is there a way, though, to do this task using single awk command?

CodePudding user response：

You may try this gnu-awk code to to this in a single command:

awk -F'|' '{
   for(i=1; i<=NF;   i)
      if ($i ~ /^id=/) {
         a[gensub(/^id=/, "", 1, $i)] = $0
         break
      }
}
END {
   PROCINFO["sorted_in"] = "@ind_num_asc"
   for (i in a)
      print a[i]
}' file

fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3

We are using | as field delimiter and when there is a column name starting with id= we store it in array a with index as text after = and value as the full record.

Using PROCINFO["sorted_in"] = "@ind_num_asc" we sort array a using numerical value of index and then in for loop we print value part to get the sorted output.

CodePudding user response：

With your shown samples only, please try following(awk sort cut) solution, written and tested in GNU awk, should work in any awk.

awk '
match($0,/id=[0-9] /){
  print substr($0,RSTART,RLENGTH)";"$0
}
' Input_file | sort -t'=' -k2n | cut -d';' -f2-

Explanation: Adding detailed explanation for above code.

awk '                                   ##Starting awk program from here.
match($0,/id=[0-9] /){                  ##Using awk match function to match id= followed by digits.
  print substr($0,RSTART,RLENGTH)";"$0  ##printing sub string of matched value followed by current line along with semi-colon in it.
}
' Input_file    |                       ##Mentioning Input_file here and passing awk output as a standard input to next command.
sort -t'=' -k2n |                       ##Sorting output with delimiter of = and by 2nd field then passing output to next command as an input.
cut -d';' -f2-                          ##Using cut command making delimiter as ; and printing everything from 2nd field onwards.

CodePudding user response：

Using GNU awk for the 3rd arg to match() and sorted_in:

$ cat tst.awk
match($0,/(^|\|)id=([0-9] )/,a) {
    ids2vals[a[2]] = $0
}
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for ( id in ids2vals ) {
        print ids2vals[id]
    }
}

$ awk -f tst.awk file
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3

CodePudding user response：

Try Perl: perl -e 'print map { s/^.*? //; $_ } sort { $a <=> $b } map { ($id) = /id=(\d )/; "$id $_" } <>' file

Some explanation of the code I use:

print #print the resulting list of lines
    map {
        s/^.*? //;
        $_
    } #remove numeric id from start of line
    sort { $a <=> $b } #sort numerically
    map {
        ($id) = /id=(\d )/;
        "$id $_"
    } # capture id and place it in start of line
    <> # read all lines from file

Or try sed and sort: sed 's/^$.*id=\([0-9][0-9]*$.*\)$/\2 \1/' file | sort -n | sed 's/^[^ ][^ ]* //'