Home > Net >  Sorting a file using fields with specific value
Sorting a file using fields with specific value

Time:05-19

Recently, I had to sort several files according to records' ID; the catch was that there can be several types of records, and in each of those the field I had to use for sorting is on a different position. The fields, however, are easily identifiable thanks to key=value structure. To show a simple sample of the general structure:

fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3

I came up with a pipeline as follows, which did the job:

awk -F'[|=]' '{for(i=1; i<=NF; i  ) {if($i ~ "id") {i  ; print $i"?"$0} }}' tester.txt | sort -n | awk -F'?' '{print $2}'

In other words the algorithm is as follows:

  1. Split the record by both field and key-value separators (| and =)
  2. Iterate through the elements and search for the id key
  3. Print the next element (value of id key), a separator, and the whole line
  4. Sort numerically
  5. Remove prepended identifier to preserve records' structure

Processing the sample gives the output:

fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3

Is there a way, though, to do this task using single awk command?

CodePudding user response:

You may try this gnu-awk code to to this in a single command:

awk -F'|' '{
   for(i=1; i<=NF;   i)
      if ($i ~ /^id=/) {
         a[gensub(/^id=/, "", 1, $i)] = $0
         break
      }
}
END {
   PROCINFO["sorted_in"] = "@ind_num_asc"
   for (i in a)
      print a[i]
}' file

fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3

We are using | as field delimiter and when there is a column name starting with id= we store it in array a with index as text after = and value as the full record.

Using PROCINFO["sorted_in"] = "@ind_num_asc" we sort array a using numerical value of index and then in for loop we print value part to get the sorted output.

CodePudding user response:

With your shown samples only, please try following(awk sort cut) solution, written and tested in GNU awk, should work in any awk.

awk '
match($0,/id=[0-9] /){
  print substr($0,RSTART,RLENGTH)";"$0
}
' Input_file | sort -t'=' -k2n | cut -d';' -f2-

Explanation: Adding detailed explanation for above code.

awk '                                   ##Starting awk program from here.
match($0,/id=[0-9] /){                  ##Using awk match function to match id= followed by digits.
  print substr($0,RSTART,RLENGTH)";"$0  ##printing sub string of matched value followed by current line along with semi-colon in it.
}
' Input_file    |                       ##Mentioning Input_file here and passing awk output as a standard input to next command.
sort -t'=' -k2n |                       ##Sorting output with delimiter of = and by 2nd field then passing output to next command as an input.
cut -d';' -f2-                          ##Using cut command making delimiter as ; and printing everything from 2nd field onwards.     

CodePudding user response:

Using GNU awk for the 3rd arg to match() and sorted_in:

$ cat tst.awk
match($0,/(^|\|)id=([0-9] )/,a) {
    ids2vals[a[2]] = $0
}
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for ( id in ids2vals ) {
        print ids2vals[id]
    }
}

$ awk -f tst.awk file
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3

CodePudding user response:

Try Perl: perl -e 'print map { s/^.*? //; $_ } sort { $a <=> $b } map { ($id) = /id=(\d )/; "$id $_" } <>' file

Some explanation of the code I use:

print #print the resulting list of lines
    map {
        s/^.*? //;
        $_
    } #remove numeric id from start of line
    sort { $a <=> $b } #sort numerically
    map {
        ($id) = /id=(\d )/;
        "$id $_"
    } # capture id and place it in start of line
    <> # read all lines from file

Or try sed and sort: sed 's/^\(.*id=\([0-9][0-9]*\).*\)$/\2 \1/' file | sort -n | sed 's/^[^ ][^ ]* //'

  • Related