Recently, I had to sort several files according to records' ID; the catch was that there can be several types of records, and in each of those the field I had to use for sorting is on a different position. The fields, however, are easily identifiable thanks to key=value
structure. To show a simple sample of the general structure:
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3
I came up with a pipeline as follows, which did the job:
awk -F'[|=]' '{for(i=1; i<=NF; i ) {if($i ~ "id") {i ; print $i"?"$0} }}' tester.txt | sort -n | awk -F'?' '{print $2}'
In other words the algorithm is as follows:
- Split the record by both field and key-value separators (
|
and=
) - Iterate through the elements and search for the
id
key - Print the next element (value of
id
key), a separator, and the whole line - Sort numerically
- Remove prepended identifier to preserve records' structure
Processing the sample gives the output:
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3
Is there a way, though, to do this task using single awk
command?
CodePudding user response:
You may try this gnu-awk
code to to this in a single command:
awk -F'|' '{
for(i=1; i<=NF; i)
if ($i ~ /^id=/) {
a[gensub(/^id=/, "", 1, $i)] = $0
break
}
}
END {
PROCINFO["sorted_in"] = "@ind_num_asc"
for (i in a)
print a[i]
}' file
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3
We are using |
as field delimiter and when there is a column name starting with id=
we store it in array a
with index as text after =
and value as the full record.
Using PROCINFO["sorted_in"] = "@ind_num_asc"
we sort array a
using numerical value of index and then in for loop we print value part to get the sorted output.
CodePudding user response:
With your shown samples only, please try following(awk
sort
cut
) solution, written and tested in GNU awk
, should work in any awk
.
awk '
match($0,/id=[0-9] /){
print substr($0,RSTART,RLENGTH)";"$0
}
' Input_file | sort -t'=' -k2n | cut -d';' -f2-
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
match($0,/id=[0-9] /){ ##Using awk match function to match id= followed by digits.
print substr($0,RSTART,RLENGTH)";"$0 ##printing sub string of matched value followed by current line along with semi-colon in it.
}
' Input_file | ##Mentioning Input_file here and passing awk output as a standard input to next command.
sort -t'=' -k2n | ##Sorting output with delimiter of = and by 2nd field then passing output to next command as an input.
cut -d';' -f2- ##Using cut command making delimiter as ; and printing everything from 2nd field onwards.
CodePudding user response:
Using GNU awk for the 3rd arg to match()
and sorted_in
:
$ cat tst.awk
match($0,/(^|\|)id=([0-9] )/,a) {
ids2vals[a[2]] = $0
}
END {
PROCINFO["sorted_in"] = "@ind_num_asc"
for ( id in ids2vals ) {
print ids2vals[id]
}
}
$ awk -f tst.awk file
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3
CodePudding user response:
Try Perl: perl -e 'print map { s/^.*? //; $_ } sort { $a <=> $b } map { ($id) = /id=(\d )/; "$id $_" } <>' file
Some explanation of the code I use:
print #print the resulting list of lines
map {
s/^.*? //;
$_
} #remove numeric id from start of line
sort { $a <=> $b } #sort numerically
map {
($id) = /id=(\d )/;
"$id $_"
} # capture id and place it in start of line
<> # read all lines from file
Or try sed
and sort
: sed 's/^\(.*id=\([0-9][0-9]*\).*\)$/\2 \1/' file | sort -n | sed 's/^[^ ][^ ]* //'