How to extract mixed/partly absent records to a defined order with awk

Time:09-26

I have the following data (it also contains other lines, here is a meaningful extract):

group
bb 1
cc 1
dd 1
end
group
dd 2
bb 2
end
group
aa 3
end

I don't know the values (like "1", "2", etc.) in advance and have to match by the names (the generic "group", "aa", etc.). I want to get the data filtered and sorted in the following order (with an empty tab-separated field when a name is absent):

group       bb 1    cc 1    dd 1
group       bb 2            dd 2
group   aa 3            

I run:

awk 'BEGIN {ORS = "\t"}\
/^group/ {print "\n" $0}; \
/^aa/ {AA = $0}; \
/^bb/ {BB = $0}; \
/^cc/ {CC = $0}; \
/^dd/ {DD = $0}; \
/^end/ {print AA; print BB; print CC; print DD}' test.txt

and get

group       bb 1    cc 1    dd 1
group       bb 2    **cc 1**    dd 2
group   aa 3    **bb 2**    **cc 1**    **dd 2**

which is in the right order, but the data is wrong (marked with asterisks). What is the correct way to do this filtering? Thanks!

CodePudding user response:

Assumptions:

  • input lines do not start with any white space
  • each ^group has a matching ^end
  • the first line in the file is ^group
  • the last line in the file is ^end
  • there are no lines (to ignore) between ^end and the next ^group

The primary issue is that the variables are never reset: each time group is seen we need to clear them, otherwise values from the previous group carry over into the current one.

Other (minor) issues:

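  • ORS vs OFS: ORS is appended after every print, whereas OFS separates the arguments of a single print, which is what is wanted here
  • multiple print commands vs a single print command
  • no need for line continuation characters (\) inside the single-quoted awk script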
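
To illustrate the ORS vs OFS point, here is a minimal sketch (the function names ors_demo and ofs_demo are made up for this illustration). ORS is written after every print statement, while OFS is written between the comma-separated arguments of a single print:

```shell
# ORS is appended after each print, so every item is followed by a tab
# and the output never ends with a newline.
ors_demo() {
  awk 'BEGIN { ORS = "\t"; print "a"; print "b"; print "c" }'
}

# OFS goes between the arguments of one print; ORS keeps its default newline.
ofs_demo() {
  awk 'BEGIN { OFS = "\t"; print "a", "b", "c" }'
}

# Dump both outputs byte by byte to make the tabs and newlines visible.
ors_demo | od -c
ofs_demo | od -c
```

With ORS set to a tab, the row has to be terminated by hand (hence the "\n" printed in front of each group in the question); with OFS, a single print produces one complete tab-separated row.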

One idea for an updated awk script:

awk '
BEGIN    { OFS="\t" }
/^group/ { AA=BB=CC=DD="" ; next }
/^aa/    { AA=$0          ; next }
/^bb/    { BB=$0          ; next }
/^cc/    { CC=$0          ; next }
/^dd/    { DD=$0          ; next }
/^end/   { print "group",AA,BB,CC,DD }
' test.txt

NOTES: the ; next clauses are optional; they are included as a visual reminder that, once a line has matched, the rest of the script does not need to be considered for that line.

This generates:

group           bb 1    cc 1    dd 1
group           bb 2            dd 2
group   aa 3
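
If the list of names grows, one variable per name gets unwieldy. A possible generalization, sketched under the same assumptions as above, keeps the values in an array and passes the wanted column order as a string. The function name group_rows and the awk variable order are hypothetical names chosen for this example. Note that it matches on the exact first field ($1) rather than on a line prefix, which differs slightly from the /^aa/ style patterns:

```shell
group_rows() {
  awk -v order="aa bb cc dd" '
  BEGIN    { OFS = "\t"; n = split(order, key, " ") }   # key[1..n] holds the column order
  /^group/ { split("", val); next }                     # portable way to empty the array
  /^end/   {
      row = "group"
      for (i = 1; i <= n; i++)                          # absent names yield an empty field
          row = row OFS val[key[i]]
      print row
      next
  }
  { val[$1] = $0 }                                      # remember the line under its first field
  ' "$@"
}
```

Calling group_rows test.txt should give the same rows as the script above; changing the order string (for example to "dd cc bb aa") changes the column order without touching the rest of the script.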