I have the following data (it also contains other lines, here is a meaningful extract):
group
bb 1
cc 1
dd 1
end
group
dd 2
bb 2
end
group
aa 3
end
I don't know the values (like "1", "2", etc.) in advance and have to match by the names (generic "group", "aa", etc.). I want the data filtered and sorted in the following order, with an empty tab-separated field wherever a string is absent:
group bb 1 cc 1 dd 1
group bb 2 dd 2
group aa 3
I run:
awk 'BEGIN {ORS = "\t"}\
/^group/ {print "\n" $0}; \
/^aa/ {AA = $0}; \
/^bb/ {BB = $0}; \
/^cc/ {CC = $0}; \
/^dd/ {DD = $0}; \
/^end/ {print AA; print BB; print CC; print DD}' test.txt
and get
group bb 1 cc 1 dd 1
group bb 2 **cc 1** dd 2
group aa 3 **bb 2** **cc 1** **dd 2**
which is in the right order, but the data is wrong (marked with asterisks). What is the correct way to do this filtering? Thanks!
Answer:
Assumptions:
- input lines do not start with any white space
- each ^group has a matching ^end
- the first line in the file is ^group
- the last line in the file is ^end
- there are no lines (to ignore) between ^end and the next ^group
The primary issue is that each time group is seen we need to clear/reset the other variables; otherwise we carry over the values from the previous group.
Other (minor) issues:
- ORS vs OFS
- multiple print commands vs a single print command
- no need for line continuation characters (\)
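To illustrate the ORS vs OFS point, here is a small sketch. The sample strings are borrowed from the question, and piping through tr to show each tab as | is only there to make the separators visible. OFS is placed between the comma-separated arguments of a single print, while ORS is appended after every print, so the ORS approach leaves a trailing tab and no newline:

```shell
# OFS joins the comma-separated arguments of one print; the record still ends with a newline.
with_ofs=$(awk 'BEGIN { OFS = "\t"; print "group", "bb 1", "cc 1" }' | tr '\t' '|')

# ORS is appended after every print, so three prints yield three tabs and no newline.
with_ors=$(awk 'BEGIN { ORS = "\t"; print "group"; print "bb 1"; print "cc 1" }' | tr '\t' '|')

printf '%s\n' "$with_ofs" "$with_ors"
# group|bb 1|cc 1
# group|bb 1|cc 1|
```

This is also why the original script had to print "\n" by hand in front of each group.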
One idea for an updated awk script:
awk '
BEGIN { OFS="\t" }
/^group/ { AA=BB=CC=DD="" ; next }
/^aa/ { AA=$0 ; next }
/^bb/ { BB=$0 ; next }
/^cc/ { CC=$0 ; next }
/^dd/ { DD=$0 ; next }
/^end/ { print "group",AA,BB,CC,DD }
' test.txt
NOTE: the ; next clauses are optional; they are included as a visual reminder that, once a line has matched, we don't need to worry about the rest of the script (for the current line).
This generates:
group bb 1 cc 1 dd 1
group bb 2 dd 2
group aa 3
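Empty columns are hard to see once the output is pasted as text, so here is a self-contained way to check them. This is only a sketch: it assumes the sample data from the question is fed in through a here-document instead of test.txt, and it pipes the result through tr so that every tab shows up as |.

```shell
result=$(awk '
BEGIN    { OFS = "\t" }
/^group/ { AA = BB = CC = DD = ""; next }    # new group: forget the previous values
/^aa/    { AA = $0; next }
/^bb/    { BB = $0; next }
/^cc/    { CC = $0; next }
/^dd/    { DD = $0; next }
/^end/   { print "group", AA, BB, CC, DD }   # one print, fields joined by OFS
' <<'EOF' | tr '\t' '|'
group
bb 1
cc 1
dd 1
end
group
dd 2
bb 2
end
group
aa 3
end
EOF
)

printf '%s\n' "$result"
# group||bb 1|cc 1|dd 1
# group||bb 2||dd 2
# group|aa 3|||
```

Each row has exactly four separators, so a missing aa, bb, cc or dd shows up as an empty field in a fixed position.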