I want to get the value of the code attribute from each c tag in the XML below.
random.xml
<a>
<b>
<c id="123" code="abc" date="12-12-2022"/>
<c id="123" code="efg" date="12-12-2022"/>
<c id="123" date="12-12-2022"/>
</b>
</a>
Currently the logic is:
cat random.xml | egrep "<c.*/>" | awk -F1 ' /code=/ {f=NR} f&&NR-1==f' RS='"'
How does the above logic work to get the values of code from tag c?
It produces the expected output:
abc
efg
CodePudding user response:
This might be an awk question, but parsing XML should be done with XML tools.
Here's an example with Xidel (available here for a few OSs) and a standard XPath expression:
xidel --xpath '//c[@code]/@code' random.xml
Note: //c[@code] selects the c nodes that have a code attribute, and .../@code outputs the value of the code attribute.
Output
abc
efg
CodePudding user response:
Firstly observe that
cat random.xml | egrep "<c.*/>" | awk -F1 ' /code=/ {f=NR} f&&NR-1==f' RS='"'
is of dubious quality, as
- egrep does not require standard input; it can read the file itself, so this is a useless use of cat
- the pattern used with egrep is simple and works equally well with plain grep, so there is no need to summon extended grep
- 1 is set as the field separator in awk, but the code does not make any use of the fields mechanism
After fixing these issues, the code looks like this:
grep "<c.*/>" random.xml | awk ' /code=/ {f=NR} f&&NR-1==f' RS='"'
How it works: grep selects the lines which contain <c followed by zero or more characters followed by />. awk is then instructed that records (rows) are separated by double quotes ("). When a record contains code=, the variable f is set to that record's number. A record is printed when f is non-zero and f equals the current record number minus one, which means: print the records that come directly after a record containing code=.
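To make the record splitting visible, here is a minimal sketch that prints the first few records awk sees once RS is a double quote, then runs the cleaned-up filter. The temporary file and the variable names records and values are only for illustration; the sample data is copied from the question.

```shell
# Sample input from the question, written to a temporary file.
xml=$(mktemp)
cat > "$xml" <<'EOF'
<a>
<b>
<c id="123" code="abc" date="12-12-2022"/>
<c id="123" code="efg" date="12-12-2022"/>
<c id="123" date="12-12-2022"/>
</b>
</a>
EOF

# Show the first five records, numbered, as awk splits them on double quotes.
records=$(grep "<c.*/>" "$xml" | awk 'NR<=5 {printf "%d:[%s]\n", NR, $0}' RS='"')
printf '%s\n' "$records"

# The filter itself: remember the record number where code= appears,
# then print the record that immediately follows it.
values=$(grep "<c.*/>" "$xml" | awk '/code=/ {f=NR} f&&NR-1==f' RS='"')
printf '%s\n' "$values"

rm -f "$xml"
```

Record 3 is the text ending in code= and record 4 is the attribute value, which is why the record after a match is the one that gets printed.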
Observe that GNU AWK is poorly suited for working with XML, and using regular expressions against XML is a very poor idea, as XML is not a Chomsky Type 3 (regular) language.
If possible, use proper tools for working with XML data. For example, hxselect might be used in the following way. Let the content of file.xml be
<a>
<b>
<c id="123" code="abc" date="12-12-2022"/>
<c id="123" code="efg" date="12-12-2022"/>
<c id="123" date="12-12-2022"/>
</b>
</a>
then
hxselect -c -s '\n' 'c[code]::attr(code)' < file.xml
gives output
abc
efg
Explanation: -c gets just the value rather than the name and value; -s '\n' separates results with a newline, i.e. each value will be on its own line; c[code] is a CSS3 selector meaning any c tag with a code attribute; ::attr(code) is an hxselect feature meaning get the attribute named code. Observe that this solution is more robust than the peculiar cat-egrep-awk pipeline, as it is immune to e.g. different whitespace usage in the file (whitespace outside tags in XML is optional).
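As a rough illustration of that fragility, the sketch below feeds the grep stage a hypothetical reformatted input in which the attributes of one c tag are wrapped onto separate lines. This layout is an assumption made for the example and is not from the question; it is still well-formed XML with the same data.

```shell
# Hypothetical input: same data, but the attributes are wrapped across lines.
wrapped=$(mktemp)
cat > "$wrapped" <<'EOF'
<a><b>
<c id="123"
   code="abc"
   date="12-12-2022"/>
</b></a>
EOF

# Count the lines that the grep stage lets through.
# "|| true" keeps the script going, since grep exits non-zero when nothing matches.
matched=$(grep -c "<c.*/>" "$wrapped" || true)
echo "lines matched by grep: $matched"

rm -f "$wrapped"
```

No single line contains both <c and />, so nothing reaches awk and the pipeline prints no values, whereas an XML-aware tool is not affected by where the line breaks fall.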
CodePudding user response:
If your input always looks like the sample XML, then you can make the code attribute itself a field separator, and < the record separator, so that you can easily extract the value as the second field when the first field is the tag name c:
awk -F' .*code="|" ' -vRS='<' '$1=="c"{print $2}'
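A minimal sketch of running this against the sample from the question, assuming the XML is saved to a temporary file. The second awk call is only there to show what the first field looks like for each c record; the variable names values and firsts are illustrative.

```shell
# Sample input from the question, written to a temporary file.
xml=$(mktemp)
cat > "$xml" <<'EOF'
<a>
<b>
<c id="123" code="abc" date="12-12-2022"/>
<c id="123" code="efg" date="12-12-2022"/>
<c id="123" date="12-12-2022"/>
</b>
</a>
EOF

# The command from the answer, with the file passed as an argument.
values=$(awk -F' .*code="|" ' -v RS='<' '$1=="c"{print $2}' "$xml")
printf '%s\n' "$values"

# For every record that starts with "c ", show the first field in brackets.
firsts=$(awk -F' .*code="|" ' -v RS='<' '/^c /{print "[" $1 "]"}' "$xml")
printf '%s\n' "$firsts"

rm -f "$xml"
```

For the tag without a code attribute, the only separator that matches is the quote followed by a space, so the first field is longer than c and the record is skipped.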