Awk to get the attribute value from XML file

I need to get the value of the code attribute from each c tag in the XML below.

random.xml

<a>
    <b>
        <c id="123" code="abc" date="12-12-2022"/>
        <c id="123" code="efg" date="12-12-2022"/>
        <c id="123" date="12-12-2022"/>
    </b>
</a>

Currently the logic is:

cat random.xml | egrep "<c.*/>" | awk -F1 ' /code=/ {f=NR} f&&NR-1==f' RS='"'

How does the above logic work to get the values of code from tag c?

It produces the expected output:

abc
efg

CodePudding user response：

This might be an awk question but parsing XML should be done with XML tools.

Here's an example with Xidel (available for several operating systems) and a standard XPath expression:

xidel --xpath '//c[@code]/@code' random.xml

Note: //c[@code] selects the c nodes that have a code attribute, and .../@code outputs the value of the code attribute.

Output

abc
efg

CodePudding user response：

Firstly observe that

cat random.xml | egrep "<c.*/>" | awk -F1 ' /code=/ {f=NR} f&&NR-1==f'  RS='"'

is of dubious quality, because:

egrep does not need standard input; it can read the file itself, so this is a useless use of cat.
The pattern is simple enough to work equally well in plain grep, so there is no need to summon extended grep.
-F1 sets the character 1 as the field separator in awk, but the code makes no use of fields.

After fixing these issues, the code looks like this:

grep "<c.*/>" random.xml | awk ' /code=/ {f=NR} f&&NR-1==f'  RS='"'

How it works: grep selects the lines that contain <c followed by zero or more characters followed by />. RS='"' then tells awk that records are separated by double quotes ("). When a record contains code=, the variable f is set to that record's number. A record is printed when f is non-zero and equals the current record number minus one, which means that the printed records are exactly those that directly follow a record containing code=.
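To make the record splitting concrete, here is a small sketch that prints every record awk sees for one line of the sample. The variable names sample and records are only illustrative, and the input is fed without a trailing newline so that the last record stays clean.

```shell
# One line taken from the sample XML (hypothetical variable name)
sample='<c id="123" code="abc" date="12-12-2022"/>'

# With RS set to the double quote, every quoted value becomes its own record
records=$(printf '%s' "$sample" | awk '{ printf "%d: [%s]\n", NR, $0 }' RS='"')
printf '%s\n' "$records"
```

Record 3 is the one containing code=, so f becomes 3, and record 4 (abc) is the only one for which NR-1 equals f.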

Note that GNU AWK is poorly suited to working with XML, and using regular expressions against XML is a very poor idea, because XML is not a regular (Chomsky Type 3) language.

If possible, use proper tools for working with XML data. For example, hxselect could be used as follows. Let the content of file.xml be

<a>
    <b>
        <c id="123" code="abc" date="12-12-2022"/>
        <c id="123" code="efg" date="12-12-2022"/>
        <c id="123" date="12-12-2022"/>
    </b>
</a>

then

hxselect -c -s '\n' 'c[code]::attr(code)' < file.xml

gives output

abc
efg

Explanation: -c outputs just the value rather than name and value; -s '\n' separates the results with a newline, so each value is on its own line; c[code] is a CSS3 selector meaning any c tag with a code attribute; ::attr(code) is an hxselect feature meaning get the attribute named code. Note that this solution is more robust than the peculiar cat-egrep-awk pipeline, as it is immune to, for example, different whitespace usage in the file (whitespace outside tags in XML is optional).
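As a rough illustration of that fragility, the sketch below feeds the grep and awk pipeline a hypothetical variant of the sample in which the c tag is broken across two lines (XML allows a line break between attributes). The variable names xml and result are made up for the example.

```shell
# Hypothetical variant of the sample: same data, but the tag spans two lines
xml='<a><b><c id="123"
  code="abc"/></b></a>'

# No single line holds both "<c" and "/>", so grep passes nothing on to awk.
# "|| true" only guards against shells running with errexit/pipefail.
result=$(printf '%s\n' "$xml" | grep "<c.*/>" | awk '/code=/ {f=NR} f&&NR-1==f' RS='"') || true
printf 'pipeline output: [%s]\n' "$result"
```

The pipeline prints nothing for this input even though the code attribute is still there, which is the kind of breakage an XML-aware tool avoids.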

CodePudding user response：

If your input always looks like the sample XML, then you can make the code attribute itself a field separator, and < the record separator, so that you can easily extract the value as the second field when the first field is the tag name c:

awk -F' .*code="|" ' -vRS='<' '$1=="c"{print $2}'

Demo: https://awk.js.org/?snippet=Lz6yx7
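To see why the value ends up in the second field, this sketch prints every field of the matching record for one line of the sample. The variable names sample and fields are only illustrative, and -v RS is written with a space, which should behave the same as -vRS.

```shell
# One line taken from the sample XML (hypothetical variable name)
sample='<c id="123" code="abc" date="12-12-2022"/>'

# Same separators as in the answer; dump each field of the record whose $1 is "c"
fields=$(printf '%s' "$sample" |
  awk -F' .*code="|" ' -v RS='<' '$1=="c"{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"
```

Everything from the first space up to code=" is swallowed as one separator, and the closing quote plus space after the value is the next one, which leaves the tag name in $1 and the value in $2.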