Awk to get the attribute value from XML file

Time:01-11

I need to get the value of the code attribute from each c tag in the XML file shown below.

random.xml

<a>
    <b>
        <c id="123" code="abc" date="12-12-2022"/>
        <c id="123" code="efg" date="12-12-2022"/>
        <c id="123" date="12-12-2022"/>
    </b>
</a>

Currently the logic is:

cat random.xml | egrep "<c.*/>" | awk -F1 ' /code=/ {f=NR} f&&NR-1==f' RS='"'

How does the above logic work to get the values of code from tag c?

It produces the expected output:

abc
efg

CodePudding user response:

This might be an awk question but parsing XML should be done with XML tools.

Here's an example with Xidel (available for several operating systems) and a standard XPath expression:

xidel --xpath '//c[@code]/@code' random.xml

Note: //c[@code] selects the c nodes that have a code attribute, and .../@code outputs the value of that code attribute.

Output
abc
efg

CodePudding user response:

First, observe that

cat random.xml | egrep "<c.*/>" | awk -F1 ' /code=/ {f=NR} f&&NR-1==f'  RS='"'

is of dubious quality, because

  • egrep does not require standard input; it can read the file itself, so this is a useless use of cat
  • the pattern is simple enough to work equally well in plain grep, so invoking extended grep (egrep) is overkill
  • 1 is set as the field separator in awk, but the code makes no use of fields

After fixing these issues, the code looks like this:

grep "<c.*/>" random.xml | awk ' /code=/ {f=NR} f&&NR-1==f'  RS='"'
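As a quick check that dropping -F1 changes nothing, the following sketch runs the pipeline with and without it on the sample data and compares the results. The temporary file and the variable names (tmp, with_f, without_f) are only illustrative and are not part of the original question.

```shell
# Recreate the sample file from the question in a temporary location.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<a>
    <b>
        <c id="123" code="abc" date="12-12-2022"/>
        <c id="123" code="efg" date="12-12-2022"/>
        <c id="123" date="12-12-2022"/>
    </b>
</a>
EOF

# Pipeline that still carries the unused -F1 option.
with_f=$(grep "<c.*/>" "$tmp" | awk -F1 ' /code=/ {f=NR} f&&NR-1==f' RS='"')

# Simplified pipeline without -F1.
without_f=$(grep "<c.*/>" "$tmp" | awk ' /code=/ {f=NR} f&&NR-1==f' RS='"')

rm -f "$tmp"

# Show the extracted values and report whether both variants agree.
printf '%s\n' "$without_f"
[ "$with_f" = "$without_f" ] && echo "same output"
```

Both variants should print the same two values, since the program only tests whole records and never refers to $1, $2, and so on.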

How it works: grep selects the lines that contain <c followed by zero or more characters followed by />. awk is then told, via RS='"', that records are separated by double quotes ("). When a record contains code=, the variable f is set to that record's number. A record is printed when f is non-zero and equal to the current record number minus one. In other words, awk prints each record that comes directly after a record containing code=, and that record is the attribute value.
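To make the record splitting visible, here is a small sketch that feeds one line of the sample to awk and prints every record together with its number, then applies the same rule as the pipeline. The variable names (line, records, value) are chosen for illustration only.

```shell
# One line taken from random.xml.
line='<c id="123" code="abc" date="12-12-2022"/>'

# Show each record awk sees when RS is a double quote, prefixed with its NR.
records=$(printf '%s\n' "$line" | awk -v RS='"' '{ printf "%d:%s\n", NR, $0 }')
printf '%s\n' "$records"

# Same rule as in the pipeline: remember the NR of the record containing code=,
# then print the record that immediately follows it.
value=$(printf '%s\n' "$line" | awk -v RS='"' '/code=/ { f = NR } f && NR - 1 == f')
printf '%s\n' "$value"
```

Record 3 is " code=" and record 4 is "abc", so the rule f&&NR-1==f fires exactly on the attribute value.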

Note that GNU AWK is poorly suited to working with XML, and applying regular expressions to XML is a very poor idea, because XML is not a regular (Chomsky Type 3) language.

If possible, use proper tools for working with XML data. For example, hxselect can be used as follows. Let the content of file.xml be

<a>
    <b>
        <c id="123" code="abc" date="12-12-2022"/>
        <c id="123" code="efg" date="12-12-2022"/>
        <c id="123" date="12-12-2022"/>
    </b>
</a>

then

hxselect -c -s '\n' 'c[code]::attr(code)' < file.xml

gives output

abc
efg

Explanation: -c prints just the value rather than the name and value; -s '\n' separates the results with a newline, so each value is on its own line; c[code] is a CSS3 selector meaning any c tag with a code attribute; ::attr(code) is an hxselect feature meaning "get the attribute named code". Note that this solution is more robust than the cat-egrep-awk pipeline, because it is immune to, for example, different whitespace usage in the file (whitespace outside tags in XML is optional, and whitespace between attributes may include line breaks).
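The fragility of the line-based pipeline can be sketched with a hypothetical variant of the sample in which the first c tag is split across two lines. This is still valid XML, but grep works line by line and no longer sees <c and /> on the same line. The variable names (xml, found) are illustrative.

```shell
# Hypothetical variant of the sample: same data, first tag split over two lines.
xml='<a><b>
<c id="123"
   code="abc" date="12-12-2022"/>
<c id="123" code="efg" date="12-12-2022"/>
</b></a>'

# The grep | awk pipeline from above, reading from standard input.
found=$(printf '%s\n' "$xml" | grep "<c.*/>" | awk ' /code=/ {f=NR} f&&NR-1==f' RS='"')
printf '%s\n' "$found"
```

Here only efg is printed; the value abc is silently lost, whereas a real XML parser treats both layouts the same.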

CodePudding user response:

If your input always looks like the sample XML, then you can make the text up to and including code=" a field separator, and < the record separator, so that you can easily extract the value as the second field when the first field is the tag name c:

awk -F' .*code="|" ' -v RS='<' '$1=="c"{print $2}' random.xml

Demo: https://awk.js.org/?snippet=Lz6yx7
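To see why the $1=="c" test filters out the tag without a code attribute, this sketch prints the first field of every record that starts with c, and then the extracted values. The sample is inlined in a shell variable, and the names xml, first_fields and values are illustrative.

```shell
# Sample data from the question, inlined for a self-contained run.
xml='<a>
    <b>
        <c id="123" code="abc" date="12-12-2022"/>
        <c id="123" code="efg" date="12-12-2022"/>
        <c id="123" date="12-12-2022"/>
    </b>
</a>'

# First field of each record that begins with "c ", shown in brackets.
first_fields=$(printf '%s\n' "$xml" |
  awk -F' .*code="|" ' -v RS='<' '/^c / { print "[" $1 "]" }')
printf '%s\n' "$first_fields"

# The actual extraction: second field of records whose first field is exactly c.
values=$(printf '%s\n' "$xml" |
  awk -F' .*code="|" ' -v RS='<' '$1=="c" { print $2 }')
printf '%s\n' "$values"
```

For the two tags with a code attribute, the separator swallows everything between the tag name and the value, leaving c as the first field. For the third tag, the only separator that matches is the quote-plus-space after 123, so its first field is c id="123 and the record is skipped.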
