Python, ElementTree: Find specific content in XML tag?

Time:07-18

I'm trying to do something I thought should be very simple in ElementTree: find elements with specific tag content. The docs give the example:

[tag='text'] Selects all elements that have a child named tag whose complete text content, including descendants, equals the given text.

Which seems straightforward enough. However, it does not work as I expect. Suppose I want to find all examples of <note>NEW</note>. The following complete example:

#!/usr/bin/env python
import xml.etree.ElementTree as ET

xml = """<?xml version="1.0"?>
<entry>
<foo>blah</foo>
<foo>bblic</foo>
<foo>fjdks<note>NEW</note></foo>
<foo>fdfsd</foo>
<foo>ljklj<note>NEW</note></foo>
</entry>
"""

root = ET.fromstring(xml)

print("Number of 'foo' elements: %d" % len(root.findall('.//foo')))
print("Number of new 'foo' elements: %d" % len(root.findall('.//[note="NEW"]')))

Yields:

$ python foo.py 
Number of 'foo' elements: 5
Traceback (most recent call last):
  File "/usr/lib/python3.10/xml/etree/ElementPath.py", line 370, in iterfind
    selector = _cache[cache_key]
KeyError: ('.//[note="NEW"]',)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/foo.py", line 17, in <module>
    print("Number of new 'foo' elements: %d" % len(root.findall('.//[note="NEW"]')))
  File "/usr/lib/python3.10/xml/etree/ElementPath.py", line 411, in findall
    return list(iterfind(elem, path, namespaces))
  File "/usr/lib/python3.10/xml/etree/ElementPath.py", line 384, in iterfind
    selector.append(ops[token[0]](next, token))
  File "/usr/lib/python3.10/xml/etree/ElementPath.py", line 193, in prepare_descendant
    raise SyntaxError("invalid descendant")
SyntaxError: invalid descendant

How am I meant to do this simple task?

CodePudding user response:

The docs also say that

Predicates (expressions within square brackets) must be preceded by a tag name, an asterisk, or another predicate.

Taking this into account,

root.findall('.//[note="NEW"]')

is illegal. You should add * before [ to match any tag, i.e.

root.findall('.//*[note="NEW"]')

or use a tag name before [ to match only that tag, i.e.

root.findall('.//foo[note="NEW"]')
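As a quick check of both forms, here is a minimal self-contained sketch that reuses the XML from the question. The variable names any_tag and foo_only are just for illustration.

```python
import xml.etree.ElementTree as ET

xml = """<?xml version="1.0"?>
<entry>
<foo>blah</foo>
<foo>bblic</foo>
<foo>fjdks<note>NEW</note></foo>
<foo>fdfsd</foo>
<foo>ljklj<note>NEW</note></foo>
</entry>
"""

root = ET.fromstring(xml)

# '*' before the predicate: any descendant element with a <note>NEW</note> child
any_tag = root.findall('.//*[note="NEW"]')

# tag name before the predicate: only <foo> elements with such a child
foo_only = root.findall('.//foo[note="NEW"]')

print(len(any_tag))                    # 2
print(len(foo_only))                   # 2
print([foo.text for foo in foo_only])  # ['fjdks', 'ljklj']
```

With this input both paths select the same two elements, since only foo elements have a note child. The * form would also pick up other parent tags if the document had them.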

CodePudding user response:

The main problem seems to be an expectation that the second search depends on the first one. It does not: each findall call is evaluated on its own, so the path has to name the tag (or *) itself.

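This works (the [tag="text"] predicate used here does not need a recent Python; per the docs, only the != variants of the predicates were added in Python 3.10):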

for foo in root.findall('.//foo[note="NEW"]'):
    print(foo.text)
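The docs' wording "complete text content, including descendants" can be illustrated with a small sketch. The XML below is made up for this example and is not from the question. As far as I can tell, the comparison is an exact string match against the concatenated text of the child, so surrounding whitespace counts.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, only for illustrating the predicate's matching rule
sample = ET.fromstring(
    "<entry>"
    "<foo>a<note>NEW</note></foo>"          # plain text: matches
    "<foo>b<note>N<b>EW</b></note></foo>"   # text split across a nested tag: matches
    "<foo>c<note> NEW </note></foo>"        # extra whitespace: no match
    "<foo>d<note>OLD</note></foo>"          # different text: no match
    "</entry>"
)

matches = [foo.text for foo in sample.findall('.//foo[note="NEW"]')]
print(matches)  # ['a', 'b']
```

If the real data may contain padding around the text, one option is to select './/foo' and filter in Python with something like note.text.strip() instead of relying on the predicate.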