python accessing the attributes of tags parsing XML with xpath

Time:02-26

I am parsing an XML file with this shape:

from lxml import etree
mystring='''<div n="0001" type="doc" xml:id="_3168060002">
<p xml:id="_3168060003">[car 1] Séquence préparatoire pour <p xml:id="_3168060005">a) la définition </p></p></p></div>
<div n="0002" type="doc" xml:id="_3168060012"><p xml:id="_3168060003">[blue] la voiture pour <p xml:id="_3168060005">a) la définition </p></p></p></div>'''

I would like to capture everything inside each p tag that directly follows a div, but also the n attribute of that div. My parsing strategy is as follows:

parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)
XML_tree = etree.fromstring(mystring.encode(), parser=parser)
paragraphs = './/div[@n]/p[@xml:id]'
xml_query = paragraphs
all_paras = XML_tree.xpath(xml_query)
for para in all_paras:
    print(para.tag)

It works, but I don't know how to extract both the contents of the p tag and the n attribute of the div at the same time, since the tag and attributes of each matched element are those of p, not div.

Any idea how I can access the attributes of an element's parent?

Thanks.

CodePudding user response:

Consider running XPath at the <div> level, then extracting the child <p> text and the @n attribute separately. The code below uses a list comprehension to build a list of dictionaries holding the needed items. The example XML was also fixed by adding a root element and removing the extra </p> closing tag:

from lxml import etree

mystring='''\
<root>
    <div n="0001" type="doc" xml:id="_3168060002">
       <p xml:id="_3168060003">[car 1] Séquence préparatoire pour <p xml:id="_3168060005">a) la définition </p></p>
    </div>
    <div n="0002" type="doc" xml:id="_3168060012">
       <p xml:id="_3168060003">[blue] la voiture pour <p xml:id="_3168060005">a) la définition </p></p>
    </div>
</root>'''

parser = etree.XMLParser(
    resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True
)
XML_tree = etree.fromstring(mystring, parser=parser)

all_divs = XML_tree.xpath('.//div')
all_divs

div_dict = [
    {'div': div.find("p").text if div.find("p") is not None else None,
     'n': div.attrib["n"]} 
    for div in all_divs
]
    
div_dict
# [{'div': '[car 1] Séquence préparatoire pour ', 'n': '0001'},
#  {'div': '[blue] la voiture pour ', 'n': '0002'}]
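
To answer the original question more literally: lxml elements expose a getparent() method, so the XPath from the question can stay as it is and the n attribute can be read from each matched element's parent. The following is only a minimal sketch; the shortened XML string and the names xml, root and rows are made up for illustration.

```python
from lxml import etree

# Shortened, well-formed stand-in for the XML in the question (illustrative only).
xml = (
    '<root>'
    '<div n="0001" type="doc"><p xml:id="a1">first <p xml:id="a2">nested</p></p></div>'
    '<div n="0002" type="doc"><p xml:id="b1">second</p></div>'
    '</root>'
)
root = etree.fromstring(xml)

rows = []
# Same query as in the question: <p> elements that are direct children of <div n=...>.
for para in root.xpath('.//div[@n]/p[@xml:id]'):
    parent = para.getparent()                  # the enclosing <div>
    rows.append((parent.get('n'), para.text))  # (n attribute, text before any nested tag)

print(rows)
```

Note that para.text only holds the text up to the first nested element, just like div.find("p").text above.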

CodePudding user response:

A simple alternative:

for car in XML_tree.xpath('//div[@n]'):
    print(car.xpath('@n')[0], car.xpath('normalize-space(.//p[@*[local-name()="id"]]/text())'))

Output:

0001 [car 1] Séquence préparatoire pour
0002 [blue] la voiture pour
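
Both outputs above contain only the text that precedes the nested <p>. If the nested text is wanted as well, one possible approach is to join everything returned by itertext() and reach the parent's attribute with ../@n from inside XPath. This is a sketch under the assumption that whitespace can be collapsed; the sample XML and the names xml, root and results are hypothetical.

```python
from lxml import etree

# Small illustrative document with a nested <p>, mirroring the shape in the question.
xml = '<root><div n="0001"><p xml:id="a1">outer <p xml:id="a2">inner </p></p></div></root>'
root = etree.fromstring(xml)

results = []
for para in root.xpath('.//div[@n]/p'):
    # "../@n" selects the n attribute of the parent <div>; string() turns it into text.
    n = para.xpath('string(../@n)')
    # itertext() yields every text node below <p>, including nested elements;
    # split/join collapses runs of whitespace into single spaces.
    full_text = ' '.join(''.join(para.itertext()).split())
    results.append((n, full_text))

print(results)
```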