How to get attribute values of preceding and following XML tags even when a sibling does not exist?

Time:12-19

Say you have a bunch of documents that could have any of the three following structures:

DOC1

    <text>
    
    <tagX id="ABB"> something </tagX> 
    <tagX id="BZX"> something </tagX> 
    <tagX id="CDN"> something </tagX> 
    
    </text>

DOC2

    <text>
    
    <tagX id="AA"> 
         <tagZ id="BC"> something </tagZ> 
    </tagX> 
    <tagZ id="C"> something </tagZ> 
    
    </text>

DOC3

    <text>
    
    <tagX id="BDD"> something </tagX> 
    <tagX id="CXC"> something </tagX> 
    <tagX id="D"> something </tagX> 
    
    </text>

The script I'm trying to develop should get the values of the 'id' attributes in the preceding and following XML tags whenever any tag has a value starting with 'B' for its 'id' attribute.

I have been successful using LXML for cases like DOC1 where the preceding and following XML tags are of the same kind. I used something along the lines of:

    for el in root.xpath('//tagX[starts-with(@id, "B")]'):
        prec_tag = el.xpath('preceding::tagX[1]')[0].get('id')
        foll_tag = el.xpath('following::tagX[1]')[0].get('id')


This was before I realized that, in some documents, the attributes I'm interested in sit inside a different kind of tag (as in DOC2), or that the matching tag has no other tag preceding it at all (as in DOC3).

I knew I had a problem when I got "IndexError: list index out of range". Then I realized I had some cases like those in DOC3.

The general fixes for IndexError that I found while searching don't work in this particular context, and I have not been able to find out how to get the attribute values when the preceding tag is not really a sibling. I've read the LXML documentation, but I can't figure out how to do this.

Is there any way in which I could get "ABB" and "CDN" for DOC1, "AA" and "C" for DOC2 and only "CXC" for DOC3?

Thanks in advance for your help.

CodePudding user response:

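Perhaps guard the lookup with a conditional expression, so that an empty result gives None instead of raising IndexError, e.g.

    prec_tag = el.xpath('preceding::tagX[1]')[0].get('id') if el.xpath('preceding::tagX[1]') else None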
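The guard above avoids the IndexError, but it would still miss "AA" in DOC2: the XPath `preceding` axis excludes ancestors, and there `tagX id="AA"` is the parent of the matching `tagZ`. One possible way around both problems, offered as a rough sketch rather than a definitive fix, is to list every element that carries an `id` in document order and read the neighbours by position. The helper name `neighbour_ids` and the inlined `DOCS` strings are made up for this illustration. The helper relies only on `iter()` and `get()`, which lxml shares with the standard library's ElementTree, hence the import fallback.

```python
try:
    from lxml import etree
except ImportError:  # the helper below only needs iter() and get()
    import xml.etree.ElementTree as etree

# The three sample documents from the question, inlined for the demo.
DOCS = {
    "DOC1": '<text><tagX id="ABB"> something </tagX>'
            '<tagX id="BZX"> something </tagX>'
            '<tagX id="CDN"> something </tagX></text>',
    "DOC2": '<text><tagX id="AA"><tagZ id="BC"> something </tagZ></tagX>'
            '<tagZ id="C"> something </tagZ></text>',
    "DOC3": '<text><tagX id="BDD"> something </tagX>'
            '<tagX id="CXC"> something </tagX>'
            '<tagX id="D"> something </tagX></text>',
}


def neighbour_ids(root):
    """Yield (id, previous_id, next_id) for every element whose id starts with 'B'.

    previous_id / next_id are None when there is no such neighbour.
    """
    # All elements that have an id, in document order, whatever their
    # tag name or nesting depth.
    tagged = [el for el in root.iter("*") if el.get("id") is not None]
    for i, el in enumerate(tagged):
        if el.get("id").startswith("B"):
            prev_id = tagged[i - 1].get("id") if i > 0 else None
            next_id = tagged[i + 1].get("id") if i + 1 < len(tagged) else None
            yield el.get("id"), prev_id, next_id


for name, xml in DOCS.items():
    print(name, list(neighbour_ids(etree.fromstring(xml))))
# DOC1 [('BZX', 'ABB', 'CDN')]
# DOC2 [('BC', 'AA', 'C')]
# DOC3 [('BDD', None, 'CXC')]
```

Because the neighbours are taken from a flat document-order list, a parent counts as the "preceding" tag of its first child, which is what the DOC2 expectation asks for. If only siblings or only non-ancestors should count, the list comprehension would need a different filter.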