How to extract element from two similar tags in one code?

Time:03-19

I am trying to extract the year from multiple XML files. Initially, the XML files looked like this:

<Return xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2018v3.2">
  <ReturnHeader binaryAttachmentCnt="0">
    <!-- ... -->
    <TaxPeriodEndDt>2019-09-30</TaxPeriodEndDt>
    <!-- ... -->
  </ReturnHeader>
  <ReturnData documentCnt="12">
    <!-- ... -->
  </ReturnData>
</Return>

I used

year = root.find('.//irs:TaxPeriodEndDt', ns).text[:4]

This worked well, but in some XML files the tag is named TaxPeriodEndDate instead:

<Return xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2018v3.2">
  <ReturnHeader binaryAttachmentCnt="0">
    <!-- ... -->
    <TaxPeriodEndDate>2012-09-30</TaxPeriodEndDate>
    <!-- ... -->
  </ReturnHeader>
  <ReturnData documentCnt="12">
    <!-- ... -->
  </ReturnData>
</Return>

I tried to revise the code to match either tag:

year = root.find('.//irs:TaxPeriodEndDt|.//irs:TaxPeriodEndDate', ns).text[:4]

This did not work: there was no error message, but also no output. Any suggestions are highly appreciated. Thank you.

CodePudding user response:

XPath support in ElementTree is very limited. The union operator (|) doesn't appear to work, and other options, like using the self:: axis or name()/local-name() in a predicate, aren't supported either.

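I think your best bet is to use a try/except. When the first tag is missing, find() returns None, so accessing .text raises an AttributeError, and you can fall back to the second tag...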

try:
    year = root.find(".//irs:TaxPeriodEndDt", ns).text[:4]
except AttributeError:
    year = root.find(".//irs:TaxPeriodEndDate", ns).text[:4]
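If more tag variants turn up later, the same fallback idea can be written as a loop over candidate names instead of nested try/except blocks. The following is only a minimal sketch: the helper name extract_year, the tags parameter, and the two inline sample documents are made up for illustration, and the ns mapping is assumed to be the one the question uses (prefix irs bound to http://www.irs.gov/efile).

```python
import xml.etree.ElementTree as ET

# Assumed namespace map; the question does not show how ns was defined.
ns = {"irs": "http://www.irs.gov/efile"}


def extract_year(root, tags=("TaxPeriodEndDt", "TaxPeriodEndDate")):
    """Hypothetical helper: return the year from the first candidate tag found."""
    for tag in tags:
        elem = root.find(f".//irs:{tag}", ns)
        # Compare against None explicitly: an Element without children is falsy.
        if elem is not None and elem.text:
            return elem.text[:4]
    return None  # none of the candidate tags is present


# Two trimmed-down sample documents, one per tag variant (illustrative only).
newer = ET.fromstring(
    '<Return xmlns="http://www.irs.gov/efile"><ReturnHeader>'
    "<TaxPeriodEndDt>2019-09-30</TaxPeriodEndDt>"
    "</ReturnHeader></Return>"
)
older = ET.fromstring(
    '<Return xmlns="http://www.irs.gov/efile"><ReturnHeader>'
    "<TaxPeriodEndDate>2012-09-30</TaxPeriodEndDate>"
    "</ReturnHeader></Return>"
)

year_newer = extract_year(newer)
year_older = extract_year(older)
```

Unlike the try/except version, this returns None when neither tag exists rather than raising, so the caller can decide how to handle files without a tax period end date.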

If you can switch to lxml, your original attempt with the union operator will work with a few small changes: use xpath() instead of find(), pass the namespace map through the namespaces keyword argument, and index the result, because xpath() returns a list of matches...

year = root.xpath(".//irs:TaxPeriodEndDt|.//irs:TaxPeriodEndDate", namespaces=ns)[0].text[:4]