I am trying to extract the year from multiple XML files. Initially, the XML files looked like this:
<Return xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2018v3.2">
<ReturnHeader binaryAttachmentCnt="0">
<!-- ... -->
<TaxPeriodEndDt>2019-09-30</TaxPeriodEndDt>
<!-- ... -->
</ReturnHeader>
<ReturnData documentCnt="12">
<!-- ... -->
</ReturnData>
</Return>
I used:
year = root.find('.//irs:TaxPeriodEndDt',ns).text[:4]
This worked well, but in some XML files the tag is named TaxPeriodEndDate instead:
<Return xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2018v3.2">
<ReturnHeader binaryAttachmentCnt="0">
<!-- ... -->
<TaxPeriodEndDate>2012-09-30</TaxPeriodEndDate>
<!-- ... -->
</ReturnHeader>
<ReturnData documentCnt="12">
<!-- ... -->
</ReturnData>
</Return>
I tried revising the code to:
year = root.find('.//irs:TaxPeriodEndDt|.//irs:TaxPeriodEndDate',ns).text[:4]
It did not work: there was no error message, but also no output. Any suggestions are highly appreciated. Thank you.
CodePudding user response:
XPath support in ElementTree is very limited. The union operator (|) doesn't appear to work, and other options, like using the self:: axis or name()/local-name() in a predicate, aren't supported.
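I think your best bet is to use a try/except:
try:
    # find() returns None when the tag is absent, so .text raises AttributeError
    year = root.find(".//irs:TaxPeriodEndDt", ns).text[:4]
except AttributeError:
    # fall back to the alternative tag name
    year = root.find(".//irs:TaxPeriodEndDate", ns).text[:4]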
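If more tag variants turn up later, the same fallback idea can be written as a loop over candidate names. This is only a minimal sketch using the standard-library ElementTree: the helper name extract_year, the CANDIDATE_TAGS tuple, and the trimmed-down sample document are hypothetical, and the ns mapping is assumed to bind the irs prefix to the namespace shown in the question.

```python
import xml.etree.ElementTree as ET

# Assumed namespace map: the "irs" prefix is bound to the default namespace
# declared on the <Return> element in the question.
ns = {"irs": "http://www.irs.gov/efile"}

# Hypothetical list of tag names to try, in order of preference.
CANDIDATE_TAGS = ("TaxPeriodEndDt", "TaxPeriodEndDate")


def extract_year(root):
    """Return the first four characters of the first matching tag, or None."""
    for tag in CANDIDATE_TAGS:
        elem = root.find(f".//irs:{tag}", ns)
        # find() returns None when no element matches, so check before using .text
        if elem is not None and elem.text:
            return elem.text.strip()[:4]
    return None


# Trimmed-down sample modeled on the second file in the question.
sample = """<Return xmlns="http://www.irs.gov/efile">
  <ReturnHeader>
    <TaxPeriodEndDate>2012-09-30</TaxPeriodEndDate>
  </ReturnHeader>
</Return>"""

year = extract_year(ET.fromstring(sample))
print(year)
```

Returning None instead of raising lets the caller decide how to handle a file that contains neither tag.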
If you can switch to lxml, your original attempt with the union operator will work with a few small changes (mainly, use xpath() instead of find() and pass the namespace map via the namespaces keyword argument):
year = root.xpath(".//irs:TaxPeriodEndDt|.//irs:TaxPeriodEndDate", namespaces=ns)[0].text[:4]
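For a self-contained check of the lxml approach, something like the following sketch should work, assuming lxml is installed. The helper name extract_year_lxml and the trimmed-down sample document are hypothetical. Note that xpath() returns a list, so the sketch guards against an empty result rather than indexing [0] directly.

```python
from lxml import etree

# Assumed namespace map, matching the default namespace in the question's files.
ns = {"irs": "http://www.irs.gov/efile"}

# Trimmed-down sample modeled on the second file in the question.
sample = b"""<Return xmlns="http://www.irs.gov/efile">
  <ReturnHeader>
    <TaxPeriodEndDate>2012-09-30</TaxPeriodEndDate>
  </ReturnHeader>
</Return>"""


def extract_year_lxml(root):
    """Return the year from whichever tag variant is present, or None."""
    # lxml implements full XPath 1.0, so the union operator (|) is available.
    matches = root.xpath(
        ".//irs:TaxPeriodEndDt | .//irs:TaxPeriodEndDate", namespaces=ns
    )
    # xpath() returns an empty list when nothing matches.
    return matches[0].text[:4] if matches else None


root = etree.fromstring(sample)
print(extract_year_lxml(root))
```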