Python etree.ElementTree extracted XML text is truncated when text contains HTML tags

Time:10-31

I am scraping PubMed XML documents using Python's xml.etree.ElementTree. When HTML formatting tags are embedded in an element's text, only a fragment of the text is returned. For the following XML element, only the text up to the <i> tag comes back.

<AbstractText>Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which <i>Microdochium</i> species are the most harmful.</AbstractText>

Here is example code that runs but cannot return the complete text of a record that contains HTML tags.

import xml.etree.ElementTree as ET

xmldata = 'directory/to/data.xml'
tree = ET.parse(xmldata)
root = tree.getroot()

titles = {}

# Collect the title of each top-level record, keyed by its position.
for i, record in enumerate(root):
    for child in record.iter('ArticleTitle'):
        # child.text stops at the first nested tag such as <i>
        titles[i] = child.text

I have also tried something similar with child.xpath('//AbstractText/text()') using lxml.etree. This returns every text fragment in the document as a separate list element, with no clear way to recombine the fragments into their original abstracts (i.e., 3 abstracts can potentially return 3x as many list elements).

Answer:

The answer is itertext(), which collects the inner text of an element, including the text inside its child elements.

So the code would look like this:

import xml.etree.ElementTree as ET
from io import StringIO

raw_data="""
<AbstractText>Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which <i>Microdochium</i> species are the most harmful.</AbstractText>
"""
tree = ET.parse(StringIO(raw_data))
root = tree.getroot()
# The element contains a child element, which is why .text only returned the text up to <i>
for e in root.findall("."):
    print(e.text,type(e))

Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which <class 'xml.etree.ElementTree.Element'>
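
The rest of the sentence is not lost. ElementTree keeps the text inside <i> on the child element's text attribute, and the text that follows </i> on the child's tail attribute. A minimal sketch of where each piece lives (the shortened sample string is made up for illustration):

```python
import xml.etree.ElementTree as ET

# Shortened, made-up sample with the same shape as the abstract above.
elem = ET.fromstring(
    "<AbstractText>of which <i>Microdochium</i> species are harmful.</AbstractText>"
)
child = elem[0]  # the nested <i> element

# Text before <i>, text inside <i>, and text after </i>.
parts = [elem.text, child.text, child.tail]
print(parts)
```

Joining these three pieces by hand gives the full sentence, but that only works for a single nested tag at a known position.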

Using itertext() instead:

"".join(root.find(".").itertext()) # "".join(element.itertext())

'Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which Microdochium species are the most harmful.'
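
Applied to the loop from the question, the same idea might look like the sketch below. The inline sample document and the full_text helper are assumptions for illustration. Real PubMed exports are nested more deeply, and some abstracts have several AbstractText sections, which this sketch joins with a space.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a PubMed export.
sample = """
<PubmedArticleSet>
  <PubmedArticle>
    <ArticleTitle>Snow mold caused by <i>Microdochium</i></ArticleTitle>
    <AbstractText>Fungi, of which <i>Microdochium</i> species are the most harmful.</AbstractText>
  </PubmedArticle>
  <PubmedArticle>
    <ArticleTitle>A plain title</ArticleTitle>
    <AbstractText>First section.</AbstractText>
    <AbstractText>Second section.</AbstractText>
  </PubmedArticle>
</PubmedArticleSet>
"""


def full_text(element):
    """Return all text inside element, including text in nested tags."""
    return "".join(element.itertext())


root = ET.fromstring(sample)
abstracts = {}
for i, article in enumerate(root):
    # iter() finds AbstractText at any depth below this article.
    sections = [full_text(a) for a in article.iter("AbstractText")]
    abstracts[i] = " ".join(sections)
```

Because each abstract is built from the elements found under its own article, the fragments stay grouped per record, which avoids the flat list that the //AbstractText/text() query returned.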
