Retrieving text data from <content:encoded> in XML file

Time:05-23

I have an XML file which looks like this:

<rss version="2.0"
    xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:wp="http://wordpress.org/export/1.2/"
>

<channel>

<item>
        <title>Label: some_title&quot;</title>
        <link>some_link</link>
        <pubDate>some_date</pubDate>
        <dc:creator><![CDATA[University]]></dc:creator>
        <guid isPermaLink="false">https://link.link</guid>
        <description></description>
        <content:encoded><![CDATA[[vc_row][vc_column][vc_column_text]<strong>some text<a href="https://link.link" target="_blank" rel="noopener noreferrer">text</a> some more text</strong><!--more-->

[caption id="attachment_344" align="aligncenter" width="524"]<img  src="link.link.png" alt="" width="524" height="316" /> <em>A <a href="link.link" target="_blank" rel="noopener noreferrer">screenshot</a> by the people</em>[/caption]

&nbsp;

<strong>some more text</strong>

&nbsp;
<div >

<em>Leave your comments</em>

</div>
<div >
<div ></div>
</div>
[/vc_column_text][/vc_column][/vc_row][vc_row][vc_column][/vc_column][/vc_row][vc_row][vc_column][dt_quote]<strong><b>RESEARCH | ARTICLE </b></strong>University[/dt_quote][/vc_column][/vc_row]]]></content:encoded>
        <excerpt:encoded><![CDATA[]]></excerpt:encoded>
</item>
some more <item> </item>s here
</channel>

I want to extract the raw text within the <content:encoded> section, excluding the tags and URLs. I have tried this with BeautifulSoup and Scrapy, as well as other lxml methods. Most return an empty list.

Is there a way for me to retrieve this information without having to use regex?

Much appreciated.

UPDATE

I opened the XML file using:

content = []
with open(xml_file, "r") as file:
    content = file.readlines()
    content = "".join(content)
    xml = bs(content, "lxml")

Then I tried this with Scrapy:

response = HtmlResponse(url=xml_file, encoding='utf-8')

response.selector.register_namespace('content', 
                                     'http://purl.org/rss/1.0/modules/content/')
response.xpath('channel/item/content:encoded').getall()

which returns an empty list.

I also tried the code in the first answer:

soup = bs(xml.select_one("content:encoded").text, "html.parser")
text = "\n".join(
    s.get_text(strip=True, separator=" ") for s in soup.select("strong"))
print(text)

and got this error: Only the following pseudo-classes are implemented: nth-of-type.

When I opened the file with lxml, I ran this for loop:

data = {}
n = 0

for item in xml.findall('item'):
  id = 'claim_id_' + str(n)
  keys = {}
  title = item.find('title').text
  keys['label'] = title.split(': ')[0]
  keys['claim'] = title.split(': ')[1]
  if item.find('content:encoded'):
    keys['text'] = bs(html.unescape(item.encoded.text), 'lxml')
  data[id] = keys
  print(data)
  n += 1

It saved the label and claim perfectly well, but nothing for the text. Now that I opened the file using BeautifulSoup, it returns this error: 'NoneType' object is not callable

CodePudding user response:

If you only need text inside <strong> tags, you can use my example. Otherwise, only regex seems suitable here:

from bs4 import BeautifulSoup

xml_doc = """
<rss version="2.0"
    xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:wp="http://wordpress.org/export/1.2/"
>

...the XML from the question...

</rss>
"""

soup = BeautifulSoup(xml_doc, "xml")

soup = BeautifulSoup(soup.select_one("content|encoded").text, "html.parser")

text = "\n".join(
    s.get_text(strip=True, separator=" ") for s in soup.select("strong")
)
print(text)

Prints:

some text text some more text
some more text
RESEARCH | ARTICLE
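
If the text outside the <strong> tags is needed as well, a minimal sketch using only the standard library could look like the following. It assumes the export is well-formed XML; `sample`, `NS`, `TextOnly` and `html_to_text` are names made up for illustration, and `sample` is a shortened stand-in for the file in the question. The lookup goes through a prefix-to-URI mapping, and the result is compared with None because an element without children is falsy, which may be part of why the `if item.find('content:encoded'):` attempt stored nothing.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Prefix-to-URI mapping; the URI is copied from the xmlns:content declaration.
NS = {"content": "http://purl.org/rss/1.0/modules/content/"}

# Hypothetical, shortened stand-in for the export file in the question.
sample = """<rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<item>
  <title>Label: some_title</title>
  <content:encoded><![CDATA[[vc_row]<strong>some text <a href="https://link.link">text</a></strong>
&nbsp;
<em>Leave your comments</em>[/vc_row]]]></content:encoded>
</item>
</channel>
</rss>"""


class TextOnly(HTMLParser):
    """Collects character data and ignores tags, comments and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    # str.split() also treats the non-breaking space from &nbsp; as whitespace.
    return " ".join("".join(parser.parts).split())


root = ET.fromstring(sample)
texts = []
for item in root.iter("item"):
    node = item.find("content:encoded", NS)
    # Compare with None: an element without children evaluates as False.
    if node is not None and node.text:
        texts.append(html_to_text(node.text))

print(texts)
```

The HTML tags and the link URL are gone after this step, but the WordPress shortcodes such as [vc_row] are plain text as far as an HTML parser is concerned, so they survive and need a separate pass.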

CodePudding user response:

I eventually got the text part using regular expressions (regex).

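import re
from lxml import etree

# Assumption: the file was parsed with lxml and root is the <rss> element
root = etree.parse(xml_file).getroot()
# Fully qualified tag name, so that <excerpt:encoded> is not matched as well
CONTENT_ENCODED = '{http://purl.org/rss/1.0/modules/content/}encoded'

for item in root.iter('item'):
  for child in item:   # iterating the element replaces the deprecated getchildren()
    if child.tag == CONTENT_ENCODED and child.text:
      text = child.text
      text = re.sub(r'\[.*?\]', "", text)   # gets rid of square brackets and their content
      text = re.sub(r'<.*?>', "", text)   # gets rid of <> signs and their content
      text = text.replace("&nbsp;", "")   # gets rid of &nbsp;
      text = " ".join(text.split())   # collapses runs of whitespace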
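
The same cleanup can be wrapped in a small function so it can be tried on a single string first. This is only a sketch: `clean`, `SHORTCODE`, `TAG` and the `raw` sample are hypothetical names, and the shortcode pattern assumes shortcode names start with a letter or underscore. Unlike the loop above, it replaces each match with a space rather than an empty string, so words on either side of a tag are not glued together, and it uses html.unescape so entities other than &nbsp; are handled too.

```python
import html
import re

# [vc_row], [caption id="..."], [/caption] and similar shortcodes
SHORTCODE = re.compile(r"\[/?[A-Za-z_][^\]]*\]")
# <strong>, </a>, <!--more--> and similar tags or comments
TAG = re.compile(r"<[^>]+>")


def clean(raw):
    text = SHORTCODE.sub(" ", raw)   # drop shortcodes, keep what they wrap
    text = TAG.sub(" ", text)        # drop tags together with their attributes
    text = html.unescape(text)       # &nbsp; becomes a non-breaking space
    return " ".join(text.split())    # collapse all whitespace runs


raw = (
    '[vc_row][vc_column_text]<strong>some text'
    '<a href="https://link.link">text</a> some more text</strong><!--more-->\n\n'
    '&nbsp;\n<em>Leave your comments</em>[/vc_column_text][/vc_row]'
)
print(clean(raw))
```

With an empty-string replacement, "some text<a ...>text</a>" would come out as "some texttext", which is the main reason for substituting a space here.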