I have an XML file which looks like this:
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
>
<channel>
<item>
<title>Label: some_title</title>
<link>some_link</link>
<pubDate>some_date</pubDate>
<dc:creator><![CDATA[University]]></dc:creator>
<guid isPermaLink="false">https://link.link</guid>
<description></description>
<content:encoded><![CDATA[[vc_row][vc_column][vc_column_text]<strong>some text<a href="https://link.link" target="_blank" rel="noopener noreferrer">text</a> some more text</strong><!--more-->
[caption id="attachment_344" align="aligncenter" width="524"]<img src="link.link.png" alt="" width="524" height="316" /> <em>A <a href="link.link" target="_blank" rel="noopener noreferrer">screenshot</a> by the people</em>[/caption]
<strong>some more text</strong>
<div >
<em>Leave your comments</em>
</div>
<div >
<div ></div>
</div>
[/vc_column_text][/vc_column][/vc_row][vc_row][vc_column][/vc_column][/vc_row][vc_row][vc_column][dt_quote]<strong><b>RESEARCH | ARTICLE </b></strong>University[/dt_quote][/vc_column][/vc_row]]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
</item>
some more <item> </item>s here
</channel>
I want to extract the raw text within the <content:encoded>
section, excluding the tags and URLs. I have tried this with BeautifulSoup and Scrapy, as well as other lxml methods. Most return an empty list.
Is there a way for me to retrieve this information without having to use regex?
Much appreciated.
UPDATE
I opened the XML file using:
from bs4 import BeautifulSoup as bs

with open(xml_file, "r") as file:
    content = file.read()
xml = bs(content, "lxml")
then I tried this with scrapy:
response = HtmlResponse(url=xml_file, encoding='utf-8')
response.selector.register_namespace('content',
    'http://purl.org/rss/1.0/modules/content/')
response.xpath('channel/item/content:encoded').getall()
which returns an empty list.
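The empty list is plausible here: `HtmlResponse` was built without a `body=` argument, so there is nothing to select from, and the XPath is relative rather than absolute. For comparison, the stdlib ElementTree resolves the same qualified tag only when given an explicit namespace map; a minimal sketch (the sample feed is cut down from the question):

```python
import xml.etree.ElementTree as ET

# Cut-down version of the feed from the question.
xml_doc = (
    '<rss version="2.0" '
    'xmlns:content="http://purl.org/rss/1.0/modules/content/">'
    "<channel><item>"
    "<content:encoded><![CDATA[<strong>some text</strong>]]></content:encoded>"
    "</item></channel></rss>"
)

root = ET.fromstring(xml_doc)
ns = {"content": "http://purl.org/rss/1.0/modules/content/"}

# The prefix in the path is resolved through the namespace map.
enc = root.find("channel/item/content:encoded", ns)
print(enc.text)  # the raw CDATA payload, tags and all
```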
and tried the code in the first answer:
soup = bs(xml.select_one("content:encoded").text, "html.parser")
text = "\n".join(
    s.get_text(strip=True, separator=" ") for s in soup.select("strong"))
print(text)
and get this error: Only the following pseudo-classes are implemented: nth-of-type.
When I opened the file with lxml, I ran this for loop:
data = {}
n = 0
for item in xml.findall('item'):
    id = 'claim_id_' + str(n)  # the '+' was missing
    keys = {}
    title = item.find('title').text
    keys['label'] = title.split(': ')[0]
    keys['claim'] = title.split(': ')[1]
    if item.find('content:encoded'):
        keys['text'] = bs(html.unescape(item.encoded.text), 'lxml')
    data[id] = keys
    print(data)
    n += 1  # was `n = 1`, which never advances the counter
It saved the label and claim perfectly well, but nothing for the text. When I open the file with BeautifulSoup instead, it returns this error: 'NoneType' object is not callable
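One plausible reason the text never lands in the dict: with lxml/ElementTree, an element with no *child elements* is falsy even when it carries text, so `if item.find('content:encoded'):` skips a match that holds only CDATA. Comparing against `None` avoids that. A stdlib sketch of the pitfall:

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring("<encoded>only text, no child elements</encoded>")

print(len(elem))         # 0 children, so a bare `if elem:` treats it as falsy
print(elem is not None)  # True: the check the loop actually needs
print(elem.text)         # the text is there all along
```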
CodePudding user response:
If you only need text inside <strong>
tags, you can use my example. Otherwise, only regex seems suitable here:
from bs4 import BeautifulSoup
xml_doc = """
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
>
...the XML from the question...
</rss>
"""
soup = BeautifulSoup(xml_doc, "xml")
soup = BeautifulSoup(soup.select_one("content|encoded").text, "html.parser")
text = "\n".join(
    s.get_text(strip=True, separator=" ") for s in soup.select("strong")
)
print(text)
Prints:
some text text some more text
some more text
RESEARCH | ARTICLE
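If lxml and the CSS selectors keep getting in the way, a stdlib-only fallback is to let `html.parser` strip the tags itself; note the WordPress `[shortcodes]` would pass through untouched, since they are plain text to an HTML parser. A sketch:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the character data and drops every tag."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

parser = TextExtractor()
parser.feed('<strong>some text<a href="https://link.link">text</a> some more text</strong>')
result = " ".join(" ".join(parser.parts).split())
print(result)
```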
CodePudding user response:
I eventually got the text part using regular expressions (regex).
import re
import xml.etree.ElementTree as ET

root = ET.parse(xml_file).getroot()  # xml_file as in the question
for item in root.iter('item'):
    for grandchild in item:  # Element.getchildren() was removed in Python 3.9
        if 'encoded' in grandchild.tag:
            text = grandchild.text
            text = re.sub(r'\[.*?\]', "", text)  # gets rid of square brackets and their content
            text = re.sub(r'<.*?>', "", text)  # gets rid of <> signs and their content
            text = text.replace("&nbsp;", "")  # gets rid of &nbsp;
            text = " ".join(text.split())
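For completeness, the same idea works with a namespaced `find` instead of scanning each item's children; a sketch reusing the regexes above, with the sample feed cut down from the question:

```python
import re
import xml.etree.ElementTree as ET

NS = {"content": "http://purl.org/rss/1.0/modules/content/"}

xml_doc = (
    '<rss version="2.0" '
    'xmlns:content="http://purl.org/rss/1.0/modules/content/">'
    "<channel><item>"
    "<content:encoded>"
    "<![CDATA[[vc_row]<strong>some text</strong>[/vc_row]]]>"
    "</content:encoded>"
    "</item></channel></rss>"
)

root = ET.fromstring(xml_doc)
texts = []
for item in root.iter("item"):
    enc = item.find("content:encoded", NS)
    if enc is not None and enc.text:  # `is not None`: childless elements are falsy
        raw = re.sub(r"\[.*?\]", "", enc.text)  # drop [shortcodes]
        raw = re.sub(r"<.*?>", "", raw)         # drop <tags>
        texts.append(" ".join(raw.split()))

print(texts)
```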