I'm trying to create a web scraper that returns articles from an RSS feed (XML format) only if the title contains a certain keyword. However, whenever I run the code it returns nothing, even though the title on its own prints correctly. For example, when I ask it to return the article only if the word "said" is in the title, nothing is returned even when "said" is in fact in the title.
Code:
import webbrowser
import winsound

import requests
from bs4 import BeautifulSoup

xml_text = requests.get('https://nypost.com/feed/').text
soup = BeautifulSoup(xml_text, 'xml')
ny_rss_search = soup.find_all("Mark")
ny_rss_title3 = soup.find_all('title')
ny_rss_url3 = soup.find_all('link')
ny_rss_summary3 = soup.find_all('description')
ny_rss_url_compact3 = ny_rss_url3[2].text.strip()
if 'Guide' in ny_rss_title3:
    webbrowser.open(ny_rss_url_compact3, new=2)
    print(f'NY Post Article Title: {ny_rss_title3[1].text.strip()}\n')
    print(f"NY Post Article URL: {ny_rss_url3[2].text.strip()}\n")
    print(f'NY Post Article Summary: {ny_rss_summary3[1].text.strip()}\n')
    winsound.PlaySound("notify.wav", winsound.SND_ALIAS)
Sample XML text:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
xmlns:georss="http://www.georss.org/georss"
xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
xmlns:media="http://search.yahoo.com/mrss/"
>
<channel>
<title>New York Post</title>
<atom:link href="https://nypost.com/feed/" rel="self" type="application/rss+xml" />
<link>https://nypost.com</link>
<description>Your source for breaking news, news about New York, sports, business, entertainment, opinion, real estate, culture, fashion, and more.</description>
<lastBuildDate>Tue, 05 Jul 2022 14:06:44 +0000</lastBuildDate>
<language>en-US</language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<generator>https://wordpress.org/?v=5.9.3</generator>
<item>
<title>Blue Jays coach Mark Budzinski’s daughter Julia died in boating accident</title>
<comments>https://nypost.com/2022/07/05/mark-budzinskis-daughter-julia-17-died-in-boating-accident/#respond</comments>
<pubDate>Tue, 05 Jul 2022 10:01:06 -0400</pubDate>
<link>https://nypost.com/2022/07/05/mark-budzinskis-daughter-julia-17-died-in-boating-accident/</link>
<dc:creator>Associated Press</dc:creator>
<guid isPermaLink="false">https://nypost.com/?post_type=article&p=22918233</guid>
<description><![CDATA[Pearson said no foul play is suspected and alcohol was not a factor. “It was a terrible accident,” she said.]]></description>
<content:encoded><![CDATA[Pearson said no foul play is suspected and alcohol was not a factor. “It was a terrible accident,” she said.]]></content:encoded>
<enclosure url="https://nypost.com/wp-content/uploads/sites/2/2022/07/Julia-Budzinski.jpg?quality=90&strip=all" type="image/jpeg" />
<slash:comments>0</slash:comments>
<media:content url="https://nypost.com/wp-content/uploads/sites/2/2022/07/Julia-Budzinski.jpg?w=1024" medium="image">
<media:title type="html">The Blue Jays held a moment of silence for first base coach Mark Budzinski's daughter Julia on Sunday.</media:title>
</media:content>
<media:content url="https://nypost.com/wp-content/uploads/sites/2/2022/07/Mark-Budzinski.jpg?w=1024" medium="image">
<media:title type="html">Mark Budzinski</media:title>
</media:content>
Answer:
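Your check `'Guide' in ny_rss_title3` tests whether the string is an element of the list returned by find_all(). That list holds tag objects, not strings, so the test never matches. You have to iterate over the items of the feed and check whether each title's text contains your term: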
for e in soup.select('item'):
    if 'Guide' in e.title.text:
        print(e.title.text)
        print(e.link.text)
Example:
from bs4 import BeautifulSoup
import requests

xml_text = requests.get('https://nypost.com/feed/').text
soup = BeautifulSoup(xml_text, 'xml')

# Each <item> is one article; test the keyword against its own title text.
for e in soup.select('item'):
    if 'Guide' in e.title.text:
        print(e.title.text)
        print(e.link.text)
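
If you want to try the same per-item check without hitting the live feed, here is a minimal sketch. It swaps BeautifulSoup for the standard library's xml.etree.ElementTree so it runs with no extra packages, and it makes the match case-insensitive. The SAMPLE_FEED string, its second item, and the find_items helper are assumptions made up for illustration, not part of the original code or the real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, trimmed from the structure shown in the question.
# The second <item> is invented purely so the keyword 'guide' has a match.
SAMPLE_FEED = """<rss version="2.0">
<channel>
<title>New York Post</title>
<link>https://nypost.com</link>
<item>
<title>Blue Jays coach Mark Budzinski's daughter Julia died in boating accident</title>
<link>https://nypost.com/2022/07/05/mark-budzinskis-daughter-julia-17-died-in-boating-accident/</link>
</item>
<item>
<title>Example Guide article (hypothetical)</title>
<link>https://example.com/guide</link>
</item>
</channel>
</rss>"""


def find_items(xml_text, keyword):
    """Return (title, link) pairs for items whose title contains keyword."""
    root = ET.fromstring(xml_text)
    matches = []
    # iter('item') visits only <item> elements, so the channel-level
    # <title> and <link> are never mixed in with the articles.
    for item in root.iter('item'):
        title = (item.findtext('title') or '').strip()
        link = (item.findtext('link') or '').strip()
        # Case-insensitive substring test on the title string itself.
        if keyword.lower() in title.lower():
            matches.append((title, link))
    return matches


for title, link in find_items(SAMPLE_FEED, 'guide'):
    print(title)
    print(link)
```

Because title and link are read from the same item, they always belong together, which avoids the index juggling (`ny_rss_title3[1]` vs. `ny_rss_url3[2]`) needed when all titles and links are collected into separate lists.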