I have a scraper that collects headlines, but I also want the publication date.
This is my code:
    import requests
    from bs4 import BeautifulSoup as bs

    news = []
    url = 1
    while url != 100:
        website = f"https://www.newscientist.com/subject/space/page/{url}"
        r = requests.get(
            website,
            headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
                "Referer": website
            }
        )
        soup = bs(r.text, 'html.parser')
        for h2 in soup.find_all("h2"):
            news.append(h2.get_text(strip=True))
        url += 1  # advance to the next page
The problem is that the publication date is *inside* each news article, and I don't know how to get to it.
CodePudding user response:
There are different options; the simplest one, in my opinion, is to use their RSS feed:

    import pandas as pd
    pd.read_xml('https://www.newscientist.com/subject/space/feed/', xpath='*/item')
| | title | link | pubDate | description | guid | {http://search.yahoo.com/mrss/}thumbnail |
|---|---|---|---|---|---|---|
| 0 | Bluewalker 3 satellite is brighter than 99.8 per cent of visible stars | https://www.newscientist.com/article/2348615-bluewalker-3-satellite-is-brighter-than-99-8-per-cent-of-visible-stars/?utm_campaign=RSS\|NSNS&utm_source=NSNS&utm_medium=RSS&utm_content=space | Fri, 25 Nov 2022 15:54:32 +0000 | Observations of a huge test satellite that launched in September have fuelled concerns about the impact a planned fleet could have on astronomy | 2348615-bluewalker-3-satellite-is-brighter-than-99-8-per-cent-of-visible-stars | 2348615 |
| ... | ... | ... | ... | ... | ... | ... |
| 99 | JWST's dazzling nebula image shows stars we have never seen before | https://www.newscientist.com/article/2336822-jwsts-dazzling-nebula-image-shows-stars-we-have-never-seen-before/?utm_campaign=RSS\|NSNS&utm_source=NSNS&utm_medium=RSS&utm_content=space | Tue, 06 Sep 2022 18:36:28 +0100 | Astronomers have used the James Webb Space Telescope to peer through the filaments of dust and gas in the Tarantula Nebula, the brightest and biggest stellar nursery around | 2336822-jwsts-dazzling-nebula-image-shows-stars-we-have-never-seen-before | 2336822 |
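The pubDate values come back as RFC 822 date strings. If actual datetime objects are needed, the standard library can parse them without extra dependencies; a minimal sketch using one of the dates shown above:

```python
from email.utils import parsedate_to_datetime

# One pubDate string as it appears in the feed output above
raw = "Fri, 25 Nov 2022 15:54:32 +0000"

dt = parsedate_to_datetime(raw)  # timezone-aware datetime
print(dt.isoformat())  # 2022-11-25T15:54:32+00:00
```

From there, sorting or filtering articles by date is straightforward.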
An alternative would be to iterate over each article:

    ...
    soup = BeautifulSoup(r.text, 'html.parser')

    for a in soup.select('h2 a'):
        soup_article = BeautifulSoup(
            requests.get(
                'https://www.newscientist.com' + a.get('href'),
                headers={
                    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
                }
            ).text,
            'html.parser'
        )
        news.append(
            {
                'title': soup_article.h1.text,
                'date': soup_article.select_one('.published-date').get_text(strip=True) if soup_article.select_one('.published-date') else None
            }
        )
    news
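The `.published-date` class name above is an assumption about the site's current markup and may change; the guard-against-`None` pattern itself can be checked offline on a small illustrative snippet:

```python
from bs4 import BeautifulSoup

# Illustrative article HTML; the real class name on the site may differ
html = '<article><h1>Example title</h1><span class="published-date">25 November 2022</span></article>'
soup_article = BeautifulSoup(html, 'html.parser')

# select_one returns None when the selector matches nothing,
# so guard before calling get_text on the result
tag = soup_article.select_one('.published-date')
date = tag.get_text(strip=True) if tag else None
print(date)  # 25 November 2022
```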
CodePudding user response:
Here is an alternative, built on @HedgeHog's answer:
    import requests
    from bs4 import BeautifulSoup as bs

    space_rss = "https://www.newscientist.com/subject/space/feed/"
    r = requests.get(
        space_rss,
        headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
            "Referer": space_rss,
        },
    )
    if r.ok:
        soup = bs(r.text, "xml")
        news = [
            (item.select_one("pubDate").text, item.select_one("title").text)
            for item in soup.find_all("item")
        ]
It will populate news with a list of (pubDate, title) tuples from the RSS feed.
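Note that parsing with `bs(r.text, "xml")` requires lxml to be installed. If it is not, the standard library's ElementTree can extract the same fields; a minimal sketch on an inline feed fragment (illustrative data only):

```python
import xml.etree.ElementTree as ET

# Inline fragment mimicking the RSS feed structure
rss = """<rss><channel>
<item><title>Example headline</title><pubDate>Fri, 25 Nov 2022 15:54:32 +0000</pubDate></item>
</channel></rss>"""

root = ET.fromstring(rss)
feed_items = [
    (item.findtext("pubDate"), item.findtext("title"))
    for item in root.iter("item")
]
print(feed_items)  # [('Fri, 25 Nov 2022 15:54:32 +0000', 'Example headline')]
```

For the real feed, `ET.fromstring(r.text)` would take the place of the inline string.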