I have a scraper that collects headlines, but I also want the publication date.
This is my code:
    import requests
    from bs4 import BeautifulSoup as bs

    news = []
    url = 1
    while url != 100:
        website = f"https://www.newscientist.com/subject/space/page/{url}"
        r = requests.get(
            website,
            headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
                "Referer": website
            }
        )
        soup = bs(r.text, 'html.parser')
        for h2 in soup.find_all("h2"):
            news.append(h2.get_text(strip=True))
        url += 1  # advance to the next page
The problem is that the publication date is *inside* each news article, and I don't know how to get to it.
CodePudding user response:
There are different options; the simplest one, in my opinion, is to use their RSS feed:

    import pandas as pd
    pd.read_xml('https://www.newscientist.com/subject/space/feed/', xpath='*/item')
| | title | link | pubDate | description | guid | {http://search.yahoo.com/mrss/}thumbnail |
|---|---|---|---|---|---|---|
| 0 | Bluewalker 3 satellite is brighter than 99.8 per cent of visible stars | https://www.newscientist.com/article/2348615-bluewalker-3-satellite-is-brighter-than-99-8-per-cent-of-visible-stars/?utm_campaign=RSS\|NSNS&utm_source=NSNS&utm_medium=RSS&utm_content=space | Fri, 25 Nov 2022 15:54:32 +0000 | Observations of a huge test satellite that launched in September have fuelled concerns about the impact a planned fleet could have on astronomy | 2348615-bluewalker-3-satellite-is-brighter-than-99-8-per-cent-of-visible-stars | 2348615 |
| ... | ... | ... | ... | ... | ... | ... |
| 99 | JWST's dazzling nebula image shows stars we have never seen before | https://www.newscientist.com/article/2336822-jwsts-dazzling-nebula-image-shows-stars-we-have-never-seen-before/?utm_campaign=RSS\|NSNS&utm_source=NSNS&utm_medium=RSS&utm_content=space | Tue, 06 Sep 2022 18:36:28 +0100 | Astronomers have used the James Webb Space Telescope to peer through the filaments of dust and gas in the Tarantula Nebula, the brightest and biggest stellar nursery around | 2336822-jwsts-dazzling-nebula-image-shows-stars-we-have-never-seen-before | 2336822 |
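The pubDate values come back as RFC 822 date strings. If actual datetime objects are needed, the standard library can parse them without extra dependencies; a minimal sketch using one of the dates shown above:

```python
from email.utils import parsedate_to_datetime

# One pubDate string as it appears in the feed output above
raw = "Fri, 25 Nov 2022 15:54:32 +0000"

dt = parsedate_to_datetime(raw)  # timezone-aware datetime
print(dt.isoformat())  # 2022-11-25T15:54:32+00:00
```

From there, sorting or filtering articles by date is straightforward.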
An alternative would be to iterate over each article:

    ...
    soup = BeautifulSoup(r.text, 'html.parser')

    for a in soup.select('h2 a'):
        soup_article = BeautifulSoup(
            requests.get(
                'https://www.newscientist.com' + a.get('href'),
                headers={
                    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
                }
            ).text,
            'html.parser'
        )
        news.append(
            {
                'title': soup_article.h1.text,
                'date': soup_article.select_one('.published-date').get_text(strip=True) if soup_article.select_one('.published-date') else None
            }
        )
    news
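The `.published-date` class name above is an assumption about the site's current markup and may change; the guard-against-`None` pattern itself can be checked offline on a small illustrative snippet:

```python
from bs4 import BeautifulSoup

# Illustrative article HTML; the real class name on the site may differ
html = '<article><h1>Example title</h1><span class="published-date">25 November 2022</span></article>'
soup_article = BeautifulSoup(html, 'html.parser')

# select_one returns None when the selector matches nothing,
# so guard before calling get_text on the result
tag = soup_article.select_one('.published-date')
date = tag.get_text(strip=True) if tag else None
print(date)  # 25 November 2022
```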
CodePudding user response:
Here is an alternative, built on @HedgeHog's answer:
    import requests
    from bs4 import BeautifulSoup as bs

    space_rss = "https://www.newscientist.com/subject/space/feed/"
    r = requests.get(
        space_rss,
        headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
            "Referer": space_rss,
        },
    )
    if r.ok:
        soup = bs(r.text, "xml")
        news = [
            (item.select_one("pubDate").text, item.select_one("title").text)
            for item in soup.find_all("item")
        ]
It will populate news with a list of (pubDate, title) tuples from the RSS feed.
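Note that parsing with `bs(r.text, "xml")` requires lxml to be installed. If it is not, the standard library's ElementTree can extract the same fields; a minimal sketch on an inline feed fragment (illustrative data only):

```python
import xml.etree.ElementTree as ET

# Inline fragment mimicking the RSS feed structure
rss = """<rss><channel>
<item><title>Example headline</title><pubDate>Fri, 25 Nov 2022 15:54:32 +0000</pubDate></item>
</channel></rss>"""

root = ET.fromstring(rss)
feed_items = [
    (item.findtext("pubDate"), item.findtext("title"))
    for item in root.iter("item")
]
print(feed_items)  # [('Fri, 25 Nov 2022 15:54:32 +0000', 'Example headline')]
```

For the real feed, `ET.fromstring(r.text)` would take the place of the inline string.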