I am trying to scrape Google News with the gnews package. However, I don't know how to retrieve older articles, for example articles from 2010.
from gnews import GNews
from newspaper import Article
import pandas as pd
import datetime
google_news = GNews(language='es', country='Argentina', period = '7d')
argentina_news = google_news.get_news('protesta clarin')
print(len(argentina_news))
This code works perfectly for recent articles, but I need older ones. I saw https://github.com/ranahaani/GNews#todo, where something like the following appears:
google_news = GNews(language='es', country='Argentina', period='7d', start_date='01-01-2015', end_date='01-01-2016', max_results=10, exclude_websites=['yahoo.com', 'cnn.com'],
proxy=proxy)
But when I try start_date I get:
TypeError: __init__() got an unexpected keyword argument 'start_date'
Can anyone help me get articles for specific dates? Thank you very much, guys!
CodePudding user response:
The example code is incorrect for gnews==0.2.7, which is the latest you can install off PyPI via pip (or whatever). The documentation is for the unreleased mainline code that you can get directly off their git source.
Confirmed by inspecting the GNews.__init__ method, which doesn't have keyword args for start_date or end_date:
In [1]: import gnews
In [2]: gnews.GNews.__init__??
Signature:
gnews.GNews.__init__(
    self,
    language='en',
    country='US',
    max_results=100,
    period=None,
    exclude_websites=None,
    proxy=None,
)
Docstring: Initialize self. See help(type(self)) for accurate signature.
Source:
def __init__(self, language="en", country="US", max_results=100, period=None, exclude_websites=None, proxy=None):
    self.countries = tuple(AVAILABLE_COUNTRIES),
    self.languages = tuple(AVAILABLE_LANGUAGES),
    self._max_results = max_results
    self._language = language
    self._country = country
    self._period = period
    self._exclude_websites = exclude_websites if exclude_websites and isinstance(exclude_websites, list) else []
    self._proxy = {'http': proxy, 'https': proxy} if proxy else None
File: ~/src/news-test/.venv/lib/python3.10/site-packages/gnews/gnews.py
Type: function
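The same check can be done outside IPython. This is only a minimal sketch using the standard-library inspect module: accepts_kwargs is a hypothetical helper name, and OldStyleClient is a stand-in class that copies the 0.2.7 constructor arguments shown above (it is not part of gnews), so the snippet runs without the package installed.

```python
import inspect


def accepts_kwargs(cls, *names):
    """Return True if cls.__init__ accepts every keyword in names."""
    params = inspect.signature(cls.__init__).parameters
    # A **kwargs catch-all would accept any keyword argument.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return True
    return all(name in params for name in names)


class OldStyleClient:
    """Stand-in with the same constructor arguments as gnews 0.2.7 (illustration only)."""

    def __init__(self, language="en", country="US", max_results=100,
                 period=None, exclude_websites=None, proxy=None):
        self.period = period


print(accepts_kwargs(OldStyleClient, "start_date", "end_date"))
print(accepts_kwargs(OldStyleClient, "period"))
```

With gnews installed, calling accepts_kwargs(GNews, "start_date", "end_date") should tell you whether the copy in your environment supports the date arguments before you rely on them.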
If you want the start_date and end_date functionality, that was only added rather recently, so you will need to install the module off their git source.
# use whatever you use to uninstall any pre-existing gnews module
pip uninstall gnews
# install from the project's git main branch
pip install git+https://github.com/ranahaani/GNews.git
Now you can use the start/end functionality:
import datetime
from gnews import GNews
start = datetime.date(2015, 1, 15)
end = datetime.date(2015, 1, 16)
google_news = GNews(language='es', country='Argentina', start_date=start, end_date=end)
rsp = google_news.get_news("protesta")
print(rsp)
I get this as a result:
[{'title': 'Latin Roots: The Protest Music Of South America - NPR',
'description': 'Latin Roots: The Protest Music Of South America NPR',
'published date': 'Thu, 15 Jan 2015 08:00:00 GMT',
'url': 'https://www.npr.org/sections/world-cafe/2015/01/15/377491862/latin-roots-the-protest-music-of-south-america',
'publisher': {'href': 'https://www.npr.org', 'title': 'NPR'}}]
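If you want to double-check that each returned article really falls inside the requested window, the 'published date' string in the result above can be parsed with the standard library. A rough sketch, assuming every article carries that key in the same RFC 2822 style format as the one shown; published_between is a hypothetical helper, not something gnews provides.

```python
import datetime
from email.utils import parsedate_to_datetime


def published_between(article, start, end):
    """Return True if the article's 'published date' falls within [start, end]."""
    published = parsedate_to_datetime(article["published date"]).date()
    return start <= published <= end


# Trimmed copy of the article from the result above.
sample = {
    "title": "Latin Roots: The Protest Music Of South America - NPR",
    "published date": "Thu, 15 Jan 2015 08:00:00 GMT",
}
start = datetime.date(2015, 1, 15)
end = datetime.date(2015, 1, 16)
print(published_between(sample, start, end))
```

Something like [a for a in rsp if published_between(a, start, end)] could then drop any stray results.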
Also note:
- period is ignored if you set start_date and end_date
- Their documentation shows you can pass the dates as tuples like (2015, 1, 15). This doesn't seem to work - just be safe and pass a datetime object.
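If your dates already exist as tuples, one way to stay on the safe side is to convert them yourself before handing them to GNews. A small sketch, assuming a datetime.date is acceptable (that is what the working example above passes); to_date is a hypothetical helper name.

```python
import datetime


def to_date(value):
    """Coerce a (year, month, day) tuple, date, or datetime into a datetime.date."""
    # datetime.datetime is a subclass of datetime.date, so check it first.
    if isinstance(value, datetime.datetime):
        return value.date()
    if isinstance(value, datetime.date):
        return value
    if isinstance(value, tuple) and len(value) == 3:
        return datetime.date(*value)
    raise TypeError(f"cannot interpret {value!r} as a date")


print(to_date((2015, 1, 15)))
print(to_date(datetime.datetime(2015, 1, 16, 8, 30)))
```

The converted values can then be passed as start_date=to_date((2015, 1, 15)) and end_date=to_date((2015, 1, 16)).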