Only scrape google news articles containing an exact phrase in python

Time:05-28

I'm trying to build a media tracker in Python that, each day, returns all Google News articles containing a specific phrase, "Center for Community Alternatives". If, on a given day, no new articles contain this exact phrase, no new links should be added to the data frame. The problem is that even on days when no articles contain my phrase, my code adds articles with similar phrases to the data frame. How can I append only links that contain my exact phrase?

Below is example code for 03/01/2022:

from GoogleNews import GoogleNews
from newspaper import Article
import pandas as pd

googlenews=GoogleNews(start='03/01/2022',end='03/01/2022')
googlenews.search('"Center for Community Alternatives"')

googlenews.getpage(1)
result=googlenews.result()
df=pd.DataFrame(result)

df

When you search "Center for Community Alternatives" (with quotes) in Google News for this specific date, Google reports No results found for "center for community alternatives". Even so, the code scrapes the links that appear below that message, which are Results for center for community alternatives (without quotes).

CodePudding user response:

The API you're using does not support exact match.

In https://github.com/Iceloof/GoogleNews/blob/master/GoogleNews/__init__.py:

def search(self, key):
    """
    Searches for a term in google.com in the news section and retrieves the first page into __results.
    Parameters:
    key = the search term
    """
    self.__key = " ".join(key.split(" "))
    if self.__encode != "":
        self.__key = urllib.request.quote(self.__key.encode(self.__encode))
    self.get_page()
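To see what this method does with a quoted phrase, the key handling can be reproduced in isolation. This is a minimal sketch that copies only the two string operations from the snippet above; the helper name `build_key` is made up for illustration, and no request is sent.

```python
import urllib.parse


def build_key(key, encode="utf-8"):
    """Mimic the key handling in search() (illustrative helper, not part of the library)."""
    # Splitting on single spaces and re-joining leaves the text unchanged.
    key = " ".join(key.split(" "))
    if encode != "":
        # urllib.request.quote is the same function as urllib.parse.quote.
        key = urllib.parse.quote(key.encode(encode))
    return key


print(build_key('"Center for Community Alternatives"'))
# prints %22Center%20for%20Community%20Alternatives%22
```

The quotes survive as %22, so the phrase is forwarded as typed, but nothing in this method filters the items that come back. Any fallback results that Google shows for the unquoted phrase would therefore still end up in the result list.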

As an alternative, you could probably just filter your data frame using an exact match:

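# Replace 'title' with whichever column should contain the phrase (e.g. 'desc').
df = df[df['title'].str.contains('Center for Community Alternatives', regex=False, na=False)]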
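The same exact-phrase check can also be applied to the raw result list before the data frame is built. The following is a minimal sketch, assuming each result is a dictionary with `title` and `desc` text fields; `contains_phrase` and `filter_exact` are hypothetical helper names, and the sample data is invented.

```python
import re


def contains_phrase(text, phrase):
    """Return True if `phrase` occurs in `text`, ignoring case and extra whitespace."""
    def normalize(value):
        return re.sub(r"\s+", " ", value or "").strip().lower()
    return normalize(phrase) in normalize(text)


def filter_exact(results, phrase, fields=("title", "desc")):
    """Keep only result dicts where any of `fields` contains the exact phrase."""
    return [
        item for item in results
        if any(contains_phrase(item.get(field), phrase) for field in fields)
    ]


# Made-up sample data shaped like the result dictionaries (assumption).
sample = [
    {"title": "Center for Community Alternatives opens new office",
     "desc": "", "link": "https://example.com/a"},
    {"title": "Community center offers alternatives to detention",
     "desc": "", "link": "https://example.com/b"},
]
matches = filter_exact(sample, "Center for Community Alternatives")
print([item["link"] for item in matches])
```

On a day with no exact matches, `filter_exact` returns an empty list, so building a data frame from `matches` would add no rows.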

CodePudding user response:

You are probably not getting any results because:

  • No search results match your search term, "Center for Community Alternatives", within the date range in your question, 03/01/2022.

If you remove the double quotes from the search term and widen the date range, you might get some results. That depends entirely on how often the source posts news and on how Google handles such topics.

I suggest changing your code to:

  • Keep the search term, Center for Community Alternatives, without double quotes.
  • Search over a longer date range.
  • Keep only distinct values; while testing this code, I got duplicate entries.
  • Fetch more than one page to increase the chances of getting results.

Code:

#!pip install GoogleNews  # https://pypi.org/project/GoogleNews/
#!pip install newspaper3k # https://pypi.org/project/newspaper3k/

from GoogleNews import GoogleNews
from newspaper import Article
import pandas as pd

search_term = "Center for Community Alternatives"

# Dates are assumed to be in "MM/dd/yyyy" format. Pass all options in one call;
# creating a second GoogleNews object afterwards would discard the date range.
googlenews=GoogleNews(lang='en', region='US', start='03/01/2022', end='03/03/2022')
googlenews.search(search_term)

# Initial list of results - it will contain a list of dictionaries (dict).
results = []

# Contains the final results = news filtered by the criteria 
# (news that in their description contains the search term).
final_results = []

# Get the first 3 pages of results and append them to the list - set any other range according to your needs:
for page in range(1,4):
  googlenews.getpage(page) # Consider adding a delay between calls to avoid an "HTTP Error 429: Too Many Requests" error.
  results.extend(googlenews.result())

# Remove duplicates and add to the "final_results" list
# only the news items whose description includes the search term:
for item in results: 
  if (item not in final_results and (search_term in item["desc"])):
    final_results.append(item)

# Build and show the final dataframe: 
df=pd.DataFrame(final_results)
df
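The duplicate check above compares whole dictionaries, which may miss repeats if the same article comes back with any field that differs. As a hedged alternative, here is a minimal sketch that treats two items as duplicates when their `link` values match; it assumes every item has a `link` key, and `dedupe_by_link` is a hypothetical helper name with invented sample data.

```python
def dedupe_by_link(items):
    """Return items with unique "link" values, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for item in items:
        link = item.get("link")
        if link in seen:
            continue
        seen.add(link)
        unique.append(item)
    return unique


# Made-up sample: the same link appears twice with a different "date" field.
sample_items = [
    {"link": "https://example.com/a", "date": "1 day ago"},
    {"link": "https://example.com/b", "date": "2 days ago"},
    {"link": "https://example.com/a", "date": "Mar 1, 2022"},
]
unique_items = dedupe_by_link(sample_items)
print([item["link"] for item in unique_items])
```

Keeping the first occurrence preserves the order in which the pages were fetched.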

Keep in mind that you may still get no results because of factors outside your control.
