Using the example from "How to extract html links with a matching word from a website using python", I wrote a web scraping script to look for keywords in recent and cached versions of a local newspaper.
from bs4 import BeautifulSoup
import requests
urls = ["https://www.marinij.com/", 'https://web.archive.org/web/20210811185035/https://www.marinij.com/',
'https://web.archive.org/web/20210506004633/https://www.marinij.com/','https://web.archive.org/web/20210211022431/https://www.marinij.com/',
'https://web.archive.org/web/20201111174202/https://www.marinij.com/','https://web.archive.org/web/20200811204359/https://www.marinij.com/',
'https://web.archive.org/web/20200511165943/https://www.marinij.com/','https://web.archive.org/web/20200209014056/https://www.marinij.com/',
'https://web.archive.org/web/20191111061843/https://www.marinij.com/']
dates = ['today','aug2021','may2021','feb2021','nov2020','aug2020','may2020','feb2020','nov2019']
for i, (url,date) in enumerate(zip(urls,dates)):
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                               'href' in tag.attrs and
                               ('corona' or 'covid') in tag.get_text().lower())
    results = soup.find_all(covid_links)
    num_art = str((len(results)))
    if not results:
        results = ["The term COVID did not appear in the headlines this quarter!\n"]
    textfile = open("marin_covid_" + date + ".txt", "w")
    for idx, element in enumerate(results):
        element = str(element)
        # print(element)
        if idx == 0:
            textfile.write(date + "\n" + "Number of articles = " + num_art + "\n" + "\n" + element + "\n")
        else:
            textfile.write(element + "\n" + "\n")
    textfile.close()
files = ['marin_covid_today.txt', 'marin_covid_aug2021.txt', 'marin_covid_may2021.txt', 'marin_covid_feb2021.txt', 'marin_covid_nov2020.txt',
'marin_covid_aug2020.txt', 'marin_covid_may2020.txt', 'marin_covid_feb2020.txt']
with open("COVID_articles_in_MIJ.txt", "w") as outfile:
    for filename in files:
        print(filename)
        with open(filename) as infile:
            contents = infile.read()
            outfile.write(contents)
It works really well with a single keyword, but when I use the "or" operator to look for either of two keywords, it only searches for the first one. This can be reproduced by swapping the two keywords in the example, "covid" and "corona".
I know the problem lies in this lambda function, but I'm not sure how to fix it.
covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                           'href' in tag.attrs and
                           ('corona' or 'covid') in tag.get_text().lower())
This code should be fully executable if you have the prerequisites installed. All help is appreciated.
CodePudding user response:
As pointed out in the comments, the 'in' operator must appear on both sides of the 'or' operator. Python evaluates ('corona' or 'covid') first, and because a non-empty string is truthy, that expression always yields 'corona', so 'covid' is never tested. Writing a separate 'in' test for each keyword checks tag.get_text().lower() against both "corona" and "covid". The corrected lambda function is:
covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                           'href' in tag.attrs and
                           ('covid' in tag.get_text().lower() or 'corona' in tag.get_text().lower()))
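To see the behaviour without hitting the network, here is a minimal sketch in plain Python. It first shows what ('corona' or 'covid') actually evaluates to, then generalises the fix to any number of keywords with any(). The helper name mentions_keyword, the KEYWORDS tuple, and the sample headlines are made up for illustration and are not part of the original script.

```python
# `or` returns its first truthy operand, and any non-empty string is truthy,
# so the second keyword is never looked at.
first_only = ('corona' or 'covid')
print(first_only)  # corona

# Hypothetical keyword list and helper (assumptions, not from the original script).
KEYWORDS = ('corona', 'covid')

def mentions_keyword(text, keywords=KEYWORDS):
    """Return True if any keyword occurs in text, ignoring case."""
    lowered = text.lower()
    return any(word in lowered for word in keywords)

# Made-up headlines standing in for tag.get_text() results.
headlines = [
    "COVID cases rise in Marin",
    "Coronavirus testing sites expand",
    "County fair returns this summer",
]
for headline in headlines:
    print(headline, '->', mentions_keyword(headline))
```

If this approach suits you, the last condition of the lambda could become mentions_keyword(tag.get_text()), which also calls get_text() only once per tag.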