Use OR in Lambda function - Web Scraping Python

Using this example - How to extract html links with a matching word from a website using python

I wrote a web scraping script to look for keywords in recent and cached versions of a local newspaper.

from bs4 import BeautifulSoup
import requests

urls = ["https://www.marinij.com/", 'https://web.archive.org/web/20210811185035/https://www.marinij.com/',
        'https://web.archive.org/web/20210506004633/https://www.marinij.com/','https://web.archive.org/web/20210211022431/https://www.marinij.com/',
        'https://web.archive.org/web/20201111174202/https://www.marinij.com/','https://web.archive.org/web/20200811204359/https://www.marinij.com/',
        'https://web.archive.org/web/20200511165943/https://www.marinij.com/','https://web.archive.org/web/20200209014056/https://www.marinij.com/',
        'https://web.archive.org/web/20191111061843/https://www.marinij.com/']

dates = ['today','aug2021','may2021','feb2021','nov2020','aug2020','may2020','feb2020','nov2019']

for i, (url,date) in enumerate(zip(urls,dates)):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")

    covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                            'href' in tag.attrs and
                            ('corona' or 'covid') in tag.get_text().lower())
    
    results = soup.find_all(covid_links)

    num_art = str((len(results)))
    if not results:
        results = ["The term COVID did not appear in the headlines this quarter!\n"]

    textfile = open("marin_covid_" + date + ".txt", "w")
    for idx, element in enumerate(results):
        element = str(element)
        # print(element)
        if idx == 0:
            textfile.write(date + "\n" + "Number of articles = " + num_art + "\n" + "\n" + element + "\n")

        else:
            textfile.write(element + "\n" + "\n")
    textfile.close()

files = ['marin_covid_today.txt', 'marin_covid_aug2021.txt', 'marin_covid_may2021.txt', 'marin_covid_feb2021.txt', 'marin_covid_nov2020.txt',
        'marin_covid_aug2020.txt', 'marin_covid_may2020.txt', 'marin_covid_feb2020.txt', 'marin_covid_nov2019.txt']

with open("COVID_articles_in_MIJ.txt", "w") as outfile:
    for filename in files:
        print(filename)
        with open(filename) as infile:
            contents = infile.read()
            outfile.write(contents)

It works really well with a single keyword, but when I use the "or" operator to match either of two keywords, it only searches for the first one. This can be reproduced by swapping the two keywords in the example, "covid" and "corona".

I know the problem lies in this lambda function, but I'm not sure how to fix it.

    covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                            'href' in tag.attrs and
                            ('corona' or 'covid') in tag.get_text().lower())

This code should be fully executable if you have the prerequisites installed. All help is appreciated.

CodePudding user response：

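As pointed out in the comments, the 'in' operator must appear on both sides of the 'or' operator, so that tag.get_text().lower() is tested against both keywords, "corona" and "covid". In the original, ('corona' or 'covid') is evaluated first and yields 'corona', because 'or' returns its first truthy operand and any non-empty string is truthy; the 'in' test therefore only ever looked for 'corona'. The correct lambda function is this: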

covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                            'href' in tag.attrs and
                            ('covid' in tag.get_text().lower() or 'corona' in tag.get_text().lower()))
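
If more keywords are added later, repeating the 'in' test for each one gets unwieldy. One possible alternative, offered as a sketch rather than as part of the original answer, is to move the keyword check into a small helper built on any(). The names KEYWORDS and contains_any are hypothetical and introduced here only for illustration; the plain-string headlines are made-up sample data.

```python
# Hypothetical keyword tuple; add or remove terms as needed.
KEYWORDS = ("covid", "corona")


def contains_any(text, keywords=KEYWORDS):
    """Return True if any keyword occurs in text, ignoring case."""
    lowered = text.lower()
    return any(word in lowered for word in keywords)


# Same filter as the corrected lambda, but the keyword test is delegated
# to contains_any, so a third keyword only means editing KEYWORDS.
covid_links = lambda tag: (getattr(tag, "name", None) == "a" and
                           "href" in tag.attrs and
                           contains_any(tag.get_text()))

# Why the original version failed: `or` returns its first truthy operand,
# so the parenthesised expression collapses to a single string.
first_operand = ("corona" or "covid")

# Made-up sample headlines to exercise the helper without any network access.
headlines = ["COVID cases rise", "Coronavirus update", "Local sports"]
flagged = [h for h in headlines if contains_any(h)]
```

The filter is used exactly as in the question, via soup.find_all(covid_links). It also calls tag.get_text() once per tag instead of once per keyword.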