Use OR in Lambda function - Web Scraping Python

Time:11-16

Using this example - How to extract html links with a matching word from a website using python

I wrote a web scraping script to look for keywords in recent and cached versions of a local newspaper.

from bs4 import BeautifulSoup
import requests

urls = ["https://www.marinij.com/", 'https://web.archive.org/web/20210811185035/https://www.marinij.com/',
        'https://web.archive.org/web/20210506004633/https://www.marinij.com/','https://web.archive.org/web/20210211022431/https://www.marinij.com/',
        'https://web.archive.org/web/20201111174202/https://www.marinij.com/','https://web.archive.org/web/20200811204359/https://www.marinij.com/',
        'https://web.archive.org/web/20200511165943/https://www.marinij.com/','https://web.archive.org/web/20200209014056/https://www.marinij.com/',
        'https://web.archive.org/web/20191111061843/https://www.marinij.com/']

dates = ['today','aug2021','may2021','feb2021','nov2020','aug2020','may2020','feb2020','nov2019']

for i, (url,date) in enumerate(zip(urls,dates)):
    r = requests.get(url)
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(r.content, "html.parser")

    covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                            'href' in tag.attrs and
                            ('corona' or 'covid') in tag.get_text().lower())
    
    results = soup.find_all(covid_links)

    num_art = str((len(results)))
    if not results:
        results = ["The term COVID did not appear in the headlines this quarter!\n"]

    textfile = open("marin_covid_" + date + ".txt", "w")
    for idx, element in enumerate(results):
        element = str(element)
        # print(element)
        if idx == 0:
            textfile.write(date + "\n" + "Number of articles = " + num_art + "\n" + "\n" + element + "\n")

        else:
            textfile.write(element + "\n" + "\n")
    textfile.close()

files = ['marin_covid_today.txt', 'marin_covid_aug2021.txt', 'marin_covid_may2021.txt', 'marin_covid_feb2021.txt', 'marin_covid_nov2020.txt',
        'marin_covid_aug2020.txt', 'marin_covid_may2020.txt', 'marin_covid_feb2020.txt']

with open("COVID_articles_in_MIJ.txt", "w") as outfile:
    for filename in files:
        print(filename)
        with open(filename) as infile:
            contents = infile.read()
            outfile.write(contents)

It works really well with a single keyword, but when I use the "or" operator to look for more than one keyword, it only searches for the first word. This can be reproduced by swapping the two keywords in the example, "covid" and "corona".

I know the problem lies in this lambda function, but I'm not sure how to fix it.

    covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                            'href' in tag.attrs and
                            ('corona' or 'covid') in tag.get_text().lower())

This code should be fully executable if you have the prerequisites installed. All help is appreciated.

CodePudding user response:

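As pointed out in the comments, the issue is that the 'in' operator must appear on both sides of the 'or' operator, so that the text being searched, in this case tag.get_text().lower(), is tested against both keywords, "corona" and "covid". In the original, ('corona' or 'covid') is evaluated first, and because 'or' returns its first truthy operand, it yields just 'corona', so 'covid' is never checked. The correct lambda function is this: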

covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                            'href' in tag.attrs and
                            ('covid' in tag.get_text().lower() or 'corona' in tag.get_text().lower()))
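
If more keywords are added later, repeating the 'in' test for each one gets unwieldy. A minimal sketch of one alternative, assuming a keyword tuple and a helper named mentions_keyword (both hypothetical, not part of the original script), uses any() to run the same membership test once per keyword. The first line also shows why the original expression only ever looked for 'corona'.

```python
# `or` returns its first truthy operand, so the parenthesised expression
# collapses to a single string before `in` is ever applied.
first_truthy = ('corona' or 'covid')  # evaluates to 'corona'

# Hypothetical keyword list for illustration; extend as needed.
KEYWORDS = ('covid', 'corona')


def mentions_keyword(text, keywords=KEYWORDS):
    """Return True if any keyword occurs in text, ignoring case."""
    lowered = text.lower()
    return any(word in lowered for word in keywords)


def covid_links(tag):
    """Filter for soup.find_all: <a> tags with an href whose text mentions a keyword."""
    return (getattr(tag, 'name', None) == 'a'
            and 'href' in tag.attrs
            and mentions_keyword(tag.get_text()))
```

The named function can be passed to soup.find_all(covid_links) in the same way as the lambda, and the text is lowercased only once per tag.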