Scraping text containing certain characters and names in Python?

I'm fairly new to Python and working on a project in which I need all the quotes from certain people in a bunch of articles.

For this question I use this article as an example: https://www.theguardian.com/us-news/2021/oct/17/jeffrey-clark-scrutiny-trump-election-subversion-scheme

Right now, using a lambda, I am able to scrape text containing the names of the people I am looking for with the following code:

import requests
from bs4 import BeautifulSoup
url = 'https://www.theguardian.com/us-news/2021/oct/17/jeffrey-clark-scrutiny-trump-election-subversion-scheme'
response = requests.get(url)
data=response.text
soup=BeautifulSoup(data,'html.parser')
tags=soup.find_all('p')
words = ["Michael Bromwich"]
for tag in tags:
    quotes=soup.find("p",{"class":"dcr-s23rjr"}, text=lambda text: text and any(x in text for x in words)).text

print(quotes)

... which returns the block of text containing "Michael Bromwich", which in this case actually is a quote in the article. But when scraping 100 articles, this does not do the job, as other blocks of text may also contain the indicated names without containing a quote. I only want the strings of text containing the quotes.

Therefore, my question: Is it possible to print all HTML strings under the following criteria:

Text BEGINS with the character " (quotation mark) OR - (hyphen) AND CONTAINS the names "Michael Bromwich" OR "John Johnson" etc.

Thank you!

CodePudding user response:

First of all, you do not need the for tag in tags loop; you just need to call soup.find_all with your condition.

Next, you can check for the quotation marks or hyphen without any regex:

quotes = [x.text for x in soup.find_all("p",{"class":"dcr-s23rjr"}, text=lambda t: t and (t.startswith("“") or t.startswith('"') or t.startswith("-")) and any(x in t for x in words))]

The (t.startswith("“") or t.startswith('"') or t.startswith("-")) part will check if the text starts with “, " or -.

Or,

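quotes = [x.text for x in soup.find_all("p",{"class":"dcr-s23rjr"}, text=lambda t: t and t.strip() and t.strip()[0] in '“"-' and any(x in t for x in words))]

The t.strip()[0] in '“"-' part checks whether the first character of the stripped text is one of “, " or -. The extra t.strip() check skips whitespace-only strings, for which t.strip()[0] would raise an IndexError.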
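
The same condition can also be pulled out of the lambda into a small named function, which makes it easier to test on plain strings before running it against real pages. The following is only a sketch: the function name is_named_quote, the QUOTE_OPENERS tuple and the sample strings are made up for illustration and do not come from the article.

```python
# Characters treated as the start of a quote (an assumed set; extend as needed).
QUOTE_OPENERS = ("“", '"', "-")


def is_named_quote(text, names, openers=QUOTE_OPENERS):
    """Return True if text starts with a quote opener and mentions any of names."""
    if not text:  # covers None and the empty string
        return False
    stripped = text.strip()
    # str.startswith accepts a tuple of prefixes and is False for an empty string,
    # so whitespace-only input is rejected without indexing into it.
    return stripped.startswith(openers) and any(name in stripped for name in names)


words = ["Michael Bromwich", "John Johnson"]

# Hypothetical sample strings, not taken from the article.
samples = [
    "“This is a quote,” said Michael Bromwich.",
    "Michael Bromwich was mentioned in passing.",
    "   ",
    None,
    '- "Another remark," John Johnson added.',
]
for sample in samples:
    print(repr(sample), is_named_quote(sample, words))
```

With BeautifulSoup this could be plugged in as soup.find_all("p", string=lambda t: is_named_quote(t, words)), where string= is the current name of the text= argument. Keep in mind that this filter is applied to the tag's .string, which is None for paragraphs that contain nested tags such as links, so those paragraphs are skipped; looping over soup.find_all("p") and testing p.get_text() is one way around that.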