I'm fairly new to Python and working on a project in which I need all the quotes from certain people in a bunch of articles.
For this question I use this article as an example: https://www.theguardian.com/us-news/2021/oct/17/jeffrey-clark-scrutiny-trump-election-subversion-scheme
Right now, using a lambda, I am able to scrape text containing the names of the people I am looking for with the following code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.theguardian.com/us-news/2021/oct/17/jeffrey-clark-scrutiny-trump-election-subversion-scheme'
response = requests.get(url)
data=response.text
soup=BeautifulSoup(data,'html.parser')
tags=soup.find_all('p')
words = ["Michael Bromwich"]
for tag in tags:
    quotes=soup.find("p",{"class":"dcr-s23rjr"}, text=lambda text: text and any(x in text for x in words)).text
print(quotes)
... which returns the block of text containing "Michael Bromwich", which in this case actually is a quote in the article. But when scraping 100 articles, this does not do the job, as other blocks of text may also contain the indicated names without containing a quote. I only want the strings of text containing the quotes.
Therefore, my question: Is it possible to print all HTML strings under the following criteria:
Text BEGINS with the character " (quotation mark) OR - (hyphen) AND CONTAINS the names "Michael Bromwich" OR "John Johnson" etc.
Thank you!
CodePudding user response:
First of all, you do not need the for tag in tags loop; you just need to use soup.find_all with your condition.
Next, you can check for the quotation marks or hyphen without any regex:
quotes = [x.text for x in soup.find_all("p",{"class":"dcr-s23rjr"}, text=lambda t: t and (t.startswith("“") or t.startswith('"') or t.startswith("-")) and any(x in t for x in words))]
The (t.startswith("“") or t.startswith('"') or t.startswith("-")) part checks whether the text starts with “, " or -.
Or,
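quotes = [x.text for x in soup.find_all("p",{"class":"dcr-s23rjr"}, text=lambda t: t and t.strip() and t.strip()[0] in '“"-' and any(x in t for x in words))]
The t.strip()[0] in '“"-' part checks whether the first character of the stripped text is one of “, " or -. The extra t.strip() guard skips whitespace-only strings, for which t.strip()[0] would otherwise raise an IndexError.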
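Both checks can also be folded into one small helper that is easy to try out without any network access. This is only a sketch: the names is_quote_with_name and QUOTE_OPENERS are hypothetical (they are not part of BeautifulSoup), the sample strings are made up, and the set of opening characters is an assumption about how the articles mark quotes. It relies on the fact that str.startswith accepts a tuple of prefixes.

```python
# Hypothetical helper; these names are illustrative, not from any library.
QUOTE_OPENERS = ("“", '"', "-")  # assumed characters that open a quote


def is_quote_with_name(text, names):
    """Return True if text starts with a quote opener and mentions any name."""
    if not text:  # covers None and the empty string
        return False
    stripped = text.strip()  # ignore leading/trailing whitespace
    # str.startswith accepts a tuple, so one call tests all openers.
    return stripped.startswith(QUOTE_OPENERS) and any(
        name in stripped for name in names
    )


words = ["Michael Bromwich", "John Johnson"]
# Made-up sample strings, not taken from the article.
samples = [
    "“This is a quote,” said Michael Bromwich.",
    "Michael Bromwich is mentioned here without a quote.",
    "   - John Johnson: a dash-led quote.",
    "   ",
]
results = [is_quote_with_name(s, words) for s in samples]
print(results)  # [True, False, True, False]
```

With BeautifulSoup it could then be plugged in as soup.find_all("p", string=lambda t: is_quote_with_name(t, words)), where string is the newer name for the text argument. One caveat: this filter only sees paragraphs whose content is a single text node, so for paragraphs that contain links or other tags, looping over soup.find_all("p") and testing p.get_text() may be the safer route.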