I'm using BeautifulSoup with a custom lambda function to extract one specific piece of text from a web page. I'm struggling to select exactly what I need while leaving out what I don't.
<div >
<h3 class="article__headline">
<span >
Barron's </span>
<a href="https://www.marketwatch.com/articles/more-bad-times-ahead-for-these-6-big-tech-stocks-51652197183?mod=mw_quote_news">
More Bad Times Ahead for These 6 Big Tech Stocks
</a>
</h3>
<div >
<span data-est="2022-05-10T11:39:00">May. 10, 2022 at 11:39 a.m. ET</span>
</div>
</div>
</div>
I want to extract just the news headline, in this case "More Bad Times Ahead for These 6 Big Tech Stocks", and leave out the source label "Barron's".
So far my function looks like:
for txt in soup.find_all(lambda tag: tag.name == 'h3' and tag.get('class') == ['article__headline']):
    print(txt.text)
I've also tried tag.name == "a" and tag.get('class') == ['link'], but that returns a lot of other links from the page that I don't need.
Answer:
Try the CSS selector h3 a, which selects every <a> tag inside an <h3> tag:
for title in soup.select("h3 a"):
    print(title.text.strip())
Prints:
More Bad Times Ahead for These 6 Big Tech Stocks
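To try the selector without fetching the page, here is a minimal sketch that parses the snippet from the question as an inline string. The class on the <h3> is an assumption taken from the asker's own lambda (article__headline), and the built-in html.parser is used as the parser.

```python
from bs4 import BeautifulSoup

# Markup copied from the question. The class on <h3> is assumed from the
# asker's lambda; the other elements are left without classes.
html = """
<div>
  <h3 class="article__headline">
    <span>Barron's</span>
    <a href="https://www.marketwatch.com/articles/more-bad-times-ahead-for-these-6-big-tech-stocks-51652197183?mod=mw_quote_news">
      More Bad Times Ahead for These 6 Big Tech Stocks
    </a>
  </h3>
  <div>
    <span data-est="2022-05-10T11:39:00">May. 10, 2022 at 11:39 a.m. ET</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "h3 a" is a descendant selector: every <a> nested anywhere inside an <h3>.
titles = [a.get_text(strip=True) for a in soup.select("h3 a")]
print(titles)
```

By contrast, calling .text on the <h3> itself concatenates the text of all its descendants, which is why the original lambda also printed "Barron's".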
If you want to be more specific, also match the class on the <h3>:
for title in soup.select("h3.article__headline a"):
    print(title.text.strip())
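If you would rather keep a function filter with find_all, one possible sketch is to match the <a> tag and check its parent, instead of matching the <h3>. The markup below is a shortened stand-in for the real page: the href="#" values, the link class, and the "Unrelated link" element are hypothetical, and is_headline_link is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Shortened stand-in markup (hypothetical hrefs and a hypothetical extra link).
html = """
<h3 class="article__headline">
  <span>Barron's</span>
  <a class="link" href="#">More Bad Times Ahead for These 6 Big Tech Stocks</a>
</h3>
<a class="link" href="#">Unrelated link</a>
"""

soup = BeautifulSoup(html, "html.parser")


def is_headline_link(tag):
    """True for an <a> whose direct parent is <h3 class="article__headline">."""
    parent = tag.parent
    return (
        tag.name == "a"
        and parent is not None
        and parent.name == "h3"
        # "class" is multi-valued, so BeautifulSoup returns it as a list.
        and "article__headline" in (parent.get("class") or [])
    )


headlines = [a.get_text(strip=True) for a in soup.find_all(is_headline_link)]
print(headlines)
```

This only accepts <a> tags that are direct children of the headline <h3>, so it is slightly stricter than the h3.article__headline a selector, which matches at any nesting depth.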