Omitting specific text using BeautifulSoup

Time:05-12

I'm using BeautifulSoup with a custom lambda function to extract some very specific text from a website. I'm struggling to pick out exactly what I need while leaving out what I don't.

<div >
      
            <h3 >
                <span >
Barron&#x27;s                    </span>
                
                    <a  href="https://www.marketwatch.com/articles/more-bad-times-ahead-for-these-6-big-tech-stocks-51652197183?mod=mw_quote_news">
                        
                        
                        More Bad Times Ahead for These 6 Big Tech Stocks
                    </a>
            </h3>
        

        
        <div >
            <span  data-est="2022-05-10T11:39:00">May. 10, 2022 at 11:39 a.m. ET</span>

                
            
        </div>
    </div>

</div>

I want to extract just the news headline, in this case "More Bad Times Ahead for These 6 Big Tech Stocks", and leave out the source label "Barron's" that precedes it.

So far my function looks like:

for txt in soup.find_all(lambda tag: tag.name == 'h3' and tag.get('class') == ['article__headline']):
    print(txt.text)

I've also tried tag.name == 'a' and tag.get('class') == ['link'], but that returns a lot of other links from the webpage that I don't need.

Answer:

Try the CSS selector h3 a, which selects every <a> tag nested inside an <h3> tag:

for title in soup.select("h3 a"):
    print(title.text.strip())

Prints:

More Bad Times Ahead for These 6 Big Tech Stocks
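
For anyone who wants to reproduce this without fetching the page, here is a minimal self-contained sketch. The markup is a trimmed copy of the snippet in the question; the class names article__headline and link are assumptions taken from the question's own filter code, since the pasted HTML does not show them. It also shows why the original attempt picked up "Barron's": .text on the <h3> joins the text of every descendant, including the <span>.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question. The class attributes are
# assumptions based on the filters used in the question's code.
html = """
<div>
  <h3 class="article__headline">
    <span>Barron&#x27;s</span>
    <a class="link" href="https://www.marketwatch.com/articles/more-bad-times-ahead-for-these-6-big-tech-stocks-51652197183?mod=mw_quote_news">
      More Bad Times Ahead for These 6 Big Tech Stocks
    </a>
  </h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Text of the whole <h3>: every descendant string, source label included.
h3_text = soup.select_one("h3").get_text(" ", strip=True)

# Text of only the <a> nested inside the <h3>: the <span> is left out.
titles = [a.get_text(strip=True) for a in soup.select("h3 a")]

print(h3_text)
print(titles)
```

get_text(strip=True) does the same job as .text.strip() in the loop above; either should work here.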

If you want to be more specific, restrict the match to <h3> tags that carry the article__headline class:

for title in soup.select("h3.article__headline a"):
    print(title.text.strip())
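
If you would rather stay with find_all, as in the question, the same narrowing can be sketched roughly like this. The second <h3> (class "other") is a hypothetical element added only to show that it gets filtered out, and the class names are again assumptions taken from the question's code rather than from the real page.

```python
from bs4 import BeautifulSoup

# Assumed markup: the first <h3> mirrors the question's snippet, the second
# one is a made-up element that should NOT be matched.
html = """
<div>
  <h3 class="article__headline">
    <span>Barron&#x27;s</span>
    <a class="link" href="#">More Bad Times Ahead for These 6 Big Tech Stocks</a>
  </h3>
  <h3 class="other">
    <a class="link" href="#">Unrelated link</a>
  </h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

headlines = []
# Match only <h3> tags whose class list contains article__headline ...
for h3 in soup.find_all("h3", class_="article__headline"):
    # ... then take the first <a> inside each one, skipping the <span>.
    link = h3.find("a")
    if link is not None:
        headlines.append(link.get_text(strip=True))

# The CSS selector version should give the same result.
via_css = [a.get_text(strip=True) for a in soup.select("h3.article__headline a")]

print(headlines)
print(via_css)
```

Unlike tag.get('class') == ['article__headline'], the class_ argument and the .article__headline selector also match tags that carry additional classes, which tends to be more robust if the site adds modifiers later.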