Extract paragraphs with beautiful soup without getting excess html?

Time:10-27

I am working on a project to scrape the facts from the following website: https://holypython.com/100-python-tips-tricks/

Thanks to an answer to a previous question, I have been able to get the titles from this website, but I am struggling to get the paragraphs below each title. Using the following code, I have tried filtering both for 'p' (which inspect element shows for the individual sentences) and for 'div.elementor-widget-wrap' (which is identified as the entire section).

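from urllib.request import urlopen
from bs4 import BeautifulSoup

# download the raw HTML of the page
url = "https://holypython.com/100-python-tips-tricks/"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# select every section wrapper on the page
paragraphs = soup.select("div.elementor-widget-wrap")

paragraphs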

However, both of these selectors return excess data, which is hard to remove and sort through (sample below).

<p>Python is cool, no doubt about it. But, there are some angles in Python that are even cooler than the usual Python stuff.</p>,
<p>Here you can find 100 Python tips and tricks carefully curated for you. There is probably something to impress almost anyone reading it.</p>,
<p><b>This is more like a Python Tricks Course that you can benefit from at various levels of difficulty and creativity. </b></p>,
<p>Holy Python is reader-supported. When you buy through links on our site, we may earn an affiliate commission.</p>,
<p>So, you might want to pace yourself to avoid burnout, skim it as you wish or go through each point if you’re a beginner. </p>,
<p>You can also blend it with your <b>#100daycoding</b> <b>#100daywithPython</b> routines.</p>,
<p>Tips are divided in 3 sections, this should make navigation a bit easier.: </p>,
<p>Enjoy.</p>,
<p >
  Flexible </p>,
<p >
  All </p>,
<p >
  100  </p>,
<p >
  0-35 </p>,
<p >
  35-70 </p>,
<p >
  70-100 </p>,
<p>If you took a Python course these somewhat niche methods might have escaped the curriculum. Or if you’re switching to Python from another language you might not have heard of these unique approaches even as a seasoned programmer.</p>,

How can I get only the paragraph for each fact, or at least remove the HTML? Thank you in advance.

CodePudding user response:

You can iterate over the results of soup.select("div.elementor-widget-wrap"), find the paragraphs inside each one, and take their .text to drop the HTML tags:

paragraphs = [para.text for div in soup.select("div.elementor-widget-wrap")
                        for para in div.find_all('p')]

Result of print(paragraphs[:5]):

['Python is cool, no doubt about it. But, there are some angles in Python that are even cooler than the usual Python stuff.',
 'Here you can find 100 Python tips and tricks carefully curated for you. There is probably something to impress almost anyone reading it.',
 'This is more like a Python Tricks Course that you can benefit from at various levels of difficulty and creativity.\xa0',
 'Holy Python is reader-supported. When you buy through links on our site, we may earn an affiliate commission.',
 'So, you might want to pace yourself to avoid burnout, skim it as you wish or go through each point if you’re a beginner.\xa0']
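The output above still carries trailing \xa0 characters, and short entries such as "Flexible" would come back padded with newlines and spaces. A minimal sketch of one way to tidy that up follows. It assumes a small hand-written sample_html in place of the real page so it runs offline, and clean_paragraphs is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
sample_html = """
<div class="elementor-widget-wrap">
  <p>Python is cool, no doubt about it.</p>
  <p><b>This is more like a Python Tricks Course.\xa0</b></p>
  <p >
    Flexible </p>
  <p></p>
</div>
"""

def clean_paragraphs(html):
    """Return the text of each <p> inside the section wrappers, tidied up."""
    soup = BeautifulSoup(html, features="html.parser")
    cleaned = []
    # A single CSS selector picks the <p> tags inside the wrapper divs.
    for para in soup.select("div.elementor-widget-wrap p"):
        # Turn non-breaking spaces into plain spaces, then collapse
        # every run of whitespace (including newlines) into one space.
        text = " ".join(para.get_text().replace("\xa0", " ").split())
        if text:  # skip paragraphs that are empty after cleaning
            cleaned.append(text)
    return cleaned

paragraphs = clean_paragraphs(sample_html)
print(paragraphs)
```

On the real page you would pass the downloaded html instead of sample_html. Whether this isolates only the paragraph for each fact depends on the page's markup, which this sketch does not inspect, so some further filtering may still be needed.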