Home > Software design >  Removing tags from a selected number by iterating using beautifulsoup
Removing tags from a selected number by iterating using beautifulsoup

Time:10-30

I am trying to clean html data using beautiful soup. I want to remove a set of tags along with the data associated in that tags which are consescutive starting from et_pb_row_inner et_pb_row_inner_2 to et_pb_row_inner et_pb_row_inner_22 .

enter image description here

The code which i was trying is like this

Code

def madisonsymphony(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    for h in soup.find_all('header'):
        try:
            h.extract()
        except:
            pass
    for f in soup.find_all('footer'):
        try:
            f.extract()
        except:
            pass
    tophead = soup.find("div",{"id":"top-header"})
    tophead.extract()
    for x in range(2,23):
        mydiv = soup.find("div", {"class": "et_pb_row_inner et_pb_row_inner_{}".format(x)})
        mydiv.extract()
    text = soup.getText(separator=u' ')
    return text

I got it by individually specifying the class name using find(), but how is it possible to do in a general manner.

CodePudding user response:

You could use a regex to find all the <div> tags that have those attributes and end in 2 or higher.

So basically the regex r'et_pb_row_inner et_pb_row_inner_([2-9]|[\d]{2,}).*' is saying find all the et_pb_row_inner et_pb_row_inner_ that end in a single digit of 2 through 9 or is a digit in length of two or more.

def madisonsymphony(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    for h in soup.find_all('header'):
        try:
            h.extract()
        except:
            pass
    for f in soup.find_all('footer'):
        try:
            f.extract()
        except:
            pass
    tophead = soup.find("div",{"id":"top-header"})
    tophead.extract()
    
    for mydiv in soup.find_all("div", {"class":re.compile(r'et_pb_row_inner et_pb_row_inner_([2-9]|[\d]{2,}).*')}):
        mydiv.extract()
    text = soup.getText(separator=u' ')
    return text

This way you don't need to hard code the range 2 through 21. It'll just go from 2 to whatever that last value is. The other way to do it is just use slicing.

def madisonsymphony(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    for h in soup.find_all('header'):
        try:
            h.extract()
        except:
            pass
    for f in soup.find_all('footer'):
        try:
            f.extract()
        except:
            pass
    tophead = soup.find("div",{"id":"top-header"})
    tophead.extract()
    
    mydivs = soup.find_all("div", {"class":re.compile(r'et_pb_row_inner et_pb_row_inner_.*')})
    for mydiv in mydivs[2:]: # Start at the 2nd element in the list and continue to the end
        mydiv.extract()
    text = soup.getText(separator=u' ')
    return text

Problem with that is you have to make the assumption theres a 0 and 1. If for whatever reason the attributes starts at 1, then youre keeping the 2, which would be the second element.

  • Related