Python - Beautifulsoup - extract text between <div> and <sup>

Time:03-02

html

Hello. I have a webpage to parse; the relevant HTML is shown below. I need to extract the price, which is plain text:

<div class="price">
"212,25 € "
<sup>HT</sup>
</div>

This is the only element with the "price" class on the page, so I call the find() method:

from bs4 import BeautifulSoup
from requests import get

# url, headers and params are defined elsewhere
soup = BeautifulSoup(get(url, headers=headers, params=params).content, 'lxml')
container = soup.find('div', class_="side-content")  # Find the container
cost = container.find('div', {'class': 'price'})  # Find the element with the "price" class
cost_value = cost.next_sibling

cost is None. I have tried the .next_sibling and .text attributes, but since find() returns None, I get an exception. Any ideas? Thanks.

CodePudding user response:

The trick here is:

cost = cost.find(text=True).strip()

Here find(text=True) returns the first text node inside the <div>, and strip() removes the surrounding whitespace.

Since the price is the first text node in the <div>, the text inside the nested <sup> is not included.
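To make the difference concrete, here is a minimal sketch comparing a few ways of reading the tag. It assumes the <div> carries class="price", as the question describes, and uses a made-up one-line snippet rather than the real page. In recent Beautiful Soup 4 releases the argument is spelled `string`; `text` is the older name for the same filter.

```python
from bs4 import BeautifulSoup

# Assumed markup: class="price" is taken from the question's description.
html = '<div class="price"> 212,25 € <sup>HT</sup></div>'

soup = BeautifulSoup(html, 'html.parser')
price_div = soup.find('div', class_='price')

# First text node inside the <div>: the price only.
first_text = price_div.find(string=True).strip()

# The same node, reached as the first child of the <div>.
first_child = str(price_div.contents[0]).strip()

# get_text() joins every text node, including the one inside <sup>.
all_text = price_div.get_text(strip=True)

print(first_text)   # 212,25 €
print(first_child)  # 212,25 €
print(all_text)     # the price followed by HT
```

So .text or get_text() would also pick up the "HT" suffix, while the first text node alone gives just the price.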


Regarding the container, you wrote:

"This is the only "price" class on the page"

Then why bother with the container? Just search for the price directly:


from bs4 import BeautifulSoup

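# class="price" as described in the question; the quotes the browser inspector
# shows around the text node are not part of the markup
html = """
<div class="price">
    212,25 €
    <sup>HT</sup>
</div>
"""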

soup = BeautifulSoup(html, 'html.parser')

cost = soup.find('div', {'class': 'price'})
cost = cost.find(text=True).strip()

print(cost)

Will output:

212,25 €

CodePudding user response:

OK, thanks everyone. I have resolved it. The problem was that the data is generated by JavaScript, so static parsing methods don't work on it. I tried several solutions (including Selenium and capturing the XHR responses).
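A quick way to confirm this kind of problem is to check whether the element exists at all in the HTML the server returns. The sketch below is only an illustration: `extract_price` is a hypothetical helper, and the two HTML strings are made-up stand-ins for a statically rendered page and a page whose price is filled in later by JavaScript.

```python
from bs4 import BeautifulSoup


def extract_price(html):
    """Return the price text, or None if the static HTML has no price <div>."""
    soup = BeautifulSoup(html, 'html.parser')
    price_div = soup.find('div', class_='price')
    if price_div is None:
        # Element absent from the static HTML, e.g. because a script adds it later.
        return None
    text = price_div.find(string=True)
    return text.strip() if text is not None else None


# Made-up examples: the second page only ships an empty container and a script.
static_page = '<div class="side-content"><div class="price"> 212,25 € <sup>HT</sup></div></div>'
js_page = '<div class="side-content"></div><script src="app.js"></script>'

print(extract_price(static_page))  # 212,25 €
print(extract_price(js_page))      # None
```

Checking for None before touching the result avoids the exception from the question and makes it obvious when the markup is simply not in the response.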

Finally, inside my parsed data I found a static URL that points to a separate web page where this JavaScript has already been executed, so that page can be parsed with static methods. This video explains a similar solution: https://www.youtube.com/watch?v=3fcKKZMFbyA&ab_channel=RedEyedCoderClub
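That last step might look roughly like the sketch below. Everything specific in it is an assumption, since the post does not show the actual page: the <iframe> tag, the example.com addresses and the helper name `find_static_url` are hypothetical, and the real URL may sit in a different tag or attribute. The resolved URL would then be fetched and parsed with the same static approach as before.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_static_url(html, base_url):
    """Return the absolute URL of the first <iframe> source, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    frame = soup.find('iframe', src=True)
    if frame is None:
        return None
    # Resolve a relative src against the URL of the page it was found on.
    return urljoin(base_url, frame['src'])


# Hypothetical markup and addresses, for illustration only.
page = '<div class="side-content"><iframe src="/widgets/price?id=42"></iframe></div>'
print(find_static_url(page, 'https://example.com/catalog/item'))
# https://example.com/widgets/price?id=42
```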
