Hello. I have a webpage to parse. The relevant HTML is shown below. I need to extract the price, which is plain text:
<div class="price">
"212,25 € "
<sup>HT</sup>
</div>
This is the only element with the "price" class on the page, so I call the find() method:
from bs4 import BeautifulSoup
from requests import get
soup = BeautifulSoup(get(url, headers=headers, params=params).content, 'lxml')
container = soup.find('div', class_="side-content") # Find a container (find(), not find_all(), so a single Tag is returned)
cost = container.find('div', {'class': 'price'}) # Find price class
cost_value = cost.next_sibling
cost is None. I have tried both .next_sibling and .text, but since find() returns None, I get an exception. Any ideas? Thanks.
CodePudding user response:
The trick here is:
cost = cost.find(text=True).strip()
Here find(text=True) returns the first text node inside the <div>, and strip() removes the surrounding whitespace. Because that first text node comes before the nested <sup>, the <sup> content is ignored.
Regarding the container, you wrote:
This is the only "price" class on the page
Then why bother with a container? Just search for the price directly:
from bs4 import BeautifulSoup
html = """
<div class="price">
212,25 € 
<sup>HT</sup>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
cost = soup.find('div', {'class': 'price'})
cost = cost.find(text=True).strip()
print(cost)
Will output:
212,25 €
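To see why find(text=True) skips the <sup>, it can help to compare it with get_text(). This is only a minimal sketch, assuming the <div> carries the "price" class as described in the question. It uses string=, which is the newer name for the text= argument (Beautiful Soup 4.4.0 onward):

```python
from bs4 import BeautifulSoup

# Sample markup, assumed to match the structure described in the question.
html = """
<div class="price">
212,25 €
<sup>HT</sup>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
price_div = soup.find("div", class_="price")

# First text node inside the div: the price, which comes before the <sup>.
first_text = price_div.find(string=True).strip()

# get_text() joins every text node, including the one inside <sup>.
all_text = price_div.get_text(" ", strip=True)

print(first_text)  # 212,25 €
print(all_text)    # 212,25 € HT
```

So get_text() is the simpler call when the whole label is wanted, while find(string=True) is handy when only the leading text matters.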
CodePudding user response:
OK, thanks everyone. I have resolved it. The problem was that the data is generated by JavaScript, so static parsing methods don't work on it. I tried several solutions (including Selenium and capturing the XHR results).
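One quick way to spot this situation is to check whether the element exists at all in the HTML that the server returns. A rough sketch follows; extract_price is a hypothetical helper, and the two HTML strings are made-up stand-ins for a static page and a JavaScript-rendered page shell:

```python
from bs4 import BeautifulSoup


def extract_price(html):
    """Return the price text, or None if the static HTML has no price div."""
    soup = BeautifulSoup(html, "html.parser")
    price_div = soup.find("div", class_="price")
    if price_div is None:
        # Probably filled in later by JavaScript, so a static parser never sees it.
        return None
    text = price_div.find(string=True)
    return text.strip() if text else None


# Hypothetical inputs: one page with the price present, one empty JS shell.
static_html = '<div class="price">212,25 € <sup>HT</sup></div>'
js_shell_html = '<div id="app"></div><script src="app.js"></script>'

print(extract_price(static_html))    # 212,25 €
print(extract_price(js_shell_html))  # None
```

If the helper returns None for the downloaded page while the browser shows a price, the content is most likely injected after load.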
Finally, inside my parsed data I found a static URL that links to a separate web page, where this JavaScript is executed and the result can be parsed with static methods. This video explains a similar solution: https://www.youtube.com/watch?v=3fcKKZMFbyA&ab_channel=RedEyedCoderClub