I'm trying to scrape product specifications from an e-commerce website. I have a list of URLs to various products, and I need my code to visit each one (this part is easy) and scrape the product specs I need. I have been trying to use ParseHub; it works for some links but not for others. My suspicion is that a field such as 'Wheel Diameter' changes its position from page to page, so it ends up grabbing the wrong spec value.
One such part, for example, looks like this in HTML:
<div >
<span class="product-detail-key">Wheel Diameter</span>
<span data-product-custom-field="">8 Inches</span>
</div>
What I think I could do is use BeautifulSoup with something like:
if soup.find("span", class_="product-detail-key").text.strip() == "Wheel Diameter":
    # go to the next line and grab the string inside
How can I code this? I apologize if my question sounds silly; I'm pretty new to web scraping.
CodePudding user response:
You can use the .find_next() method:
from bs4 import BeautifulSoup
html_doc = """
<div >
<span class="product-detail-key">Wheel Diameter</span>
<span data-product-custom-field="">8 Inches</span>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
# `string=` is the current name of the argument formerly called `text=`
diameter = soup.find("span", string="Wheel Diameter").find_next("span").text
print(diameter)
Prints:
8 Inches
Or using a CSS selector with the adjacent sibling combinator (+):
diameter = soup.select_one('.product-detail-key:-soup-contains("Wheel Diameter") + *').text
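Since the question mentions that the spec changes position between pages, the .find_next() lookup can be wrapped in a small helper that searches by label and returns None when the label is absent. This is only a sketch: the name get_spec, the extra "Color" row, and the bare <div> wrappers are assumptions for illustration, not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same structure as the snippet in the question,
# with a second spec added to show that the order does not matter.
html_doc = """
<div>
  <span class="product-detail-key">Color</span>
  <span data-product-custom-field="">Black</span>
</div>
<div>
  <span class="product-detail-key"> Wheel Diameter </span>
  <span data-product-custom-field="">8 Inches</span>
</div>
"""


def get_spec(soup, label):
    """Return the text of the <span> that follows the label, or None."""
    # Compare against stripped text so whitespace around the label is ignored.
    key = soup.find("span", string=lambda s: s is not None and s.strip() == label)
    if key is None:
        return None
    value = key.find_next("span")
    return value.get_text(strip=True) if value is not None else None


soup = BeautifulSoup(html_doc, "html.parser")
print(get_spec(soup, "Wheel Diameter"))
print(get_spec(soup, "Weight"))  # label not present in this document
```

The lookup depends only on the label text, so it should keep working when the spec moves to a different row on another product page.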
CodePudding user response:
Using CSS selectors, you can simply chain/combine your selection to be more strict. In this case, you select the <span> that contains your string and use the adjacent sibling combinator (+) to get the next sibling <span>.
diameter = soup.select_one('.product-detail-key:-soup-contains("Wheel Diameter") + span').text
or
diameter = soup.select_one('span.product-detail-key:-soup-contains("Wheel Diameter") + span').text
Note: To avoid AttributeError: 'NoneType' object has no attribute 'text' when the element is not available, check that it exists before accessing its text attribute (the := operator requires Python 3.8+):
diameter = e.text if (e := soup.select_one('.product-detail-key:-soup-contains("Wheel Diameter") + span')) else None
Example
from bs4 import BeautifulSoup
html_doc = """
<div >
<span class="product-detail-key">Wheel Diameter</span>
<span data-product-custom-field="">8 Inches</span>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
diameter = e.text if (e := soup.select_one('.product-detail-key:-soup-contains("Wheel Diameter") + span')) else None
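If you need several specs per page, one option is to collect every key/value pair into a dict once and then look up fields by name. The sketch below assumes each label <span> carries the product-detail-key class and is immediately followed by a sibling <span> holding the value; parse_specs and the "Color" row are hypothetical names and data used only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the snippet in the question.
html_doc = """
<div>
  <span class="product-detail-key">Color</span>
  <span data-product-custom-field="">Black</span>
</div>
<div>
  <span class="product-detail-key">Wheel Diameter</span>
  <span data-product-custom-field="">8 Inches</span>
</div>
"""


def parse_specs(html):
    """Map each spec label to the text of the <span> sibling that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    specs = {}
    for key in soup.select("span.product-detail-key"):
        value = key.find_next_sibling("span")
        if value is not None:  # skip labels that have no value next to them
            specs[key.get_text(strip=True)] = value.get_text(strip=True)
    return specs


specs = parse_specs(html_doc)
print(specs.get("Wheel Diameter"))
print(specs.get("Weight"))  # missing specs yield None instead of raising
```

Using dict.get() gives the same protection against missing elements as the check shown in the note above, without repeating a selector for every field.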