BeautifulSoup: Duplicated Class parsing how to overcome

Time:01-13

I am having difficulty parsing HTML by class with BeautifulSoup. The goal is to get the price of an item from a website whose HTML looks like this:

HTML structure

So I need the price, £920, as text.

I have tried the following:

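import ssl
from urllib.request import Request, urlopen

import bs4

url = 'https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-t-t1-circle-pendant-69901190/'

# ctx was not defined in the original snippet; a default SSL context is assumed here
ctx = ssl.create_default_context()

# Open the URL with a custom User-Agent so the website does not block the request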
req = Request(
    url=url,
    headers={'User-Agent': 'Mozilla/5.0'}
)

#Read the HTML code of the url
webpage = urlopen(req, context=ctx).read()
soup = bs4.BeautifulSoup(webpage, "html.parser")

#Define the HTML element we need to screen and find prices
prices = soup.find("span", {"class": "product-description__addtobag_btn_text-static_price-wrapper_price"}).get_text()
print(prices)

I am getting "[]" as the result. I believe this is because the span I am interested in is nested inside another element with the same class, product-description__addtobag_btn_text-static_price-wrapper_price, so BeautifulSoup returns the first (outer) one, which has no text. I am not sure how to overcome this.

Thank you!

CodePudding user response:

The data is not in the regular HTML elements; it is embedded as JSON inside a script tag:

<script type="application/ld+json">
            {
                "@context": "http://schema.org",
                "@type": "Product",
                "description": "Tiffany T1 designs reinvent our iconic Tiffany T collection with bold profiles and powerful details. Precisely crafted in 18k yellow gold, this large circle pendant features a beveled edge that makes a striking statement. Wear it solo or layer with necklaces in different lengths for a distinctive look.",
                "name": "Tiffany T T1 Circle Pendant",
                "image": "//media.tiffany.com/is/image/Tiffany/EcomItemL2/tiffany-tt1-circle-pendant-69781926_1030892_ED.jpg?&op_usm=1.0,1.0,6.0&$cropN=0.1,0.1,0.8,0.8&defaultImage=NoImageAvailableInternal&",
                "url": "https://www.tiffany.co.uk/jewellery/necklaces & pendants/tiffany-t-t1-circle-pendant-69901190/",
                "sku": "69901190",
                "category": "Necklaces & Pendants",
                "material": ["Gold"],
                "color": [],
                "itemCondition": "New",
                "brand": "Tiffany & Co.",
                "offers":{
                    "@type": "Offer",
                    "priceCurrency": "GBP",
                    "price": "6675",
                    "url": "https://www.tiffany.co.uk/jewellery/necklaces & pendants/tiffany-t-t1-circle-pendant-69901190/"
                }
            }
        </script>

So, you can use:

import json

# soup is the BeautifulSoup object built in the question
data = json.loads(soup.find_all('script', {'type': 'application/ld+json'})[-1].get_text())
price = int(data['offers']['price'])

Output:

>>> price
6675
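
A slightly more defensive variant of the snippet above, as a sketch: instead of assuming the product data sits in the last script tag, it scans every JSON-LD block for one whose @type is Product. SAMPLE_HTML and find_product_offer are hypothetical names used for illustration, and the sample markup is a trimmed stand-in, not the real page.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the real page
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "BreadcrumbList"}</script>
<script type="application/ld+json">
{"@type": "Product", "name": "Tiffany T T1 Circle Pendant",
 "offers": {"@type": "Offer", "priceCurrency": "GBP", "price": "6675"}}
</script>
</head><body></body></html>
"""


def find_product_offer(html):
    """Return the 'offers' dict of the first JSON-LD Product block, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # skip empty or malformed blocks
        if isinstance(data, dict) and data.get("@type") == "Product":
            return data.get("offers")
    return None


offer = find_product_offer(SAMPLE_HTML)
print(offer["priceCurrency"], int(offer["price"]))
```

Returning None when no Product block is found lets the caller decide whether to fall back to another approach, such as a browser-driven one.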

CodePudding user response:

The website you are trying to scrape uses JavaScript to render the page.
One way to get the price is with the Selenium package:

Code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-t-t1-circle-pendant-69901190/")

# Wait for the element to be present and visible on the page
element = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CLASS_NAME, "product-description__addtobag_btn_text-static_price-wrapper_price"))
)

price = element.text
print(price)

# Close the browser once the price has been read
driver.quit()

Output:

£6,675
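
Note that element.text gives the price as it is displayed, while the JSON approach gives a bare number. If a numeric value is needed from the displayed string, something like the following sketch may help. parse_displayed_price is a hypothetical helper, and it assumes UK-style formatting (comma as thousands separator, dot as decimal point).

```python
import re
from decimal import Decimal


def parse_displayed_price(text):
    """Convert a displayed price such as '£6,675' to a Decimal."""
    # Keep digits and the decimal point; drop the currency symbol
    # and thousands separators (assumes UK-style formatting).
    cleaned = re.sub(r"[^\d.]", "", text)
    if not cleaned:
        raise ValueError(f"no digits found in {text!r}")
    return Decimal(cleaned)


print(parse_displayed_price("£6,675"))
```

Decimal is used rather than float so that prices with pence, such as '£1,234.50', keep their exact value.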

More information on the Selenium package can be found in its official documentation.
