I am having difficulty parsing HTML by "class" with BeautifulSoup. The idea is to get the price of an item from a website that has the following HTML structure:
So I need the £920 as text.
I have tried the following:
import ssl
from urllib.request import Request, urlopen

import bs4

# Assumption: ctx was not defined in the original snippet; a default SSL context is used here
ctx = ssl.create_default_context()
url = 'https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-t-t1-circle-pendant-69901190/'
# Request the URL with a browser-like User-Agent so the website does not block the request
req = Request(
    url=url,
    headers={'User-Agent': 'Mozilla/5.0'}
)
# Read the HTML code of the URL
webpage = urlopen(req, context=ctx).read()
soup = bs4.BeautifulSoup(webpage, "html.parser")
# Find the HTML element that holds the price
prices = soup.find("span", {"class": "product-description__addtobag_btn_text-static_price-wrapper_price"}).get_text()
print(prices)
But I am getting "[]" as the result. I believe that, because the product-description__addtobag_btn_text-static_price-wrapper_price element I am interested in is nested inside another element with the same class, BeautifulSoup returns the first one, which has no text. I am not sure how to overcome this.
Thank you!
CodePudding user response:
The data is not loaded as HTML but as JSON inside a script tag:
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "Product",
"description": "Tiffany T1 designs reinvent our iconic Tiffany T collection with bold profiles and powerful details. Precisely crafted in 18k yellow gold, this large circle pendant features a beveled edge that makes a striking statement. Wear it solo or layer with necklaces in different lengths for a distinctive look.",
"name": "Tiffany T T1 Circle Pendant",
"image": "//media.tiffany.com/is/image/Tiffany/EcomItemL2/tiffany-tt1-circle-pendant-69781926_1030892_ED.jpg?&op_usm=1.0,1.0,6.0&$cropN=0.1,0.1,0.8,0.8&defaultImage=NoImageAvailableInternal&",
"url": "https://www.tiffany.co.uk/jewellery/necklaces & pendants/tiffany-t-t1-circle-pendant-69901190/",
"sku": "69901190",
"category": "Necklaces & Pendants",
"material": ["Gold"],
"color": [],
"itemCondition": "New",
"brand": "Tiffany & Co.",
"offers":{
"@type": "Offer",
"priceCurrency": "GBP",
"price": "6675",
"url": "https://www.tiffany.co.uk/jewellery/necklaces & pendants/tiffany-t-t1-circle-pendant-69901190/"
}
}
</script>
So, you can use:
import json
data = json.loads(soup.find_all('script', {'type': 'application/ld+json'})[-1].get_text())
price = int(data['offers']['price'])
Output:
>>> price
6675
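The lookup above assumes that the last JSON-LD block on the page is the product and that "offers" is a single object. A slightly more defensive sketch is shown below. The function name find_product_price and the sample data are hypothetical and not part of the original answer; the function takes the raw text of each script block and returns the price of the first schema.org Product it finds.

```python
import json
from decimal import Decimal
from typing import Iterable, Optional


def find_product_price(json_ld_blocks: Iterable[str]) -> Optional[Decimal]:
    """Return the price of the first schema.org Product found, or None."""
    for raw in json_ld_blocks:
        try:
            data = json.loads(raw)
        except (TypeError, ValueError):
            continue  # skip empty or malformed blocks
        # A JSON-LD block may hold a single object or a list of objects.
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if not isinstance(item, dict) or item.get("@type") != "Product":
                continue
            offers = item.get("offers")
            # "offers" may be one Offer object or a list of them.
            if isinstance(offers, list):
                offers = offers[0] if offers else None
            if isinstance(offers, dict) and "price" in offers:
                return Decimal(str(offers["price"]))
    return None


# Hypothetical sample data modelled on the markup shown above.
sample_blocks = [
    '{"@context": "http://schema.org", "@type": "Organization", "name": "Example"}',
    '{"@context": "http://schema.org", "@type": "Product", '
    '"offers": {"@type": "Offer", "priceCurrency": "GBP", "price": "6675"}}',
]
print(find_product_price(sample_blocks))  # prints 6675
```

With the soup object from the question, the blocks could be collected with [tag.string for tag in soup.find_all('script', {'type': 'application/ld+json'})] and passed to this function. Decimal is used instead of int so that a price with pence would not be truncated.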
CodePudding user response:
The website you are trying to reach uses JavaScript to load the page.
One way to grab the price is using the Selenium package:
Code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-t-t1-circle-pendant-69901190/")
# Wait for the element to be present and visible on the page
element = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CLASS_NAME, "product-description__addtobag_btn_text-static_price-wrapper_price"))
)
price = element.text
print(price)
# Close the browser once the price has been read
driver.quit()
Output:
£6,675
More information can be found in the Selenium documentation.
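Selenium returns the price as display text ("£6,675"), so it may need converting before it can be compared or monitored numerically. The helper below is a minimal sketch; parse_price is a hypothetical name, and it assumes UK-style formatting where "," is a thousands separator and "." is the decimal point.

```python
import re
from decimal import Decimal


def parse_price(text: str) -> Decimal:
    """Extract the numeric amount from a display string such as '£6,675'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no numeric amount found in {text!r}")
    # Assumes ',' is a thousands separator and '.' the decimal point (UK style).
    return Decimal(match.group(0).replace(",", ""))


print(parse_price("£6,675"))  # prints 6675
```

In the Selenium code above this would be called as parse_price(element.text). Sites that format prices differently (for example "6.675,00 €") would need a different rule.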