I'm practicing web scraping by extracting information from the website https://www.kerastase.com.au/. As an example, I'm focusing on the Best Seller items (7 items). I have been able to extract the name, description and price of each item using the following code.
import requests
from bs4 import BeautifulSoup
url='https://www.kerastase.com.au/'
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
prod_names = soup.find_all("h3", class_="c-product-tile__name")
prod_names = [prod.get_text() for prod in prod_names]
prices = soup.find_all("span", class_="c-product-price__value")
# Drop the leading "A$" from each price and skip empty price spans
prices = [float(price.get_text()[2:]) for price in prices if (len(price) > 0)]
prod_descs = soup.find_all("p", class_="c-product-tile__description")
prod_descs = [desc.get_text() for desc in prod_descs]
However, extracting the rating and number of reviews seems to be more complicated, because they sit inside nested divs. I have been able to extract the caption of the first item using the following command, but the result is a mess and I don't know what to do after this step:
soup.find_all('figcaption', class_="c-product-tile__caption")[0]
Here is an example of full caption of one item I get:
<figcaption > <div > <div > <button aria-label="Add to Wishlist Elixir Ultime Pride Edition Hair Oil" aria-pressed="" data-analytics='{"products":[{"pid":"3474637116088","title":"Elixir Ultime Pride Edition Hair Oil","description":"","url":"https://www.kerastase.com.au/collections/elixir-ultime/elixir-ultime-pride-edition-hair-oil/3474637116088.html","imgUrl":"https://www.kerastase.com.au/on/demandware.static/-/Sites-kerastase-master-catalog/default/dw377882d1/2022/Elixir Ultime/Pride/1. Product.jpg","currency":"AUD","price":65,"name":"Elixir Ultime Pride Edition Hair Oil","subname":"Iconic nourishing hair oil for all hair types. Kérastase will be donating to Minus18, subsidising LGBTQIA Inclusion Workshops for schools across Australia.","id":"elixir-pride","salePrice":65,"brand":"Kérastase","category":"others/collections/elixir ultime","productTopCategory":"products","variant":"100 ml","size":"100 ml","color":"","fragrance":"","stock":"in stock","autoReplenishmentInterval":"not present","upc":"3474637116088","regularPrice":null,"isProductSet":false,"isProductGroup":false,"isBundle":false,"bundleID":"","rating":5,"numberReviews":2,"vtoState":"not present","collection":["Elixir Ultime"],"customizations":{"engraving":"not present"},"badges":"none","remainingStock":null}],"label":"elixir ultime pride edition hair oil::3474637116088","category":"{{dataLayer.page.category}}"}' data-component="product/AddToWishlist" data-component-options='{"pid":"3474637116088","url":{"add":"https://www.kerastase.com.au/on/demandware.store/Sites-kerastase-au-ng-Site/en_AU/Wishlist-AddToWishList","remove":"https://www.kerastase.com.au/on/demandware.store/Sites-kerastase-au-ng-Site/en_AU/Wishlist-RemoveFromWishList"},"text":{"title":{"add":"Add to Wishlist","remove":"Remove from Wishlist"},"accessibility":{"addAriaLabel":"Add to Wishlist Elixir Ultime Pride Edition Hair Oil","removeAriaLabel":"Remove from Wishlist Elixir Ultime Pride Edition Hair Oil"}},"isLabel":false}' title="Add to 
Wishlist"> <span data-js-wishlist-text="">Wishlist</span> </button> </div> <h3 ><a data-js-product-name="" data-lora-datalayer='{"products":{"3474637116088":{"name":"Elixir Ultime Pride Edition Hair Oil"}}}' href="/collections/elixir-ultime/elixir-ultime-pride-edition-hair-oil/elixir-pride.html"> Elixir Ultime Pride Edition Hair Oil </a></h3><p > Iconic nourishing hair oil for all hair types. Kérastase will be donating to Minus18, subsidising LGBTQIA Inclusion Workshops for schools across Australia. </p> <div > <div > <div data-bv-productid="elixir-pride" data-bv-redirect-url="/collections/elixir-ultime/elixir-ultime-pride-edition-hair-oil/elixir-pride.html" data-bv-seo="false" data-bv-show="inline_rating" data-component="product/BazaarvoiceService"> </div> </div> <div > <div data-component="product/ProductPrice" data-component-options='{"pid":"3474637116088","reloadData":{"configid":null},"dataModelId":"productprice"}'> <span data-js-pricelabel="">Old price</span> <span data-js-standardprice=""></span> <span data-js-pricelabel="">New price</span> <span data-js-saleprice="">A$65.00</span> </div> </div> </div> <div > <div > </div> <div > <div >One size available</div> <div > <span data-js-pid="">100 ml</span> </div> </div> </div> </div> <div data-js-producttile-actions=""> <div data-component="global/ComponentPlaceholder" data-component-options='{"_lazyload":true,"reloadData":{"id":"productmainaction","section":"product","configid":"producttile","reloadUrl":"https://www.kerastase.com.au/on/demandware.store/Sites-kerastase-au-ng-Site/en_AU/CDSLazyload-product_productmainaction?configid=producttile&data=3474637116088&id=productmainaction&pageId=homepage&section=product"}}'> <button > <span>Loading ...</span> </button> </div> </div> </figcaption>
How can I get each product's rating and number of reviews from this? Example: "rating":5,"numberReviews":2
(It is probably possible to get all of the product info from the above, but I don't know what the best method is.)
CodePudding user response:
If you look at the main tag for each product, the product details are stored in the data-analytics attribute of the button tag. That attribute contains JSON-formatted data, so we can parse it and pick out the relevant information:
import json

# Each product tile has a button whose data-analytics attribute holds a JSON string
main_tag = soup.find_all("div", class_="c-product-tile__figure")
dict1 = {}
for i in range(len(main_tag)):
    json_data = main_tag[i].find("button")['data-analytics']
    details = json.loads(json_data)
    # The product fields live in the first entry of the "products" list
    product = details['products'][0]
    price = product['price']
    rating = product['rating']
    numberReviews = product['numberReviews']
    title = product['title']
    dict1[i] = {'name': title, 'price': price, 'rating': rating, 'reviews': numberReviews}
Output:
{0: {'name': 'Elixir Ultime Pride Edition Hair Oil',
'price': 65,
'rating': 5,
'reviews': 2},
1: {'name': 'Nutritive 8HR Magic Night Hair Serum',
'price': 67,
'rating': 4.5701,
'reviews': 749},
....
}
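If some tiles have no data-analytics attribute, or a product has no rating yet, the loop above would raise an error. Here is a rough sketch of a more defensive version of the same idea. It runs on a small made-up HTML snippet instead of the live site, so SAMPLE_HTML, the "No Reviews Yet" product and the extract_products helper are all assumptions for illustration, not the real page markup:

```python
import json
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the page HTML (not the real markup).
SAMPLE_HTML = """
<figcaption>
  <button data-analytics='{"products":[{"title":"Elixir Ultime Pride Edition Hair Oil","price":65,"rating":5,"numberReviews":2}]}'>Wishlist</button>
</figcaption>
<figcaption>
  <button data-analytics='{"products":[{"title":"No Reviews Yet","price":10}]}'>Wishlist</button>
</figcaption>
<figcaption>
  <button>Loading ...</button>
</figcaption>
"""

def extract_products(html):
    """Collect name, price, rating and review count from data-analytics buttons."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Only consider buttons that actually carry the data-analytics attribute
    for button in soup.find_all("button", attrs={"data-analytics": True}):
        try:
            details = json.loads(button["data-analytics"])
        except json.JSONDecodeError:
            continue  # skip attributes that are not valid JSON
        for product in details.get("products", []):
            # .get() returns None instead of raising when a key is missing
            rows.append({
                "name": product.get("title"),
                "price": product.get("price"),
                "rating": product.get("rating"),
                "reviews": product.get("numberReviews"),
            })
    return rows

products = extract_products(SAMPLE_HTML)
for row in products:
    print(row)
```

To try it on the real page you would pass response.text instead of SAMPLE_HTML. Missing fields come back as None, so you can filter them out afterwards.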