Unable to print the information in this div on a webpage? - Tried multiple methods - Python - BS4

Time:12-09

I'm currently having trouble pulling information from the webpage below:

"https://www.johnlewis.com/mulberry-bayswater-small-zipped-leather-handbag-summer-khaki/p5807862"

I am using the below code and I'm trying to print the product name, product price and number available in stock.

I can easily print the name and price, but I can't seem to print the number in stock.

I have tried both StockInformation_stock__3OYkv and DefaultTemplate_product-stock-information__dFTUx, but I get either nothing or the price again.

What am I doing wrong?

Thanks in advance.

import requests
from bs4 import BeautifulSoup

url = 'https://www.johnlewis.com/mulberry-bayswater-small-zipped-leather-handbag-summer-khaki/p5807862'
response = requests.get(url, headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'
})

soup = BeautifulSoup(response.content, 'html.parser')

numInStock = soup.find(class_="StockInformation_stock__3OYkv").get_text().strip()
productName = soup.find(id="confirmation-anchor-desktop").get_text().strip()
productPrice = soup.find(class_="ProductPrice_price__DcrIr").get_text().strip()

print (productName)
print (productPrice)
print (numInStock)

CodePudding user response:

The page you chose has some dynamic elements, meaning values that change frequently, such as the stock count. The HTML you get back first contains only the more static parts, such as the product name and price. The browser then makes supplementary requests to separate API URLs for the stock data (since it changes frequently) and injects the results into the original HTML page, which is why the name and price are there but the stock is not. In simple terms, the page is "still loading" when requests grabs it: a browser goes on to make hundreds of other requests for images, files, and data before it shows the full page you would normally see, and requests does none of that.
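A quick way to confirm the explanation above is to check whether a class name appears anywhere in the raw HTML that requests returns. The sketch below uses only the standard library; the names ClassFinder and class_in_static_html are made up for illustration, and sample_html is a hypothetical stand-in for response.text, not the real page.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any tag carries the target CSS class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # A class attribute may hold several space-separated names.
            if name == "class" and value and self.target_class in value.split():
                self.found = True


def class_in_static_html(html, target_class):
    """Return True if target_class appears on any tag in the given HTML."""
    finder = ClassFinder(target_class)
    finder.feed(html)
    return finder.found


# Hypothetical stand-in for response.text: the price element is present,
# but the stock element has not been injected because no JavaScript ran.
sample_html = '<div><span class="ProductPrice_price__DcrIr">1,150.00</span></div>'

print(class_in_static_html(sample_html, "ProductPrice_price__DcrIr"))      # True
print(class_in_static_html(sample_html, "StockInformation_stock__3OYkv"))  # False
```

If the check returns False for a class you can see in the browser's inspector, the element is most likely filled in by a later request rather than being part of the initial HTML.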

Fortunately, we only need one more request, which grabs the stock data.

To fix this, we are going to make an additional request to the URL that serves the stock information. I am unsure how much you know about reverse engineering, so I'll only touch on it lightly. I did some reverse engineering and found that the request is a POST to https://www.johnlewis.com/fashion-ui/api/stock/v2 with the JSON body {"skus":["240280782"]} (skus being a list of product SKUs). The SKU is available in the page source, so the full code to get the stock is as follows:

import requests
from bs4 import BeautifulSoup

url = 'https://www.johnlewis.com/mulberry-bayswater-small-zipped-leather-handbag-summer-khaki/p5807862'
response = requests.get(url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'
})

soup = BeautifulSoup(response.content, 'html.parser')

# the name and price are in the static HTML, so they can be read directly
# (the stock element is not, so soup.find() on its class would return None)
productName = soup.find(id="confirmation-anchor-desktop").get_text().strip()
productPrice = soup.find(class_="ProductPrice_price__DcrIr").get_text().strip()

# also find the sku by extracting the numbers out of the following mess found in the webpage:  ......"1,150.00"},"productId":"5807862","sku":"240280782","url":"https://www.johnlewis.com/mulberry-ba.....
sku = response.text.split('"sku":"')[1].split('"')[0]

# supplemental request with the newfound sku
response1 = requests.post('https://www.johnlewis.com/fashion-ui/api/stock/v2', headers={
    'authority': 'www.johnlewis.com',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
    'content-type': 'application/json',
    'accept': '*/*',
    'origin': 'https://www.johnlewis.com',
    'referer': url,
}, json={"skus": [sku]})
# returns the json: {"stocks":[{"skuId":"240280782","stockQuantity":2,"availabilityStatus":"SKU_AVAILABLE","stockMessage":"Only 2 in stock online","lastUpdated":"2021-12-05T22:03:27.613Z"}]}

# index the json; the stock count goes in numInStock, not productPrice
try:
    numInStock = response1.json()["stocks"][0]["stockQuantity"]
except (ValueError, KeyError, IndexError):
    print("There was an error getting the stock")
    numInStock = "NaN"

print(productName)
print(productPrice)
print(numInStock)
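The split-based SKU lookup in the code above raises an IndexError when the page source has no "sku" key. A slightly more defensive variant could use a regular expression and return None instead. This is only a sketch: extract_sku is a made-up helper name, and the pattern assumes the SKU is always a run of digits, which holds for the one sample shown here but is not confirmed for every product.

```python
import re

# Matches "sku":"<digits>" as it appears in the page source.
# Assumption: SKUs are purely numeric, as in the sample from this page.
SKU_PATTERN = re.compile(r'"sku"\s*:\s*"(\d+)"')


def extract_sku(page_source):
    """Return the first SKU found in the page source, or None if absent."""
    match = SKU_PATTERN.search(page_source)
    return match.group(1) if match else None


# Fragment of page source taken from the comment in the code above.
sample_source = (
    '"1,150.00"},"productId":"5807862","sku":"240280782",'
    '"url":"https://www.johnlewis.com/mulberry-ba'
)

print(extract_sku(sample_source))    # 240280782
print(extract_sku("<html></html>"))  # None
```

In the full script you would call extract_sku(response.text) and skip the stock request when it returns None.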

I also tested this with other products. Because the code simulates what the browser does (step 1: get the page template; step 2: use data from the template to make additional requests to the server), it works for any product URL.
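Since the stock endpoint accepts a list of SKUs, its response may hold several entries, so it can be safer to look up the entry by skuId rather than taking index 0. The sketch below works on the sample response quoted in the code comment above; parse_stock is a hypothetical helper, and the field names are taken from that one sample, so other responses may differ.

```python
import json


def parse_stock(payload, sku):
    """Return stockQuantity for the given SKU, or None if it is not listed."""
    for entry in payload.get("stocks", []):
        if entry.get("skuId") == sku:
            return entry.get("stockQuantity")
    return None


# Sample response copied from the comment in the code above.
sample_response = json.loads(
    '{"stocks":[{"skuId":"240280782","stockQuantity":2,'
    '"availabilityStatus":"SKU_AVAILABLE",'
    '"stockMessage":"Only 2 in stock online",'
    '"lastUpdated":"2021-12-05T22:03:27.613Z"}]}'
)

print(parse_stock(sample_response, "240280782"))  # 2
print(parse_stock(sample_response, "000000000"))  # None
```

With the real request, the payload would come from response1.json().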

This is EXTREMELY difficult and a pain. Don't be hard on yourself if you don't fully understand it, as you need knowledge of front end, back end, JSON, and parsing to get it.

Cheers!
