I'm building a web scraper to pull product data from a website. This particular company hides the price behind a "Login for Price" banner, but the price is still present in the HTML inside a `<script type="text/javascript">` tag, and I'm unable to pull it out. The specific link I'm testing is https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/
My current code is below; the last line is the one I'm using to pull the text out.
```
import requests
from bs4 import BeautifulSoup
import pandas as pd
baseurl="https://www.chadwellsupply.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
productlinks = []
for x in range (1,3):
    response = requests.get(f'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/?q=&filter=&clearedfilter=undefined&orderby=19&pagesize=24&viewmode=list&currenttab=products&pagenumber={x}&articlepage=')
    soup = BeautifulSoup(response.content,'html.parser')
    productlist = soup.find_all('div', class_="product-header")
    for item in productlist:
        for link in item.find_all('a', href = True):
            productlinks.append(link['href'])
testlink = 'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/'
response = requests.get(testlink, headers = headers)
soup = BeautifulSoup(response.content,'html.parser')
print(soup.find('div',class_="product-title").text.strip())
print(soup.find('p',class_="status").text.strip())
print(soup.find('meta',{'property':'og:url'}))
print(soup.find('div',class_="tab-pane fade show active").text.strip())
print(soup.find('div',class_="Chadwell-Shared-Breadcrumbs").text.strip())
print(soup.find('script',{'type':'text/javascript'}).text.strip())
```
Below is the chunk of script from the website (I tried to paste it directly here, but it wouldn't format correctly) that I'm expecting it to pull, but what it gives me is "window.dataLayer = window.dataLayer || [];"
Ideally I'd like to pull out just the price, but if I can at least get the whole chunk of data out, I can extract the price manually.
CodePudding user response:
You can use the `re`/`json` modules to search/parse the HTML data (obviously, `beautifulsoup` cannot parse JavaScript; another option is to use `selenium`).
```
import re
import json
import requests

url = "https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/"
html_doc = requests.get(url).text
data = re.search(r"ga\('ec:addProduct', (.*?)\);", html_doc).group(1)
data = json.loads(data)
print(data)
```
Prints:
```
{
    "id": "301078",
    "name": 'HOTPOINT® 24" SPACESAVER ELECTRIC RANGE - WHITE',
    "category": "Stove/ Ranges",
    "brand": "Hotpoint",
    "price": "759",
}
```
Then for price you can do:
```
print(data["price"])
```
Prints:
```
759
```
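If the page layout ever changes, `re.search` returns `None` and `.group(1)` raises an `AttributeError`. Here is a minimal sketch of a guarded version of the same regex idea, run against an inline sample rather than the live site. `SAMPLE_HTML`, `PRODUCT_RE`, and `extract_product` are hypothetical names, and the sample only approximates the shape of the real script:

```python
import json
import re

# Hypothetical stand-in for the page source; the real markup may differ.
SAMPLE_HTML = """
<script type="text/javascript">window.dataLayer = window.dataLayer || [];</script>
<script type="text/javascript">
ga('ec:addProduct', {"id": "301078", "brand": "Hotpoint", "price": "759"});
</script>
"""

# Capture the {...} argument passed to ga('ec:addProduct', ...).
PRODUCT_RE = re.compile(r"ga\('ec:addProduct',\s*(\{.*?\})\);", re.DOTALL)


def extract_product(html_doc):
    """Return the ec:addProduct payload as a dict, or None if it is absent."""
    match = PRODUCT_RE.search(html_doc)
    if match is None:
        return None
    return json.loads(match.group(1))


product = extract_product(SAMPLE_HTML)
print(product["price"] if product else "no product data found")
```

With the live page you would pass `requests.get(url).text` to `extract_product` instead of the sample string.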
CodePudding user response:
A hacky alternative to regex is to pick out the right script by a function it contains. In your case, the script contains `function(i,s,o,g,r,a,m)`.
```
from bs4 import BeautifulSoup
import requests
import json

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
testlink = 'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/'
response = requests.get(testlink, headers = headers)
soup = BeautifulSoup(response.content,'html.parser')
# Keep the text of the script that contains the analytics snippet
for el in soup.find_all("script"):
    if "function(i,s,o,g,r,a,m)" in el.text:
        scripttext = el.text
```
You can then select the data.
```
extracted = scripttext.split("{")[-1].split("}")[0]
my_json = json.loads("{%s}" % extracted)
print(my_json)
#{'id': '301078', 'name': 'HOTPOINT® 24" SPACESAVER ELECTRIC RANGE - WHITE', 'category': 'Stove/ Ranges', 'brand': 'Hotpoint', 'price': '759'}
```
Then get the price.
```
print(my_json["price"])
#759
```
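Splitting on braces breaks if the JSON ever contains a nested object or a `}` inside a string. One possible alternative, which is not part of the answer above, is to let `json.JSONDecoder.raw_decode` read exactly one JSON value starting at the first `{` after a marker. This is only a sketch: `SCRIPT_TEXT` and `first_json_object_after` are hypothetical, and the `dims` field is invented purely to show nesting.

```python
import json

# Hypothetical script text; the "dims" field is invented to show nesting.
SCRIPT_TEXT = """
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;})(window,document);
ga('ec:addProduct', {"id": "301078", "price": "759", "dims": {"w": 24}});
ga('send', 'pageview');
"""


def first_json_object_after(text, marker):
    """Decode the first JSON object that follows `marker` in `text`."""
    start = text.index(marker)       # raises ValueError if the marker is missing
    brace = text.index("{", start)   # first opening brace after the marker
    # raw_decode parses one JSON value at `brace` and ignores trailing text.
    obj, _end = json.JSONDecoder().raw_decode(text, brace)
    return obj


data = first_json_object_after(SCRIPT_TEXT, "ec:addProduct")
print(data["price"])
```

In the scraper, `scripttext` from the loop above would take the place of `SCRIPT_TEXT`.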