Scrape data from <script type="text/javascript"> using BeautifulSoup

I'm building a web scraper to pull product data from a website. This particular company hides the price behind a "login for Price" banner, but the price is still present in the HTML inside a <script type="text/javascript"> tag, and I'm unable to pull it out. The specific link I'm testing is https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/

My current code is below; the last line is the one I'm using to pull the text out.

```
import requests
from bs4 import BeautifulSoup
import pandas as pd

baseurl="https://www.chadwellsupply.com/"

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}



productlinks = []
for x in range(1, 3):
    # Send the same User-Agent header as the product-page request below
    response = requests.get(f'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/?q=&filter=&clearedfilter=undefined&orderby=19&pagesize=24&viewmode=list&currenttab=products&pagenumber={x}&articlepage=', headers=headers)
    soup = BeautifulSoup(response.content,'html.parser')

    productlist = soup.find_all('div', class_="product-header")



    for item in productlist:
        for link in item.find_all('a', href = True):
            productlinks.append(link['href'])
    


testlink = 'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/'

response = requests.get(testlink, headers = headers)

soup = BeautifulSoup(response.content,'html.parser')

print(soup.find('div',class_="product-title").text.strip())                
print(soup.find('p',class_="status").text.strip())         
print(soup.find('meta',{'property':'og:url'}))
print(soup.find('div',class_="tab-pane fade show active").text.strip())
print(soup.find('div',class_="Chadwell-Shared-Breadcrumbs").text.strip())
print(soup.find('script',{'type':'text/javascript'}).text.strip())
```

Below is the chunk of script from the website that I'm expecting it to pull (I tried to paste it directly here, but it wouldn't format correctly). What it gives me instead is "window.dataLayer = window.dataLayer || [];".

[Image: HTML from website]

Ideally I'd like to pull out just the price, but if I can at least get the whole chunk of data out, I can extract the price manually.

CodePudding user response：

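You can use the re and json modules to search and parse the HTML. BeautifulSoup does not parse JavaScript, and soup.find returns only the first matching <script> tag, which is why you get the window.dataLayer line. Another option is to use Selenium.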

```
import re
import json
import requests

url = "https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/"

html_doc = requests.get(url).text

# Capture the object literal passed to ga('ec:addProduct', ...)
data = re.search(r"ga\('ec:addProduct', (.*?)\);", html_doc).group(1)
data = json.loads(data)

print(data)
```

Prints:

```
{
    "id": "301078",
    "name": 'HOTPOINT® 24" SPACESAVER ELECTRIC RANGE - WHITE',
    "category": "Stove/ Ranges",
    "brand": "Hotpoint",
    "price": "759",
}
```
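
The one-liner above raises an AttributeError if the page has no `ga('ec:addProduct', ...)` call (for example, on a category page). A minimal sketch of a more defensive variant follows; the helper name `extract_product` and the `sample_html` string are made up for illustration, and the real page markup may differ.

```python
import json
import re

# Based on the pattern above, compiled once. DOTALL lets the object span lines.
PRODUCT_RE = re.compile(r"ga\('ec:addProduct',\s*(\{.*?\})\s*\);", re.DOTALL)


def extract_product(html_doc):
    """Return the embedded product dict, or None if it is absent or invalid."""
    match = PRODUCT_RE.search(html_doc)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


# Hypothetical stand-in for the downloaded page, not the site's actual HTML.
sample_html = """
<script type="text/javascript">window.dataLayer = window.dataLayer || [];</script>
<script type="text/javascript">
ga('ec:addProduct', {"id": "301078", "brand": "Hotpoint", "price": "759"});
</script>
"""

product = extract_product(sample_html)
print(product)
print(extract_product("<html></html>"))  # no product data on this page
```

With the live site you would pass `requests.get(url).text` instead of `sample_html` and skip any page for which the function returns None.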

Then for price you can do:

```
print(data["price"])
```

Prints:

```
759
```

CodePudding user response：

A hacky alternative to regex is to loop over the script tags and pick the one that contains a known string. In your case, the script you want contains function(i,s,o,g,r,a,m).

```
from bs4 import BeautifulSoup
import requests
import json

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

testlink = 'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/'

response = requests.get(testlink, headers=headers)

soup = BeautifulSoup(response.content, 'html.parser')

# Keep the text of the first script that contains the marker string
scripttext = None
for el in soup.find_all("script"):
    if "function(i,s,o,g,r,a,m)" in el.text:
        scripttext = el.text
        break
```

You can then select the data.

```
# Take the contents of the last {...} in the script and re-wrap it in braces
extracted = scripttext.split("{")[-1].split("}")[0]

my_json = json.loads("{%s}" % extracted)

print(my_json)
# {'id': '301078', 'name': 'HOTPOINT® 24" SPACESAVER ELECTRIC RANGE - WHITE', 'category': 'Stove/ Ranges', 'brand': 'Hotpoint', 'price': '759'}
```

Then get the price.

```
print(my_json["price"])
# 759
```
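
Splitting on braces only works while the product object is the last `{...}` in the script and contains no braces of its own. A sketch that relaxes this, assuming the script text contains the `ga('ec:addProduct', ...)` call used in the first answer: locate the call and let `json.JSONDecoder.raw_decode` parse exactly one JSON value from that position. The helper name `decode_object_after` and the `sample_script` text are hypothetical.

```python
import json

MARKER = "ga('ec:addProduct',"  # call name taken from the first answer's regex


def decode_object_after(text, marker=MARKER):
    """Decode the first JSON object that follows `marker` in `text`.

    Returns None if the marker or an opening brace is not found.
    """
    start = text.find(marker)
    if start == -1:
        return None
    brace = text.find("{", start)
    if brace == -1:
        return None
    # raw_decode parses one JSON value starting at index `brace` and ignores
    # whatever follows, so braces inside strings or later objects are harmless.
    obj, _end = json.JSONDecoder().raw_decode(text, brace)
    return obj


# Made-up script text for illustration; the real script may differ.
sample_script = """
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;})(window);
ga('ec:addProduct', {"id": "301078", "name": "RANGE {WHITE}", "price": "759"});
ga('ec:setAction', 'detail', {"list": "Stove/ Ranges"});
"""

product = decode_object_after(sample_script)
print(product["price"])
```

With BeautifulSoup you would pass the matching script's text, for example `decode_object_after(scripttext)`.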