I am trying to scrape a product page: https://www.svenssons.se/varumarken/swedese/lamino-fatolj-och-fotpall-lackad-bokfarskinn/?variantId=514023-01 When I inspect the HTML, I see that all the product info is stored as JSON in a script tag, under:
window.INITIAL_DATA = JSON.parse('{"pa...')
I tried fetching the HTML with requests and extracting the JSON string with a regex, but my code somehow changes the JSON structure and I cannot load it with json.loads():
import json
import re
import requests
from bs4 import BeautifulSoup
# headers is my dict of request headers (not shown here)
response = requests.get('https://www.svenssons.se/varumarken/swedese/lamino-fatolj-och-fotpall-lackad-bokfarskinn/?variantId=514023-01', headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
regex = r"JSON.parse\(.*;"
match = re.search(regex, str(soup))
json_string = match.group(0).replace("JSON.parse(", "")[1:-3]
json_data = json.loads(json_string)
It fails with a JSON error, apparently because the string contains odd whitespace and " characters that Python's json library cannot handle:
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 22173 (char 22172)
Is there a way to get the JSON data or, even better, to evaluate the window.INITIAL_DATA assignment directly from the HTML response in Python?
CodePudding user response:
Try:
import re
import js2py
import requests
url = "https://www.svenssons.se/varumarken/swedese/lamino-fatolj-och-fotpall-lackad-bokfarskinn/?variantId=514023-01"
html_doc = requests.get(url).text
data = re.search(r"window\.INITIAL_DATA = (.*)", html_doc)
data = js2py.eval_js(data.group(1))
print(data)
Prints:
{
"currentCountry": {
"englishName": "Sweden",
"localName": "Sverige",
"twoLetterCode": "SE",
},
"currentCurrency": "SEK",
"currentLanguage": "sv-SE",
"currentLanguageRevision": "43",
"currentLanguageTwoLetterName": "sv",
"dynamicData": [
{
"data": {},
"type": "NordicNest.ContentApi.DynamicData.MenuApiModel,NordicNest.ContentApi",
},
{
"type": "NordicNest.Core.Contentful.Model.SiteLayout.Footer,NordicNest.Core"
},
...
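If you would rather not depend on js2py, here is a minimal pure-Python sketch of the same idea. It assumes the page embeds the data as a single-quoted JavaScript string literal inside JSON.parse('...'), as in the question, and that the literal only uses escapes JSON also understands, plus \'. The likely reason json.loads() fails on the raw text is that the JavaScript-level escapes (for example \\" for a quote inside a value) have not been decoded yet, so the sketch decodes the string literal first and parses the resulting JSON second. The names extract_initial_data and sample_html are made up for illustration, and the sample markup is not taken from the real page.

```python
import json
import re

# Matches: window.INITIAL_DATA = JSON.parse('<single-quoted JS string literal>')
# The group captures the literal's body, allowing backslash-escaped characters.
LITERAL_RE = re.compile(
    r"window\.INITIAL_DATA\s*=\s*JSON\.parse\('((?:\\.|[^'\\])*)'\)",
    re.S,
)


def _to_json_escape(match):
    """Rewrite one token of a single-quoted JS literal for a JSON string."""
    token = match.group(0)
    if token == '"':      # bare double quote must be escaped in a JSON string
        return '\\"'
    if token == "\\'":    # \' is valid in JS but not in JSON
        return "'"
    return token          # other escape pairs (\\, \", \n, \uXXXX) pass through


def extract_initial_data(html):
    """Return the object passed to JSON.parse in window.INITIAL_DATA."""
    match = LITERAL_RE.search(html)
    if match is None:
        raise ValueError("window.INITIAL_DATA = JSON.parse('...') not found")
    body = match.group(1)
    # Step 1: turn the JS literal into a JSON string literal and decode it.
    json_literal = '"' + re.sub(r'\\.|"', _to_json_escape, body, flags=re.S) + '"'
    json_text = json.loads(json_literal)
    # Step 2: parse the decoded text as the actual JSON document.
    return json.loads(json_text)


# Made-up sample markup that mimics the escaping style, not the real page.
sample_html = r"""<script>window.INITIAL_DATA = JSON.parse('{"name":"Lamino \\"easy\\" chair","note":"it\'s oak"}');</script>"""
print(extract_initial_data(sample_html))
```

With the real page you would pass requests.get(url).text to extract_initial_data. JavaScript-only escapes such as \x41 are not handled by this sketch and would raise a JSONDecodeError, in which case evaluating the script with js2py as above is the safer route.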