Execute js function in HTML page scraped by python to get json data

Time:11-17

I am scraping a product page: https://www.svenssons.se/varumarken/swedese/lamino-fatolj-och-fotpall-lackad-bokfarskinn/?variantId=514023-01 When I inspect the HTML, I see that all of the product information is stored as JSON in a script tag, under:

window.INITIAL_DATA = JSON.parse('{"pa...')

I tried to scrape the HTML with requests and extract the JSON string with a regex, but my code somehow changes the JSON structure, so I cannot load it with json.loads():

response = requests.get('https://www.svenssons.se/varumarken/swedese/lamino-fatolj-och-fotpall-lackad-bokfarskinn/?variantId=514023-01', headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
regex = "JSON.parse\(.*;"
match = re.search(regex, str(soup))
json_string = match.group(0).replace("JSON.parse(", "")[1:-3]
json_data = json.loads(json_string)

It fails with a JSON error because the string contains multiple odd spaces and " characters that Python's json library cannot handle:

json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 22173 (char 22172)

Is there a way to get the JSON data or, even better, to evaluate the window.INITIAL_DATA expression directly from the HTML response in Python?

User response:

Try:

import re
import js2py
import requests


url = "https://www.svenssons.se/varumarken/swedese/lamino-fatolj-och-fotpall-lackad-bokfarskinn/?variantId=514023-01"

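# Download the page HTML.
html_doc = requests.get(url).text
# Capture the JavaScript expression assigned to window.INITIAL_DATA (the rest of that line).
data = re.search(r"window\.INITIAL_DATA = (.*)", html_doc)
# Let js2py evaluate the JSON.parse(...) call and return the resulting object.
data = js2py.eval_js(data.group(1))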

print(data)

Prints:

{
    "currentCountry": {
        "englishName": "Sweden",
        "localName": "Sverige",
        "twoLetterCode": "SE",
    },
    "currentCurrency": "SEK",
    "currentLanguage": "sv-SE",
    "currentLanguageRevision": "43",
    "currentLanguageTwoLetterName": "sv",
    "dynamicData": [
        {
            "data": {},
            "type": "NordicNest.ContentApi.DynamicData.MenuApiModel,NordicNest.ContentApi",
        },
        {
            "type": "NordicNest.Core.Contentful.Model.SiteLayout.Footer,NordicNest.Core"
        },


...
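
If you would rather not run a JavaScript interpreter, a possible alternative is to decode the string literal yourself. The argument of JSON.parse('...') is a single-quoted JavaScript string, so it has to be unescaped once before json.loads() can read it; skipping that step is one likely cause of the "Expecting ',' delimiter" error in the question. The following is only a minimal sketch: the helper names and SAMPLE_HTML are made up for illustration (this is not the real page content), and it assumes the literal uses only escapes that JSON also understands, plus \'.

```python
import json
import re

# Capture the body of the single-quoted JS string passed to JSON.parse,
# skipping over escaped characters so an escaped quote does not end the match.
INITIAL_DATA_RE = re.compile(
    r"window\.INITIAL_DATA\s*=\s*JSON\.parse\('((?:\\.|[^'\\])*)'\)", re.S
)


def js_literal_to_text(body):
    # Hypothetical helper: turn the body of a single-quoted JS literal into the
    # text it represents by rewriting it as a double-quoted JSON string.
    def fix(match):
        token = match.group(0)
        if token == '"':
            return '\\"'  # a bare double quote must be escaped inside a JSON string
        if token == "\\'":
            return "'"  # JSON has no \' escape
        return token  # keep every other escape pair unchanged

    return json.loads('"' + re.sub(r'\\.|"', fix, body, flags=re.S) + '"')


def extract_initial_data(html):
    # Hypothetical helper: find the assignment, unescape it, then parse the JSON.
    match = INITIAL_DATA_RE.search(html)
    if match is None:
        raise ValueError("window.INITIAL_DATA not found")
    return json.loads(js_literal_to_text(match.group(1)))


# Made-up sample that mimics the structure shown in the question.
SAMPLE_HTML = r"""<script>window.INITIAL_DATA = JSON.parse('{"name":"Lamino","note":"say \\"hi\\"","brand":"O\'Brien"}');</script>"""

data = extract_initial_data(SAMPLE_HTML)
print(data)
```

For the real page you would pass requests.get(url).text to extract_initial_data() instead of SAMPLE_HTML. JavaScript-only escapes such as \x41 are not handled by this sketch and would need extra code.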