Python Script

import requests
import json
from bs4 import BeautifulSoup
import re

url = 'https://www.dunelm.com/product/caldonia-check-natural-eyelet-curtains-1000187301?defaultSkuId=30729125'

r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')

# Save source code to file for testing
with open("sourcecode.html", "w", encoding='utf-8') as file:
    file.write(str(soup))

# Regex pattern to capture JSON data within webpage source code
regex_pattern = r"{\"delivery\"*.*false*}}}"

URL: https://www.dunelm.com/product/caldonia-check-natural-eyelet-curtains-1000187301?defaultSkuId=30729125

I'm trying to pull the JSON data embedded within the source code of the URL listed above using Regex.

I manually copied the page source from the URL above into regex101.com and tested the following regex pattern:

{\"delivery\"*.*false*}}}

The pattern appears to capture the JSON data I need.

Issue

When I view the contents of the soup variable or the saved file, I can see that the HTML source code has been captured.
However, I do not know how to apply the regex in Python so that it captures only the JSON string I need to build my desired Python dictionary.

Any help would be greatly appreciated.

CodePudding user response：

Maybe something like this can help you:

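import json
import re
import requests

url = 'https://www.dunelm.com/product/caldonia-check-natural-eyelet-curtains-1000187301?defaultSkuId=30729125'

r = requests.get(url)
source_text = r.text
# Apply the regex to the raw page text; findall returns a list of matching strings
matches = re.findall('put your regex here', source_text)

To convert a matched JSON string into a Python dictionary, parse it with json.loads (json.dumps goes the other way, from Python object to string):

# Assumes at least one match was found
data = json.loads(matches[0])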
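
For reference, here is a minimal, self-contained sketch of the whole flow that runs without a network request. The sample_html string and the window.__DATA__ name are made up for illustration and only stand in for r.text; the real page embeds a much larger object. The pattern is a tidied variant of the one in the question (braces escaped, stray * quantifiers dropped), and extract_data is a hypothetical helper name, so treat this as a starting point rather than a tested solution for that site.

```python
import json
import re

# Hypothetical stand-in for r.text (assumption: not the real page content).
sample_html = (
    '<script>window.__DATA__ = '
    '{"delivery":{"standard":{"nextDay":false}}};</script>'
)

# Tidied variant of the question's pattern: literal braces are escaped and
# the stray "*" quantifiers are removed. DOTALL lets "." match newlines too.
pattern = re.compile(r'\{"delivery".*false\}\}\}', re.DOTALL)


def extract_data(html):
    """Return the embedded JSON as a dict, or None if the pattern does not match."""
    match = pattern.search(html)
    if match is None:
        return None
    # json.loads turns the matched JSON text into Python objects.
    return json.loads(match.group(0))


data = extract_data(sample_html)
print(data["delivery"]["standard"]["nextDay"])
```

Because the pattern relies on a greedy .* and a fixed tail of false}}}, it will break if the site changes the shape of the embedded object, so it may be worth checking for a None result before using the dictionary.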