Python Script
import requests
import json
from bs4 import BeautifulSoup
import re
url = 'https://www.dunelm.com/product/caldonia-check-natural-eyelet-curtains-1000187301?defaultSkuId=30729125'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
# Save source code to file for testing
with open("sourcecode.html", "w", encoding='utf-8') as file:
    file.write(str(soup))
# Regex pattern to capture JSON data within webpage source code
regex_pattern = r"{\"delivery\"*.*false*}}}"
I'm trying to pull the JSON data embedded within the source code of the URL listed above using Regex.
I manually copied the page source from the URL above into regex101.com and tested it against the following regex pattern:
{\"delivery\"*.*false*}}}
The pattern appears to capture the JSON data I need.
Issue
When I view the contents of the soup variable or the saved file, the HTML source code appears to have been captured.
However, I do not know how to apply the regex in Python so that it captures only the JSON string I need to build a Python dictionary.
Any help would be greatly appreciated.
CodePudding user response:
Maybe something like this can help you:
import re
import requests
url = 'https://www.dunelm.com/product/caldonia-check-natural-eyelet-curtains-1000187301?defaultSkuId=30729125'
r = requests.get(url)
source_text = r.text
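# Regex to extract the JSON string (re.findall returns a list of matches)
# Note: do not name this variable "json", or it will clash with the json module
matches = re.findall('put your regex here', source_text)
To convert the first matched string into a Python dictionary, you can use:
import json
data = json.loads(matches[0])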
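If the regex turns out to be fragile (a greedy .* can run past the end of the object you want), one alternative is to locate the start of the JSON and let the json module find where it ends. The sketch below is only an illustration: SAMPLE_HTML is a made-up stand-in for r.text, extract_embedded_json is a hypothetical helper name, and the marker {"delivery" is taken from the pattern in the question. It assumes the JSON appears unescaped in the page source.

```python
import json

# Hypothetical sample standing in for the downloaded page source (r.text).
SAMPLE_HTML = (
    '<html><body><script>window.__DATA__ = '
    '{"delivery": {"standard": true}, "stock": {"available": false}};'
    '</script></body></html>'
)


def extract_embedded_json(source_text, marker='{"delivery"'):
    """Return the JSON object that starts at `marker` as a dict, or None."""
    start = source_text.find(marker)
    if start == -1:
        return None
    # raw_decode parses one complete JSON value beginning at `start`
    # and ignores whatever follows it in the string.
    data, _end = json.JSONDecoder().raw_decode(source_text, start)
    return data


product = extract_embedded_json(SAMPLE_HTML)
print(product["delivery"])
```

With the real page you would pass r.text instead of SAMPLE_HTML. If the text at the marker is not valid JSON, raw_decode raises json.JSONDecodeError, so you may want to catch that.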