How to Extract JSON From HTML Source Code Using Regex

Time:09-17

Python Script

import requests
import json
from bs4 import BeautifulSoup
import re

url = 'https://www.dunelm.com/product/caldonia-check-natural-eyelet-curtains-1000187301?defaultSkuId=30729125'

r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')

# Save source code to file for testing
with open("sourcecode.html", "w", encoding='utf-8') as file:
    file.write(str(soup))

# Regex pattern to capture JSON data within webpage source code
regex_pattern = r"{\"delivery\"*.*false*}}}"

URL: https://www.dunelm.com/product/caldonia-check-natural-eyelet-curtains-1000187301?defaultSkuId=30729125

I'm trying to extract the JSON data embedded in the source code of the page at the URL above using a regex.

I manually copied the page source into regex101.com and tested it against the following regex pattern:

{\"delivery\"*.*false*}}}

On regex101.com, this pattern appears to capture the JSON data I need.
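
To check the same pattern in Python rather than on regex101.com, a minimal sketch along these lines may help. The `sample_html` string is a made-up stand-in for the page source (the real markup will differ), and only the pattern itself comes from the question.

```python
import re

# Hypothetical stand-in for the page source; not the actual page markup.
sample_html = (
    '<html><body><script>window.__DATA__ = '
    '{"delivery":{"home":{"available":false}}}'
    ';</script></body></html>'
)

# Pattern from the question. Note that `\"*` means "zero or more quotes"
# and `false*` means "fals" followed by zero or more "e" characters.
regex_pattern = r"{\"delivery\"*.*false*}}}"

# re.search returns the first match object, or None if nothing matches.
match = re.search(regex_pattern, sample_html)
json_text = match.group(0) if match else None
print(json_text)
```

Because `.*` is greedy, the match runs to the last `false}}}` on the same line, so on a real page it is worth checking that the captured text is exactly one JSON object.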

Issue

When I view the contents of the soup variable or the saved file, they contain the full HTML source code.
However, I do not know how to apply the regex so that only the JSON string is captured, which I need in order to build a Python dictionary.

Any help would be greatly appreciated.

User response:

Maybe something like this can help you:

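import json
import re

import requests

url = 'https://www.dunelm.com/product/caldonia-check-natural-eyelet-curtains-1000187301?defaultSkuId=30729125'

r = requests.get(url)
source_text = r.text

# Find every match of the pattern in the page source (returns a list of strings).
# The result is not named "json", so the json module is not shadowed.
matches = re.findall('put your regex here', source_text)

To convert the first match into a Python dictionary, you can use:

# json.loads parses a JSON string into Python objects.
# This assumes the regex found at least one match.
data = json.loads(matches[0])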
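
If the regex turns out to be fragile (a greedy `.*` can easily capture too much or too little of nested JSON), one alternative is to find only where the object starts and let the `json` module work out where it ends. This is a rough sketch, not a tested solution for this site: `extract_json_object` is a hypothetical helper name, `sample_html` is a made-up stand-in for the page source, and the `{"delivery"` start marker is taken from the pattern in the question.

```python
import json


def extract_json_object(text, start_marker='{"delivery"'):
    """Return the JSON object that begins at start_marker, or None if absent."""
    start = text.find(start_marker)
    if start == -1:
        return None
    # raw_decode parses one JSON value beginning at index `start` and
    # ignores whatever follows it, so no closing pattern is needed.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


# Hypothetical stand-in for r.text; the real page source will differ.
sample_html = (
    '<html><body><script>window.__DATA__ = '
    '{"delivery":{"home":{"available":false}}}'
    ';</script></body></html>'
)

data = extract_json_object(sample_html)
print(data["delivery"]["home"]["available"])
```

With a real response you would pass `r.text` instead of `sample_html`. `raw_decode` raises `json.JSONDecodeError` if the text at the marker is not valid JSON, so it may be worth wrapping the call in a try/except.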