I am trying to parse the JSON string embedded in this page, but I can't convert it into a Python dict with json.loads. Here is my starter code:
import requests
import re
import json
import html
headers = {
"authority": "www.budgetpetproducts.com.au",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"accept-language": "en,ru;q=0.9",
"cache-control": "max-age=0",
# 'cookie': '_ALGOLIA=anonymous-6d2b7032-518b-4da0-b0b9-964e00920f3d; scarab.visitor="45DBAC5A87F9C262"; _fbp=fb.2.1674782859328.577366252; _tt_enable_cookie=1; _ttp=jTkn1Rm860eDgPiXlG45pfHRDG7; _gid=GA1.3.415748101.1675529773; _ga=GA1.3.583864494.1674782855; _uetsid=d505d7e0a4ac11ed962a7de239a71ea1; _uetvid=c75d7e109de111eda4efb3023338d7ad; XSRF-TOKEN=eyJpdiI6IkJjNVpiMzNQRGFRYzJ6Z1ZUR2NEVFE9PSIsInZhbHVlIjoiSGNyN2J4dUtza1JadUhqUFAwWklWcFdIVmZLSC84OEp4TUtRYWdLQzBhUU1GK2psMzFHQW5SVTlZZm1Yd0xmaGt3QWFoRDVsNVYyRGdKYVRKbUZSak9UMzlCZEhXc0FubjdORERraE5nRHNsYWViVkxZZCt6d2VkbGNGNjhNWlciLCJtYWMiOiIyZjQxMzk5YjZjZWNmM2E1MmVjYmQxODAxNDY3ZWY1MTZiM2MyNzcwODBmY2ZlNWM5YTVmMDU4MWMwMDViZjQ5In0=; budget_pet_products_session=eyJpdiI6IkdnVHVrTjlkUGY1SGxMN0lZWTVsckE9PSIsInZhbHVlIjoibERhNnllekN5azRhMEptSU9QeGZ3VkVaYUpsUkxGbi9rR21yMXArWEoycWdlMWJSZFRnL3BtOUhBMXBXQ0syVEJPbUtaOVVpLzRkdFBwRDZsUU45V3lxd1JzK2lTU1RyLzkyM24xcTR2TUUvQVdrdEV0VHpoRDFIVFNBZHJTVTgiLCJtYWMiOiIwNDAzMTM3YmQ5YWE3ZWY3NmE3MjA3OGEyMTZmMWM5ODY3ZjlkOGZjNDEyYzkzNmM5MzhjY2ZkODcyNmU3N2NjIn0=; scarab.profile="1677|1675608432|8787|1675608372|3624|1675146623"; _ga_6YGE1ZKCTV=GS1.1.1675608368.2.1.1675608533.0.0.0',
"referer": "https://www.budgetpetproducts.com.au/dog/food?sort=best_match",
"sec-ch-ua": '"Not?A_Brand";v="8", "Chromium";v="108", "Yandex";v="23"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Linux"',
"sec-fetch-dest": "document",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "same-origin",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 YaBrowser/23.1.1.1038 (beta) Yowser/2.5 Safari/537.36",
}
response = requests.get(
"https://www.budgetpetproducts.com.au/product/royal-canin-maxi-adult-dry-dog-food-4kg/1679",
headers=headers,
)
data = (
re.search(":data=(.*)", response.text)
.group(1)
.replace(':is-mobile="false"></product-page-component>', "")
.replace("&quot;", '"')
.replace("'", '\\"')
)
clean_data = html.unescape(data)
json_blob = json.loads(clean_data)
print(json_blob)
The code above raises a JSONDecodeError:
File "/usr/lib/python3.9/json/decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 4 (char 3)
I checked the JSON in an online JSON formatter and it is reported as valid; here is the link: https://jsoneditoronline.org/#left=cloud.c7e2f35696094a07a49afba2d18c6ad4
Can anyone please help me out here? Thanks
Answer:
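As advised in the comment by Abolfazi Ghaemi, the captured text still includes the double quotes that wrap the :data attribute value, so json.loads reads the leading "{" as a complete JSON string and reports everything after it as extra data (char 3). Slice off the leading quote and the two trailing characters by changing the json.loads line to: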
json_blob = json.loads(clean_data[1:-2])
which prints:
{'info': {'id': 1679, 'name': {'title': 'Royal Canin Maxi Adult Dry Dog Food 4kg', 'text': 'Royal Canin Maxi Adult Dry Dog Food 4kg', 'icon': None, 'slug': 'royal-canin-maxi-adult-dry-dog-food-4kg'}, 'category': {'html': ['<a href="https://www.budgetpetproducts.com.au/Dog > 1/" data-ajax="false" style="text-decoration:none">Dog</a>', ...
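The fixed-offset slice works for this page, but it depends on exactly what the regex happens to capture. Here is a minimal sketch of a less fragile variant. It assumes the :data attribute is wrapped in double quotes and that its JSON is entity-escaped, as the question's .replace calls suggest. The sample string and the extract_data helper are hypothetical stand-ins, not the real page or any library API. The first part reproduces the original error on a tiny input.

```python
import html
import json
import re

# Reproduce the error: a leading quote makes '"{"' parse as a complete JSON string.
try:
    json.loads('"{"id": 1679}"')
except json.JSONDecodeError as err:
    message = str(err)
print(message)

# Hypothetical snippet shaped like the page's markup; the real attribute is far longer.
sample = (
    '<product-page-component :data="{&quot;info&quot;:{&quot;id&quot;:1679,'
    '&quot;name&quot;:&quot;Dog&#039;s food&quot;}}" '
    ':is-mobile="false"></product-page-component>'
)


def extract_data(page_text):
    """Return the JSON stored in the :data attribute as a Python object."""
    # Capture only what sits between the attribute's double quotes.
    # Inside the raw HTML, inner quotes are escaped as &quot;, so [^"]* is safe.
    match = re.search(r':data="([^"]*)"', page_text)
    if match is None:
        raise ValueError(":data attribute not found")
    # html.unescape decodes &quot;, &#039; and other entities in one step.
    return json.loads(html.unescape(match.group(1)))


result = extract_data(sample)
print(result["info"]["id"])
```

With the real page, extract_data(response.text) would replace the chain of .replace calls and the slice, provided the markup matches the assumptions above.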