I want to scrape the JSON data from a script tag (shown below) with BeautifulSoup, but I get the error Expecting value: line 1 column 1 (char 0),
which seems to imply that the variable is empty. What am I missing here?
#PYTHON:
a = soup.find("script", type="application/ld+json")
a = str(a)
print(a)
data = dict()
script_dict = json.loads(a.replace("'",'"'))
print(script_dict)
data["author"] = script_dict["author"]
data["embed_url"] = script_dict["embedUrl"]
data["duration"] = ":".join(re.findall(r"\d\d",script_dict["duration"]))
data["upload_date"] = re.findall(r"\d{4}-\d{2}-\d{2}",script_dict["uploadDate"])[0]
data["accurate_views"] = int(script_dict["interactionStatistic"][0]["userInteractionCount"].replace(",",""))
Data to be scraped:
<script type="application/ld+json">
{
"@context": "http://schema.org/",
"@type": "DATA",
"name": "Klaus ;",
"embedUrl": "http://example.com",
"duration": "PT00H11M27S",
"uploadDate": "2022-07-30T13:12:05 00:00",
"description": "SOMETEXT;",
"author" : "Klaus", "interactionStatistic": [
{
"@type": "InteractionCounter",
"interactionType": "http://schema.org/WatchAction",
"userInteractionCount": "4,924,277"
},
{
"@type": "InteractionCounter",
"interactionType": "http://schema.org/LikeAction",
"userInteractionCount": "10,469"
}
]
}
</script>
CodePudding user response:
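Don't convert the tag to a string with str(). That string includes the surrounding <script> tags, so it starts with "<" and json.loads fails on the very first character, which is what "line 1 column 1 (char 0)" means; the variable is not empty. The replace("'", '"') call is not needed either, because the tag's contents are already valid JSON. Use the .text property and then json.loads: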
import json
from bs4 import BeautifulSoup
# Sample page content containing the JSON-LD script tag.
s = """\
<script type="application/ld+json">
{
"@context": "http://schema.org/",
"@type": "DATA",
"name": "Klaus ;",
"embedUrl": "http://example.com",
"duration": "PT00H11M27S",
"uploadDate": "2022-07-30T13:12:05 00:00",
"description": "SOMETEXT;",
"author" : "Klaus", "interactionStatistic": [
{
"@type": "InteractionCounter",
"interactionType": "http://schema.org/WatchAction",
"userInteractionCount": "4,924,277"
},
{
"@type": "InteractionCounter",
"interactionType": "http://schema.org/LikeAction",
"userInteractionCount": "10,469"
}
]
}
</script>"""
soup = BeautifulSoup(s, "html.parser")
# Locate the script tag by its type attribute.
data = soup.find("script", type="application/ld+json")
# .text holds only the tag's contents, without the <script> wrapper.
data = json.loads(data.text)
print(data)
Prints:
{
"@context": "http://schema.org/",
"@type": "DATA",
"name": "Klaus ;",
"embedUrl": "http://example.com",
"duration": "PT00H11M27S",
"uploadDate": "2022-07-30T13:12:05 00:00",
"description": "SOMETEXT;",
"author": "Klaus",
"interactionStatistic": [
{
"@type": "InteractionCounter",
"interactionType": "http://schema.org/WatchAction",
"userInteractionCount": "4,924,277",
},
{
"@type": "InteractionCounter",
"interactionType": "http://schema.org/LikeAction",
"userInteractionCount": "10,469",
},
],
}
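Once json.loads succeeds, the fields from the question can be read straight from the resulting dict. Below is a minimal sketch of that step, assuming the parsed data has the same keys as the sample. The helper name extract_fields is hypothetical, and choosing the WatchAction counter by its interactionType (rather than by index 0, as in the question) is my own choice, not something the page guarantees. The sample is trimmed and parsed directly so the snippet runs without BeautifulSoup.

```python
import json
import re


def extract_fields(script_dict):
    """Collect the fields from the question out of the parsed JSON-LD dict."""
    stats = script_dict["interactionStatistic"]
    # Pick the WatchAction counter by type instead of relying on list order.
    watch = next(s for s in stats if s["interactionType"].endswith("/WatchAction"))
    return {
        "author": script_dict["author"],
        "embed_url": script_dict["embedUrl"],
        # "PT00H11M27S" -> "00:11:27": join the two-digit groups with colons.
        "duration": ":".join(re.findall(r"\d\d", script_dict["duration"])),
        # Keep only the YYYY-MM-DD part of the timestamp.
        "upload_date": re.findall(r"\d{4}-\d{2}-\d{2}", script_dict["uploadDate"])[0],
        # "4,924,277" -> 4924277
        "accurate_views": int(watch["userInteractionCount"].replace(",", "")),
    }


# Trimmed copy of the sample data. In the real script this dict would come
# from json.loads(tag.text), as in the answer above.
sample = json.loads("""
{
  "embedUrl": "http://example.com",
  "duration": "PT00H11M27S",
  "uploadDate": "2022-07-30T13:12:05 00:00",
  "author": "Klaus",
  "interactionStatistic": [
    {"interactionType": "http://schema.org/WatchAction",
     "userInteractionCount": "4,924,277"},
    {"interactionType": "http://schema.org/LikeAction",
     "userInteractionCount": "10,469"}
  ]
}
""")

data = extract_fields(sample)
print(data)
```

If soup.find returns None on the real page (for example because the script tag is added by JavaScript), check for that before calling json.loads.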