I'm trying to create a crawler that scans the website https://www.superherodb.com/ and fetches the information on all the superheroes (listed at https://www.superherodb.com/characters) from their individual pages. I want to fetch everything about each hero: the stats, powers, equipment, origin, connections, etc. But I am having trouble accessing the stats on a hero's page.
For example, this page: https://www.superherodb.com/001/10-39302/
For the Power Stats section on the hero's page I tried:
bs_test.find_all("div", {"class": "stat-value"})
and:
bs_test.select(".stat-value")
But the output always shows 0 for every value:
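[<div class="stat-value">0</div>,
 <div class="stat-value">0</div>,
 <div class="stat-value">0</div>,
 <div class="stat-value">0</div>,
 <div class="stat-value">0</div>,
 <div class="stat-value">0</div>,
 <div class="stat-value">0</div>]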
What am I missing here? Please help me.
CodePudding user response:
The real values aren't in those elements. Try scraping the <script> element rather than the .stat-value divs. It provides the following data:
stats_10_39302_shdb = {"stats":{"int":140,"str":45,"spe":5,"dur":5,"pow":0,"com":20,"tie":0},"bars":{"int":70,"str":1,"spe":1,"dur":5,"pow":0,"com":20,"tie":0}
for the Han example.
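As a rough sketch of that idea, the snippet below collects the contents of every <script> element and parses any `stats_..._shdb` assignment it finds. It uses only the standard library; BeautifulSoup's find_all("script") would do the same job of locating the scripts. SAMPLE_HTML is a hypothetical, shortened stand-in for the downloaded page rather than the site's real markup, and the class name ScriptCollector is made up for this example.

```python
import json
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text can arrive in several chunks, so accumulate it.
        if self._in_script:
            self.scripts[-1] += data


# Hypothetical stand-in for the page source (assumption, not the real markup).
SAMPLE_HTML = """
<div class="stat-value">0</div>
<script>
stats_10_39302_shdb = {"stats":{"int":140,"str":45},"bars":{"int":70,"str":1}};
</script>
"""

collector = ScriptCollector()
collector.feed(SAMPLE_HTML)

# Capture the variable name and the object literal up to the closing ';'.
pattern = re.compile(r'(stats_\w+_shdb)\s*=\s*(\{[^;]+)')
found = {}
for script in collector.scripts:
    for name, payload in pattern.findall(script):
        found[name] = json.loads(payload)

print(found["stats_10_39302_shdb"]["stats"]["int"])
```

Keying the result by variable name keeps each stats block identifiable if a page happens to contain more than one.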
CodePudding user response:
The data is injected by JS after the page loads, but requests.get only gives you the static HTML, which has placeholder values alongside a <script> tag with a JSON-formatted JS object with the actual data.
Following up on the astute answer from bensonium, here's how you can pull the data out of the .footnote script element:
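import json
import re

import requests

# Fetch the static HTML; the real stats sit in an inline <script>, not in the .stat-value divs.
response = requests.get("https://www.superherodb.com/001/10-39302/")
response.raise_for_status()

# Each stats object starts with {"stats": and runs up to the ';' that ends the JS assignment.
stats = [json.loads(x) for x in re.findall(r'\{"stats":[^;]+', response.text)]
print(json.dumps(stats, indent=2))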
Output:
[
  {
    "stats": {
      "int": 140,
      "str": 45,
      "spe": 5,
      "dur": 5,
      "pow": 0,
      "com": 20,
      "tie": 0
    },
    "bars": {
      "int": 70,
      "str": 1,
      "spe": 1,
      "dur": 5,
      "pow": 0,
      "com": 20,
      "tie": 0
    },
    "shdbclass": {
      "value": 10,
      "visual": 10,
      "level": 1
    },
    "specials": {
      "omnipotent": 0,
      "omniscient": 0,
      "omnipresent": 0
    }
  },
  {
    "stats": {
      "int": 100,
      "str": 100,
      "spe": 10,
      "dur": 1,
      "pow": 1,
      "com": 1,
      "tie": 9
    },
    "bars": {
      "int": 50,
      "str": 1,
      "spe": 6,
      "dur": 1,
      "pow": 1,
      "com": 1,
      "tie": 90
    },
    "shdbclass": {
      "value": 20.5,
      "visual": 21,
      "level": 1
    },
    "specials": {
      "omnipotent": 0,
      "omniscient": 0,
      "omnipresent": 0
    },
    "ustats": 1
  }
]
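The [^;]+ pattern relies on no semicolon appearing inside the JSON itself. If that ever stops holding, one possible alternative is to let the JSON decoder find the end of each object. The sketch below uses json.JSONDecoder.raw_decode for that; the function name extract_stats and the sample string (including its "note" key) are invented for illustration and are not taken from the site.

```python
import json
import re


def extract_stats(html):
    """Return every JSON object in html that starts with {"stats": ."""
    decoder = json.JSONDecoder()
    results = []
    for match in re.finditer(r'\{"stats":', html):
        # raw_decode parses one complete JSON value starting at the given
        # index and ignores whatever text follows it.
        obj, _end = decoder.raw_decode(html, match.start())
        results.append(obj)
    return results


# Hypothetical sample: the "note" value contains a ';' that would cut the
# regex-only approach short.
sample = 'a = {"stats":{"int":140},"note":"x;y"}; b = {"stats":{"int":100}};'
print(extract_stats(sample))
```

raw_decode raises json.JSONDecodeError if the text at that position is not valid JSON, so a real crawler would probably want to catch that and skip the page.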
See the canonical "Web-scraping JavaScript page with Python" question for a generalization of this approach, along with more explanations and strategies for scraping JS-driven pages.