Web scraping Python beautifulsoup

Time:03-09

I'm trying to create a crawler that scans the website https://www.superherodb.com/ and fetches the information on all the superheroes (listed at https://www.superherodb.com/characters) from their individual pages. I want to fetch everything about each hero: the stats, powers, equipment, origin, connections, etc. However, I am having trouble reading the stats from a hero's page.

For example, this page: https://www.superherodb.com/001/10-39302/

For the Power Stats section in the hero's page I tried:

    bs_test.find_all("div", {"class": "stat-value"})

and:

    bs_test.select(".stat-value")

But every element in the output has 0 as its value:

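[<div class="stat-value">0</div>,
 <div class="stat-value">0</div>,
 <div class="stat-value">0</div>,
 <div class="stat-value">0</div>,
 <div class="stat-value">0</div>,
 <div class="stat-value">0</div>,
 <div class="stat-value">0</div>]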

What am I missing here? Please help me.

CodePudding user response:

The values aren't present in those elements. Try scraping the <script> tag rather than the .stat-value elements. It provides the following data:

stats_10_39302_shdb = {"stats":{"int":140,"str":45,"spe":5,"dur":5,"pow":0,"com":20,"tie":0},"bars":{"int":70,"str":1,"spe":1,"dur":5,"pow":0,"com":20,"tie":0}

for the Han example.
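A minimal sketch of that idea, assuming the page embeds the object as a JavaScript assignment like the stats_10_39302_shdb variable shown above. The HTML sample, the naming pattern and the extract_stats_vars helper are all hypothetical and only illustrate the approach; json.JSONDecoder.raw_decode parses one JSON value at the start of a string and ignores whatever follows it, so the code does not need to guess where the object ends.

```python
import json
import re

# Hypothetical, shortened stand-in for the static HTML returned by the server.
SAMPLE_HTML = """
<div class="stat-value">0</div>
<div class="stat-value">0</div>
<script>
var stats_10_39302_shdb = {"stats":{"int":140,"str":45,"spe":5,"dur":5,"pow":0,"com":20,"tie":0},"bars":{"int":70,"str":1,"spe":1,"dur":5,"pow":0,"com":20,"tie":0}};
</script>
"""

# Assumed naming pattern, generalized from the single variable name in the answer.
# The lookahead makes each match end right before the opening "{".
VAR_PATTERN = re.compile(r"(stats_\w+_shdb)\s*=\s*(?=\{)")


def extract_stats_vars(html):
    """Return {variable_name: parsed_object} for each embedded stats assignment."""
    decoder = json.JSONDecoder()
    found = {}
    for match in VAR_PATTERN.finditer(html):
        # raw_decode parses one JSON value from the start of the string
        # and ignores the rest (here, the trailing ";" and closing tag).
        obj, _end = decoder.raw_decode(html[match.end():])
        found[match.group(1)] = obj
    return found


embedded = extract_stats_vars(SAMPLE_HTML)
print(embedded["stats_10_39302_shdb"]["stats"])
```

On a live page the same helper could be fed response.text from requests, provided the real markup matches these assumptions.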

CodePudding user response:

The data is injected by JS after the page loads, but requests.get only gives you the static HTML, which contains placeholder values alongside a <script> tag holding a JSON-formatted JS object with the actual data.

Following up on the astute answer from bensonium, here's how you can pull the data out of the .footnote script element:

import json
import re
import requests

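# Fetch the static HTML; the stats live in a <script> tag, not in the divs
response = requests.get("https://www.superherodb.com/001/10-39302/")
response.raise_for_status()
# Each embedded object starts with {"stats": and runs up to the next semicolon
stats = [json.loads(x) for x in re.findall(r'{"stats":[^;]+', response.text)]
print(json.dumps(stats, indent=2))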

Output:

[
  {
    "stats": {
      "int": 140,
      "str": 45,
      "spe": 5,
      "dur": 5,
      "pow": 0,
      "com": 20,
      "tie": 0
    },
    "bars": {
      "int": 70,
      "str": 1,
      "spe": 1,
      "dur": 5,
      "pow": 0,
      "com": 20,
      "tie": 0
    },
    "shdbclass": {
      "value": 10,
      "visual": 10,
      "level": 1
    },
    "specials": {
      "omnipotent": 0,
      "omniscient": 0,
      "omnipresent": 0
    }
  },
  {
    "stats": {
      "int": 100,
      "str": 100,
      "spe": 10,
      "dur": 1,
      "pow": 1,
      "com": 1,
      "tie": 9
    },
    "bars": {
      "int": 50,
      "str": 1,
      "spe": 6,
      "dur": 1,
      "pow": 1,
      "com": 1,
      "tie": 90
    },
    "shdbclass": {
      "value": 20.5,
      "visual": 21,
      "level": 1
    },
    "specials": {
      "omnipotent": 0,
      "omniscient": 0,
      "omnipresent": 0
    },
    "ustats": 1
  }
]
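The list above holds two objects, and only the second carries a "ustats" key. The following sketch shows one way to separate them, on the assumption (not confirmed by the page or by either answer) that "ustats" marks user-submitted stats and that entries without it are the site's own. split_entries is a hypothetical helper, and the data is a trimmed copy of the output above.

```python
# Trimmed copy of the parsed output shown above.
stats = [
    {
        "stats": {"int": 140, "str": 45, "spe": 5, "dur": 5, "pow": 0, "com": 20, "tie": 0},
        "shdbclass": {"value": 10, "visual": 10, "level": 1},
    },
    {
        "stats": {"int": 100, "str": 100, "spe": 10, "dur": 1, "pow": 1, "com": 1, "tie": 9},
        "shdbclass": {"value": 20.5, "visual": 21, "level": 1},
        "ustats": 1,
    },
]


def split_entries(entries):
    """Split entries by presence of the "ustats" key (assumed: user-submitted)."""
    site = [e for e in entries if "ustats" not in e]
    user = [e for e in entries if "ustats" in e]
    return site, user


site_entries, user_entries = split_entries(stats)

# Print one "key: value" line per stat of the first entry without "ustats".
for key, value in site_entries[0]["stats"].items():
    print(f"{key}: {value}")
```

If the assumption about "ustats" turns out to be wrong, the split still works mechanically, but the labels "site" and "user" would need renaming.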

See the canonical Web-scraping JavaScript page with Python for a generalization of this approach and more explanations and strategies for scraping JS-driven pages.
