Python BeautifulSoup & Selenium not scraping full html

Time:06-28

Beginner web-scraper here. My practice task is simple: count how often a player used each Pokemon over their last 50 games, on this page for example. To do this, I planned to use the image URL of the Pokemon, which contains the Pokemon's name (in an <img> tag, wrapped in <span></span>). Inspecting in Chrome shows this: <img alt="Played pokemon" srcset="/_next/image?url=/Sprites/t_Square_Snorlax.png&amp;w=96&amp;q=75 1x, /_next/image?url=/Sprites/t_Square_Snorlax.png&amp;w=256&amp;q=75 2x" ...

1) Using Beautiful Soup alone doesn't get the html of the images that I need:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://uniteapi.dev/p/ほばち')
wp_player = bs(r.content, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning
wp_player.select('span img')

2) Using Selenium picks up some of what BeautifulSoup missed:

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://uniteapi.dev/p/ほばち"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
page = driver.page_source
driver.quit()

soup = bs(page, 'html.parser')
soup.select('span img')

But the image tags it returns have an empty src, like this: <img alt="Played pokemon" data-nimg="fixed" decoding="async" src=""

What am I misunderstanding here? The website I'm interested in does not have a public API, despite its name. Any help is much appreciated.

CodePudding user response:

This is a common issue when you scrape a page before it has finished loading. You need to wait until the page has loaded the images you are after. You have two options: an implicit wait, or an explicit wait for the image elements to appear.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

url = r"https://uniteapi.dev/p/ほばち"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
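driver = webdriver.Chrome(options=options)  # Selenium 4 no longer accepts executable_path
driver.get(url)
# EXPLICIT WAIT: block for up to 10 s until the image elements are present in the DOM
WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[alt="Played pokemon"]')))
# IMPLICIT WAIT (the alternative; use instead of the explicit wait, not together with it)
# driver.implicitly_wait(10)

# find_elements_by_css_selector was removed in Selenium 4; use find_elements with a By locator
pokemons = driver.find_elements(By.CSS_SELECTOR, '[alt="Played pokemon"]')
for element in pokemons:
    print(element.get_attribute("src"))
driver.quit()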

Choose one or the other. The explicit wait is the better choice, because it waits for the specific element(s) to be rendered before you try to read their values.

OUTPUT:
https://uniteapi.dev/_next/image?url=/Sprites/t_Square_Tsareena.png&w=256&q=75
https://uniteapi.dev/_next/image?url=/Sprites/t_Square_Tsareena.png&w=256&q=75
https://uniteapi.dev/_next/image?url=/Sprites/t_Square_Snorlax.png&w=256&q=75
https://uniteapi.dev/_next/image?url=/Sprites/t_Square_Snorlax.png&w=256&q=75
https://uniteapi.dev/_next/image?url=/Sprites/t_Square_Snorlax.png&w=256&q=75
https://uniteapi.dev/_next/image?url=/Sprites/t_Square_Snorlax.png&w=256&q=75
https://uniteapi.dev/_next/image?url=/Sprites/t_Square_Snorlax.png&w=256&q=75

Your attempts didn't work because a plain GET request returns the HTML in its initial state, before JavaScript has rendered the DOM elements, and reading page_source immediately after driver.get() can also run before the images are filled in.
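
To get from the printed URLs to the usage count the question asks for, something like the following might work. It is only a sketch: it assumes every sprite URL follows the `t_Square_<Name>.png` pattern seen in the output above, and the helper names `pokemon_name` and `count_usage` are made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlparse


def pokemon_name(image_url):
    """Return the Pokemon name embedded in a sprite URL, or None if absent."""
    # The sprite path sits in the `url` query parameter,
    # e.g. /Sprites/t_Square_Snorlax.png
    query = parse_qs(urlparse(image_url).query)
    sprite_path = query.get("url", [""])[0]
    # Assumption: file names always look like t_Square_<Name>.png
    match = re.search(r"t_Square_(.+?)\.png$", sprite_path)
    return match.group(1) if match else None


def count_usage(image_urls):
    """Count how often each Pokemon name appears in a list of sprite URLs."""
    names = (pokemon_name(url) for url in image_urls)
    return Counter(name for name in names if name)


# The seven URLs from the output above
urls = (
    ["https://uniteapi.dev/_next/image?url=/Sprites/t_Square_Tsareena.png&w=256&q=75"] * 2
    + ["https://uniteapi.dev/_next/image?url=/Sprites/t_Square_Snorlax.png&w=256&q=75"] * 5
)
usage = count_usage(urls)
print(usage.most_common())  # [('Snorlax', 5), ('Tsareena', 2)]
```

In the Selenium script you would pass `[e.get_attribute("src") for e in pokemons]` instead of the hard-coded list.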

CodePudding user response:

The reason is that this site uses Ajax to load the Pokémon data dynamically via JavaScript.

One thing you can do is watch the Network tab in the browser's developer tools, look for the request whose response contains the data, and check whether you can call that URL directly.

This often works when web scraping, and the response usually comes back in a structured format such as JSON, which is easier to process than HTML.
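
If such a request exists, processing its response could look roughly like this. Everything specific here is hypothetical: the thread does not say what the data URL is or what the response looks like, so the payload shape (`matches`, `pokemon`) and the function name `count_from_payload` are invented purely to illustrate the idea.

```python
import json
from collections import Counter


def count_from_payload(payload):
    """Count Pokemon names in a decoded JSON payload.

    Hypothetical shape, NOT the real site's format:
    {"matches": [{"pokemon": "Snorlax"}, ...]}
    """
    return Counter(match["pokemon"] for match in payload["matches"])


# Stand-in for a response body; with requests you would instead do
# something like: payload = requests.get(data_url).json()
sample_body = (
    '{"matches": [{"pokemon": "Snorlax"}, '
    '{"pokemon": "Tsareena"}, {"pokemon": "Snorlax"}]}'
)
usage = count_from_payload(json.loads(sample_body))
print(usage.most_common(1))  # [('Snorlax', 2)]
```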

Otherwise you can do as Sac's answer suggests and wait for the data to load fully, either by checking whether an element has loaded yet or by hard-coding a sleep call, which is less clean.
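
"Checking whether an element has loaded" can also mean checking an attribute, since the question shows image tags that exist but still have `src=""`. A minimal sketch of a custom wait condition, assuming the images eventually receive a `src` without any scrolling; the name `images_with_src` is made up for illustration:

```python
PLAYED_POKEMON = '[alt="Played pokemon"]'


def images_with_src(driver):
    """Wait condition: return the images once each has a non-empty src.

    WebDriverWait.until() keeps calling this until it returns a truthy
    value, so returning False means "not ready yet".
    """
    # "css selector" is the string value of selenium's By.CSS_SELECTOR
    images = driver.find_elements("css selector", PLAYED_POKEMON)
    if images and all(img.get_attribute("src") for img in images):
        return images
    return False


# Usage with a real driver (not executed here):
#   from selenium.webdriver.support.ui import WebDriverWait
#   images = WebDriverWait(driver, 10).until(images_with_src)
```

Unlike a fixed sleep, this returns as soon as the condition holds and raises a timeout error if it never does.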
