I am trying to scrape the images from this website, but instead of the actual image src
I am getting the lazy-loading placeholder src
attribute of the images.
import time
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
url = "https://www.espncricinfo.com/series/indian-premier-league-2022-1298423/squads"
# raw string, so the backslashes in the Windows path are not treated as escapes
s = Service(r"M:\WebScraping\chromedriver.exe")
driver = webdriver.Chrome(service=s)
driver.maximize_window()
driver.get(url)
time.sleep(5)
driver.execute_script("window.scrollTo(0, 500);")
# the page is fetched again with urllib here, independently of the Selenium browser
page = urllib.request.urlopen(url)
doc = BeautifulSoup(page, "html.parser")
teams = doc.find(class_="ds-p-0").find(class_="ds-mb-4")
for team in teams:
    print(team.img["src"])
    file_name = team.img["alt"]
    img_file = open(file_name + ".png", "wb")
    img_file.write(urllib.request.urlopen(team.img["src"]).read())
    img_file.close()
This is the output I am receiving (just the lazy-loading placeholder images):
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
Instead, I want to get the actual src of each image, like this one:
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/333800/333885.png
Answer:
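BeautifulSoup only parses the HTML it is given and does not execute JavaScript, and urllib.request.urlopen returns the page as the server sends it, before any script has swapped in the real images. That's why when you run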
page = urllib.request.urlopen(url)
doc = BeautifulSoup(page, "html.parser")
you get the lazy-loading placeholder links. Selenium, on the other hand, drives a real browser that runs the page's JavaScript, so you can load the page with Selenium and then pass its page source to BeautifulSoup instead of the urllib response:
doc = BeautifulSoup(driver.page_source, "html.parser")
This way BeautifulSoup parses the rendered HTML of the page. The following code prints the URLs with both Selenium and BeautifulSoup, so that you can compare the two techniques.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
chromedriver_path = '...'
# default Chrome options; add arguments here if you need them
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=Service(chromedriver_path), options=options)
url = "https://www.espncricinfo.com/series/indian-premier-league-2022-1298423/squads"
driver.get(url)
# wait (up to 20 seconds) until the images are visible on page
images = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".ds-p-0 .ds-mb-4 img")))
# scroll to the last image, so that all images get rendered correctly
driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', images[-1])
time.sleep(2)
# PRINT URLS USING SELENIUM
print('Selenium')
for img in images:
    print(img.get_attribute('src'))
# PRINT URLS USING BEAUTIFULSOUP
doc = BeautifulSoup(driver.page_source, "html.parser")
teams = doc.find(class_="ds-p-0").find(class_="ds-mb-4")
print('BeautifulSoup')
for team in teams:
    print(team.img["src"])
Output
Selenium
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313421.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313422.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/334700/334707.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313419.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/333800/333885.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/344000/344062.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/317000/317003.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313423.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313418.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313480.logo.png
BeautifulSoup
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313421.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313422.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/334700/334707.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313419.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/333800/333885.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/344000/344062.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/317000/317003.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313423.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313418.logo.png
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/313400/313480.logo.png
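Since the question also saves each image to a file, here is a minimal sketch of that last step, written against plain (alt, src) pairs so it works with either the Selenium elements or the BeautifulSoup tags. The helper names (is_placeholder, file_name_for, download_images) are hypothetical, and treating any URL that contains "lazyimage" as a placeholder is an assumption based only on the output shown in the question.

```python
import re
import urllib.request
from pathlib import Path

# Assumption: the placeholder is recognised by this substring, taken from the
# lazyimage-noaspect.svg URL printed in the question.
PLACEHOLDER_MARKER = "lazyimage"


def is_placeholder(src):
    """Return True when src is missing or still points at the lazy-load placeholder."""
    return not src or PLACEHOLDER_MARKER in src


def file_name_for(alt, src):
    """Build a file name from the alt text, keeping the extension of the real URL."""
    # Replace characters that are unsafe in file names with underscores.
    stem = re.sub(r"[^\w\- ]", "_", alt or "").strip() or "image"
    # Ignore any query string when reading the extension; fall back to .png.
    suffix = Path(src.split("?", 1)[0]).suffix or ".png"
    return stem + suffix


def download_images(pairs, target_dir="."):
    """Save every (alt, src) pair whose src is a real image URL.

    Placeholder entries are skipped. Returns the list of paths written.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for alt, src in pairs:
        if is_placeholder(src):
            continue
        path = target / file_name_for(alt, src)
        with urllib.request.urlopen(src) as response:
            path.write_bytes(response.read())
        saved.append(path)
    return saved
```

With the Selenium list from the code above, it could be called as download_images((img.get_attribute("alt"), img.get_attribute("src")) for img in images). If the returned list is shorter than expected, some images were probably still placeholders when the attributes were read, so a longer wait or another scroll may be needed.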