Python | Selenium: issue with scrolling down and finding elements by class name

For a research study I would like to scrape some links from web pages. The links are located outside the viewport (to see them you need to scroll down the page).

Page example (https://www.twitch.tv/lirik)
Link example: https://www.amazon.com/dp/B09FVR22R2
The links are located in div class='Layout-sc-nxg1ff-0 itdjvg default-panel' (16 links in total on the page).

I have written a script, but I get an empty list:

from selenium import webdriver
import time

browser = webdriver.Firefox()
browser.get('https://www.twitch.tv/lirik')

time.sleep(3)
browser.execute_script("window.scrollBy(0,document.body.scrollHeight)")

time.sleep(3)

panel_blocks = browser.find_elements(by='class name', value='Layout-sc-nxg1ff-0 itdjvg default-panel')
browser.close()
print(panel_blocks)
print(type(panel_blocks))

I just get an empty list after the page has loaded. Here is the output from the script above:

/usr/local/bin/python /Users/greg.fetisov/PycharmProjects/baltazar_platform/Twitch_parser.py
[]
<class 'list'>

Process finished with exit code 0

P.S. When the webdriver opens the page, I see no scroll-down action. It just opens the page and then closes it once the time.sleep pauses are over.

How can I change the script to get the links properly?

Any help or advice would be appreciated!

CodePudding user response：

To print the values of the href attribute, wait for the elements with WebDriverWait and visibility_of_all_elements_located(). You can use the following locator strategy:

Using CSS_SELECTOR:

driver.get("https://www.twitch.tv/lirik")
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.Layout-sc-nxg1ff-0.itdjvg.default-panel > a")))])

Console Output:

['https://www.amazon.com/dp/B09FVR22R2', 'http://bs.serving-sys.com/Serving/adServer.bs?cn=trd&pli=1077437714&gdpr=${GDPR}&gdpr_consent=${GDPR_CONSENT_68}&adid=1085757156&ord=[timestamp]', 'https://store.epicgames.com/lirik/rumbleverse', 'https://bitly/3GP0cM0', 'https://lirik.com/', 'https://streamlabs.com/lirik', 'https://twitch.amazon.com/tp', 'https://www.twitch.tv/subs/lirik', 'https://www.youtube.com/lirik?sub_confirmation=1', 'http://www.twitter.com/lirik', 'http://www.instagram.com/lirik', 'http://gfuel.ly/lirik', 'http://www.cyberpowerpc.com/', 'https://www.cyberpowerpc.com/page/Intel/LIRIK/', 'https://discord.gg/lirik', 'http://www.amazon.com/?_encoding=UTF8&camp=1789&creative=390957&linkCode=ur2&tag=l0e6d-20&linkId=YNM2SXSSG3KWGYZ7']

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
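
The CSS selector in this answer is the question's space-separated class attribute with the class names joined by dots. A class-name locator expects a single class name, which is most likely why the compound value in the question matched nothing. Below is a minimal sketch of that conversion. The helper name class_attr_to_css is hypothetical and is not part of Selenium, and it assumes the class names contain no characters that need CSS escaping.

```python
def class_attr_to_css(tag, class_attr):
    """Build a CSS selector from a tag name and a space-separated class attribute."""
    # split() with no argument also tolerates repeated or leading/trailing spaces.
    classes = class_attr.split()
    return tag + "".join("." + name for name in classes)


# The class attribute quoted in the question, plus the direct-child anchor.
selector = class_attr_to_css("div", "Layout-sc-nxg1ff-0 itdjvg default-panel") + " > a"
print(selector)  # div.Layout-sc-nxg1ff-0.itdjvg.default-panel > a
```

The resulting string can be passed to By.CSS_SELECTOR exactly as in the snippet above.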

CodePudding user response：

You are using the wrong locator.
You should use explicit waits with expected conditions instead of hardcoded pauses.
The find_elements method returns a list of web elements, while what you want is the link (the href attribute) inside each element.

This should work better:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

browser = webdriver.Firefox()
browser.get('https://www.twitch.tv/lirik')
wait = WebDriverWait(browser, 20)

wait.until(EC.element_to_be_clickable((By.XPATH, "//div[@class='channel-panels-container']//a")))
time.sleep(0.5)

# find_elements (plural) returns a list, so the loop below can iterate over it
link_blocks = browser.find_elements(By.XPATH, "//div[@class='channel-panels-container']//a")
for link_block in link_blocks:
    link = link_block.get_attribute("href")
    print(link)

browser.close()
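
The question also mentions that window.scrollBy produced no visible scrolling. One possible reason, which is an assumption and not confirmed in this thread, is that the page scrolls inside an inner container rather than the window. Scrolling a target element into view avoids depending on that. The sketch below combines this with the explicit waits from the answers. The function names clean_links and collect_panel_links are hypothetical, and the selector "div.default-panel a" is an assumption based on the class attribute quoted in the question (the other class names look generated and may change).

```python
from urllib.parse import urlparse


def clean_links(hrefs):
    """Drop empty or non-HTTP values and duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        # get_attribute("href") can return None for anchors without an href.
        if not href or urlparse(href).scheme not in ("http", "https"):
            continue
        if href not in seen:
            seen.add(href)
            result.append(href)
    return result


def collect_panel_links(url, timeout=20):
    """Open url in Firefox, scroll the panels into view, and return their links."""
    # Imported here so clean_links() stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Firefox()
    try:
        browser.get(url)
        wait = WebDriverWait(browser, timeout)
        # Assumption: panel links sit inside a div carrying the default-panel class.
        locator = (By.CSS_SELECTOR, "div.default-panel a")
        first = wait.until(EC.presence_of_element_located(locator))
        # Scroll the element itself into view instead of scrolling the window.
        browser.execute_script("arguments[0].scrollIntoView();", first)
        anchors = wait.until(EC.presence_of_all_elements_located(locator))
        return clean_links(a.get_attribute("href") for a in anchors)
    finally:
        # quit() ends the whole session even if a wait times out.
        browser.quit()


# Example call (opens a real browser window):
# for link in collect_panel_links("https://www.twitch.tv/lirik"):
#     print(link)
```

Reading the href values before the browser is closed matters here, because the web elements become unusable once the session ends.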