Selenium webscraper not scraping desired tags

Time:01-05

Here are the two tags I am trying to scrape: https://i.stack.imgur.com/a1sVN.png. In case you are wondering, this is the link to the page (the tags I am trying to scrape are not behind the paywall): https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635

Below is the Python code I am using. Does anyone know why the tags are not being stored in paragraphs?

from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635'

driver = webdriver.Chrome()
driver.get(url)

paragraphs = driver.find_elements(By.CLASS_NAME, 'css-xbvutc-Paragraph e3t0jlg0')
print(len(paragraphs))  # => prints 0

CodePudding user response:

There are two problems here.

  1. You should wait for the page to load after you get() the webpage. A simple way to do this is to import time and call time.sleep(10).

  2. The class names that you are searching for change on every page load. However, the data-type='paragraph' attribute on those elements stays constant, so you can select on it instead:

paragraphs = driver.find_elements(By.XPATH, '//*[@data-type="paragraph"]') # search by XPath to find the elements with that data attribute

print(len(paragraphs))

After the page has loaded, this prints 2.
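Putting both points together, a minimal sketch could look like the following. The function names data_type_xpath and scrape_paragraph_texts and the settle_seconds parameter are made up for illustration and are not part of Selenium; the sketch also assumes Chrome and a matching driver are available on the machine.

```python
import time


def data_type_xpath(data_type):
    """Build an XPath matching any element whose data-type attribute equals data_type."""
    return f'//*[@data-type="{data_type}"]'


def scrape_paragraph_texts(url, settle_seconds=10):
    """Open url in Chrome, wait a fixed time, and return the text of each paragraph element."""
    # Imported here so that data_type_xpath() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        time.sleep(settle_seconds)  # crude fixed wait, as suggested in point 1
        elements = driver.find_elements(By.XPATH, data_type_xpath("paragraph"))
        return [element.text for element in elements]
    finally:
        driver.quit()  # always close the browser, even if scraping fails
```

Calling scrape_paragraph_texts(url) with the article URL from the question should then return one string per matched element, provided the page still marks its paragraphs with data-type="paragraph".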

CodePudding user response:

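To add to @Andrew Ryan's answer: you can use an explicit wait instead of a fixed sleep. It returns as soon as the elements are present, so the wait is usually shorter and adapts to how long the page actually takes to load.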

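from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one matching element to be present
paragraphs = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//*[@data-type="paragraph"]'))
)
print(len(paragraphs))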
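If no matching element shows up within the timeout, until() raises a TimeoutException. A hedged sketch of wrapping the wait so that case yields an empty list, and of collecting the text afterwards, is shown below. The helper names wait_for_paragraphs and non_empty_texts are hypothetical; the Selenium calls themselves are the same ones used above.

```python
def wait_for_paragraphs(driver, timeout=10):
    """Return the paragraph elements once present, or [] if none appear within timeout seconds."""
    # Imported here so that non_empty_texts() can be used without Selenium installed.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    locator = (By.XPATH, '//*[@data-type="paragraph"]')
    try:
        return WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located(locator)
        )
    except TimeoutException:
        # Nothing matched before the timeout expired.
        return []


def non_empty_texts(elements):
    """Collect the stripped .text of each element, skipping blank ones."""
    return [text for text in (element.text.strip() for element in elements) if text]
```

With a driver that has already loaded the page, non_empty_texts(wait_for_paragraphs(driver)) would give the paragraph strings, or an empty list if the page never exposed any.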