Here are the two tags I am trying to scrape: https://i.stack.imgur.com/a1sVN.png. In case you are wondering, this is the link to that page (the tags I am trying to scrape are not behind the paywall): https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635
Below is the Python code I am using. Does anyone know why the tags are not being stored in paragraphs?
from selenium import webdriver
from selenium.webdriver.common.by import By
url = 'https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635'
driver = webdriver.Chrome()
driver.get(url)
paragraphs = driver.find_elements(By.CLASS_NAME, 'css-xbvutc-Paragraph e3t0jlg0')
print(len(paragraphs)) # => prints 0
CodePudding user response:
You have two problems.
First, you should wait for the page to load after you get() the webpage. You can do this with something like
import time
time.sleep(10)
Second, the class names you are searching for change on every page load. However, the attribute
data-type='paragraph'
stays constant, so you can search on that instead:
paragraphs = driver.find_elements(By.XPATH, '//*[@data-type="paragraph"]') # search by XPath to find the elements with that data attribute
print(len(paragraphs))
This prints 2 once the page has loaded.
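To illustrate why matching on the data attribute is more robust than matching on the class names, here is a small offline sketch that needs no browser. The markup, the second set of class names, and the helper functions are all made up for the demonstration; only the first class string and the data-type attribute come from the question. It uses the standard library's ElementTree, which supports only a limited XPath subset, so the expression is written as .//* rather than the //* passed to Selenium.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for one load of the article page.
LOAD_1 = """
<div>
  <p class="css-xbvutc-Paragraph e3t0jlg0" data-type="paragraph">First paragraph.</p>
  <p class="css-xbvutc-Paragraph e3t0jlg0" data-type="paragraph">Second paragraph.</p>
  <p class="css-abc-Caption" data-type="caption">A caption.</p>
</div>
"""

# A second "load" where the generated class names differ (invented values).
LOAD_2 = LOAD_1.replace("css-xbvutc-Paragraph e3t0jlg0",
                        "css-q1w2e3-Paragraph e9r8t7y6")


def find_paragraphs(markup):
    """Select elements by the stable data-type attribute."""
    root = ET.fromstring(markup)
    return root.findall(".//*[@data-type='paragraph']")


def find_by_class(markup, class_value):
    """Select elements whose class attribute equals class_value exactly."""
    root = ET.fromstring(markup)
    return [el for el in root.iter() if el.get("class") == class_value]


for label, markup in (("load 1", LOAD_1), ("load 2", LOAD_2)):
    by_attr = len(find_paragraphs(markup))
    by_class = len(find_by_class(markup, "css-xbvutc-Paragraph e3t0jlg0"))
    print(label, "by data-type:", by_attr, "by class:", by_class)
```

The attribute lookup finds both paragraphs in each load, while the class lookup finds nothing once the class names change.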
CodePudding user response:
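Just to add on to @Andrew Ryan's answer, you can use an explicit wait instead of a fixed sleep. It returns as soon as the elements are present and only gives up after the timeout (10 seconds here), so the wait is usually shorter and adapts to how fast the page loads.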
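from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# wait up to 10 seconds for at least one matching element to be present
paragraphs = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//*[@data-type="paragraph"]'))
)
print(len(paragraphs))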
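For anyone curious why this tends to be faster than a fixed sleep, here is a rough conceptual sketch of the polling idea behind an explicit wait. It is not Selenium's actual implementation; poll_until and FakePage are hypothetical names invented for this example, and the fake page simply pretends to finish loading on the third check.

```python
import time


def poll_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Raises TimeoutError if nothing truthy is returned before the timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # stop as soon as the condition is met
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakePage:
    """Hypothetical stand-in for a page whose elements appear after a few checks."""

    def __init__(self, ready_after_checks):
        self.checks = 0
        self.ready_after_checks = ready_after_checks

    def find_paragraphs(self):
        self.checks += 1
        if self.checks >= self.ready_after_checks:
            return ["paragraph 1", "paragraph 2"]
        return []  # an empty list is falsy, so polling continues


page = FakePage(ready_after_checks=3)
paragraphs = poll_until(page.find_paragraphs, timeout=5.0, poll_interval=0.1)
print(len(paragraphs), "paragraphs found after", page.checks, "checks")
```

With a fixed time.sleep(10) you always pay the full 10 seconds; with polling you pay only until the first successful check, and the timeout is just an upper bound.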