I have built a web scraper in Python using Selenium. It runs without errors and opens the requested URL (though only one page, not all of them), but after the code has run there is no output: the CSV I create with pandas is empty.
Looking at my code, do you see why it does not scrape the items?
for i in range(0, 10):
    url = 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives?page=' + str(i)
    driver.get(url)
    time.sleep(random.randint(1, 11))
    driver.find_elements(By.CSS_SELECTOR, "initivative-item")
    initiative_list = []
    title = video.find_element(By.XPATH, "./html/body/app-root/ecl-app-standardised/main/div/ng-component/div/section/ux-block-content/div/initivative-item[2]/article/a/div[2]").text
    topic = video.find_element(By.XPATH, ".///html/body/app-root/ecl-app-standardised/main/div/ng-component/div/section/ux-block-content/div/initivative-item[1]/article/a/div[3]/div[2]").text
    period = video.find_element(By.XPATH, ".///html/body/app-root/ecl-app-standardised/main/div/ng-component/div/section/ux-block-content/div/initivative-item[1]/article/a/div[5]/div/div[2]").text
    initiative_item = {
        'title': [title],
        'topic': [topic],
        'period': [period]
    }
    initiative_list.extend(initiative_item)

df = pd.DataFrame(initiative_list)
print(df)
df.to_csv('file_name.csv')
I have checked the XPaths and they seem to be correct, because they do not cause any errors.
CodePudding user response:
Could you confirm that your variables title, topic and period are not empty?
Also, isn't the initialisation initiative_list = [] inside your loop? That re-creates the list on every iteration and removes all of the content already appended to it.
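Two small, self-contained illustrations of what goes wrong, runnable without Selenium (sample data stands in for the scraped values):

```python
# Bug 1: re-initialising the list inside the loop discards
# everything appended on earlier iterations.
items = []
for i in range(3):
    items = []                      # runs every pass -- the bug
    items.append({'title': f'item {i}'})
print(len(items))                   # 1, not 3

# Fix: initialise the list once, before the loop.
items = []
for i in range(3):
    items.append({'title': f'item {i}'})
print(len(items))                   # 3

# Bug 2 (from the question's code): list.extend(dict) iterates
# over the dict, so it adds only the KEYS; append keeps the record.
row = {'title': 't', 'topic': 'x', 'period': 'p'}
bad, good = [], []
bad.extend(row)
good.append(row)
print(bad)                          # ['title', 'topic', 'period']
print(good)                         # the whole dict, as one list item
```

So even with working locators, extend-ing the dict into a per-iteration list would leave pd.DataFrame with nothing useful to build rows from.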
CodePudding user response:
This should work:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
url = 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_en'
driver.get(url)

# Wait for the article list to load and save it
articles = WebDriverWait(driver, 30).until(
    EC.presence_of_all_elements_located((By.XPATH, "//article")))

# Loop once per article (XPath positions are 1-based, so include the last one)
for i in range(1, len(articles) + 1):
    # Save title, topic and period
    title = WebDriverWait(driver, 30).until(EC.presence_of_element_located(
        (By.XPATH, f"(//article)[{i}]//div[2]"))).text
    print(title)
    topic = WebDriverWait(driver, 30).until(EC.presence_of_element_located(
        (By.XPATH, f"(//article)[{i}]//div[3]/div[2]"))).text
    print(topic)
    period = WebDriverWait(driver, 30).until(EC.presence_of_element_located(
        (By.XPATH, f"(//article)[{i}]//div[5]/div/div[2]"))).text
    print(period)
Once you have the info you can do whatever you want with it.
I hope it helps.
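To get back to the original CSV goal: collect the three values per article into a list of dicts inside the loop, then build the DataFrame once, after the loop. A minimal sketch with sample values standing in for the live-scraped text (the file name is kept from the question):

```python
import pandas as pd

rows = []
# In practice these appends happen inside the scraping loop,
# with title/topic/period coming from the located elements.
rows.append({'title': 'Initiative A', 'topic': 'Energy',
             'period': '01 January 2022 - 01 March 2022'})
rows.append({'title': 'Initiative B', 'topic': 'Transport',
             'period': '15 February 2022 - 15 April 2022'})

df = pd.DataFrame(rows)
df.to_csv('file_name.csv', index=False)
print(df.shape)    # (2, 3)
```

index=False keeps the pandas row index out of the CSV, so the file contains only the three scraped columns.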