Why does my web scraper not scrape the relevant information?


I have built a web scraper in Python using Selenium. It runs without errors and opens the requested URL (though only one page, not all of them). But after the code has run, there is no output: the CSV I create with pandas is empty.

Looking at my code, do you see why it does not scrape the items?

for i in range(0, 10):
    url = 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives?page=' + str(i)
    driver.get(url)
    time.sleep(random.randint(1, 11))
    driver.find_elements(By.CSS_SELECTOR, "initivative-item")
    initiative_list = []
    title = video.find_element(By.XPATH, "./html/body/app-root/ecl-app-standardised/main/div/ng-component/div/section/ux-block-content/div/initivative-item[2]/article/a/div[2]").text
    topic = video.find_element(By.XPATH, ".///html/body/app-root/ecl-app-standardised/main/div/ng-component/div/section/ux-block-content/div/initivative-item[1]/article/a/div[3]/div[2]").text
    period = video.find_element(By.XPATH, ".///html/body/app-root/ecl-app-standardised/main/div/ng-component/div/section/ux-block-content/div/initivative-item[1]/article/a/div[5]/div/div[2]").text
    initiative_item = {
        'title': [title],
        'topic': [topic],
        'period': [period]
    }

    initiative_list.extend(initiative_item)

df = pd.DataFrame(initiative_list) 
print(df) 
df.to_csv('file_name.csv')

I have checked the XPaths and they seem to be correct, because they do not cause any errors.

CodePudding user response:

Could you confirm that your variables title, topic and period are not empty?

If they are fine, isn't the initialisation initiative_list = [] placed inside your loop? That re-creates the list on every iteration and discards everything already appended to it. (Note also that extend() called with a dict adds only the dict's keys to the list; you probably want append().)
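A minimal sketch of the pattern described above (placeholder strings stand in for the scraped .text values, so it runs without Selenium):

```python
# Sketch of the fix: initialise the list ONCE, before the loop, and
# append one dict per scraped item.

initiative_list = []                      # NOT inside the loop
for i in range(3):
    initiative_item = {
        'title':  f'title {i}',           # placeholders for the real .text values
        'topic':  f'topic {i}',
        'period': f'period {i}',
    }
    initiative_list.append(initiative_item)   # append the dict itself

# extend() called with a dict iterates it and adds only its KEYS:
keys_only = []
keys_only.extend({'title': 't', 'topic': 'x'})
print(keys_only)              # ['title', 'topic']
print(len(initiative_list))   # 3
```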

CodePudding user response:

This should work:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
url = 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_en'
driver.get(url)

# We save the article list
articles = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, "//article")))

# We loop once per article; XPath positions are 1-based, so go up to len(articles)
for i in range(1, len(articles) + 1):
    # We save title, topic and period
    title = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"(//article)[{i}]//div[2]"))).text
    print(title)
    topic = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"(//article)[{i}]//div[3]/div[2]"))).text
    print(topic)
    period = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"(//article)[{i}]//div[5]/div/div[2]"))).text
    print(period)

Once you have the info you can do whatever you want with it.
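To get from there back to the CSV the question was after, the usual pattern is to collect one dict per article and build the DataFrame once at the end. A sketch with placeholder values standing in for the three .text calls (assuming pandas is installed, as in the question):

```python
import pandas as pd

rows = []
for i in range(1, 4):                 # stands in for the article loop above
    rows.append({
        'title':  f'title {i}',       # placeholders for the scraped .text values
        'topic':  f'topic {i}',
        'period': f'period {i}',
    })

df = pd.DataFrame(rows)
df.to_csv('file_name.csv', index=False)   # index=False drops the row-number column
print(df.shape)                           # (3, 3)
```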

I hope it helps
