How to solve Python selenium webdriver returning blank files


I'm currently in the process of systematically scraping data from an online retailer's website. I have been doing this once a week for 2 months and my Python code has been working great, but when I tried to run it today it returned blank files instead of my usual data. I tried multiple ways to solve this but haven't managed to fix it. I tried switching to geckodriver, but got the same result. I also updated Selenium, chromedriver, and Chrome... but no luck. Does anyone have suggestions on how to fix this? (This is my first post, so hopefully I displayed the code clearly.)

        from bs4 import BeautifulSoup
        from selenium import webdriver
        import numpy
        import pandas as pd


        url = "https://www.zalando.be/sportsokken/_zwart/"

        driver = webdriver.Chrome(executable_path="/Users/lisabyloos/Downloads/chromedriver")
        pages = numpy.arange(1, 3, 1)
        for page in pages:
            # Load the paginated listing and grab the rendered HTML
            driver.get(url + "?p=" + str(page))
            html_content = driver.execute_script('return document.body.innerHTML')

            soup = BeautifulSoup(html_content, "lxml")

            # Product cards are matched by their (generated) class names
            product_divs = soup.find_all("div", attrs={"class": "_4qWUe8 w8MdNG cYylcv QylWsg SQGpu8 iOzucJ JT3_zV DvypSJ"})

            results = []
            for product in product_divs:
                results.append(product.get_text(separator=";"))

            # Write one CSV per page
            df = pd.DataFrame([sub.split(";") for sub in results])
            df.to_csv("myfile" + str(page) + ".csv")

CodePudding user response:

What happens?

The classes of the elements you are trying to find are dynamically generated and have changed.

Note: pages change from time to time, but changes to the structure are rarer than changes to the styling. It is therefore generally a better strategy to select by element type or id rather than by class.
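
As a small illustration (the class and data-testid values here are made up, not taken from the actual site), selecting by tag name or a stable attribute survives a restyling, while selecting by a generated class does not:

        from bs4 import BeautifulSoup

        html = '<article class="_4qWUe8" data-testid="product-card">Sock;9,99 EUR</article>'
        soup = BeautifulSoup(html, "lxml")

        soup.find_all("article", attrs={"class": "_4qWUe8"})   # breaks as soon as the generated class changes
        soup.find_all("article")                                # keeps working
        soup.find_all(attrs={"data-testid": "product-card"})    # keeps working as long as the attribute is stable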

How to fix?

Adjust the selection criteria to get your results:

product_divs = soup.find_all('article')
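
As a minimal sketch of how the adjusted selector slots into the original loop (assuming, as in the question, that the article elements on the listing page still expose the product name and price as text):

        from bs4 import BeautifulSoup
        from selenium import webdriver
        import pandas as pd

        url = "https://www.zalando.be/sportsokken/_zwart/"
        driver = webdriver.Chrome(executable_path="/Users/lisabyloos/Downloads/chromedriver")

        for page in range(1, 3):
            driver.get(url + "?p=" + str(page))
            soup = BeautifulSoup(driver.execute_script('return document.body.innerHTML'), "lxml")

            # Select the product cards by element type instead of the generated class names
            product_divs = soup.find_all("article")

            results = [product.get_text(separator=";") for product in product_divs]
            pd.DataFrame([sub.split(";") for sub in results]).to_csv("myfile" + str(page) + ".csv")

If this still produces empty files, printing len(product_divs) per page is a quick way to check whether the selector matches anything at all.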