InvalidArgumentException: invalid argument error using Selenium and Pandas scraping urls reading from CSV


I am trying to scrape a website; the possible urls are stored in a CSV file. After reading the file, I loop over the urls and call my method, which should open each url and scrape the contents of the site.

But for some reason I am unable to loop through and open the urls.

Here is my code:

from selenium import webdriver
import time
import pandas as pd

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--user-agent="Mozilla/5.0 (Windows Phone 10.0; Android 4.2.1; Microsoft; Lumia 640 XL LTE) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Mobile Safari/537.36 Edge/12.10166"')

driver = webdriver.Chrome(chrome_options=chrome_options)

csv_data = pd.read_csv("./data_new.csv")
df = pd.DataFrame(csv_data)

urls = df['url'] 

print(urls[:5])

def scrap_site(url):
    
    print("Recived URL ---> ", url)
    
    driver.get(url)
    time.sleep(5)

    driver.quit()

for url in urls:
    print("URL ---> ", url)
    scrap_site(url)
    

This is the error I am getting on the console:

Traceback (most recent call last):
  File "/media/sf_shared_folder/scrape_project/course.py", line 56, in <module>
    scrap_site(url)
  File "/media/sf_shared_folder/scrape_project/course.py", line 35, in scrap_site
    driver.get(url)
  File "/home/mujib/anaconda3/envs/spyder/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 333, in get
    self.execute(Command.GET, {'url': url})
  File "/home/mujib/anaconda3/envs/spyder/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/mujib/anaconda3/envs/spyder/lib/python3.9/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
InvalidArgumentException: invalid argument
  (Session info: chrome=96.0.4664.45)

The CSV file has the following format:

url
http://www.somesite.com
http://www.someothersite.com

CodePudding user response:

For a CSV file like:

urls
https://stackoverflow.com/
https://www.google.com/

once you read it through pandas, store it in a DataFrame, and extract the column with the command:

urls = df['urls']

On printing it you will observe that the items are printed together with the column indexes, since df['urls'] returns a pandas Series:

urls = df['urls']
print(urls)

Console Output:

0    https://stackoverflow.com/
1    https://www.google.com/
Name: urls, dtype: object

Here the urls within the Series are not plain, valid url strings. Hence you see the error for the command:

self.execute(Command.GET, {'url': url})
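
The difference is easy to see side by side. A minimal sketch, assuming the two-row CSV above (the DataFrame literal simply stands in for read_csv):

import pandas as pd

# Stand-in for pd.read_csv("./data_new.csv") with the two example rows
df = pd.DataFrame({'urls': ['https://stackoverflow.com/', 'https://www.google.com/']})

print(df['urls'])           # a pandas Series: prints with the 0, 1, ... index column
print(df['urls'].tolist())  # a plain Python list of url strings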

Solution

Effectively, you need to suppress the column indexes by converting the Series to a plain list with tolist() as follows:

urls = df['urls'].tolist()
print(urls)

Console Output:

['https://stackoverflow.com/', 'https://www.google.com/']

This list contains valid urls that can be invoked through get() for further scraping.
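
Applied to the code in the question (whose column is named url), the change is a one-liner:

import pandas as pd

df = pd.read_csv("./data_new.csv")  # the question's CSV, with a url header row
urls = df['url'].tolist()           # plain list of url strings instead of a Series

for url in urls:
    scrap_site(url)                 # the question's function, unchanged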

CodePudding user response:

You need to put driver = webdriver.Chrome(chrome_options=chrome_options) inside the loop. Once driver.quit() is called, you have to define driver again before you can call driver.get().

from selenium import webdriver
import time
import pandas as pd

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--user-agent="Mozilla/5.0 (Windows Phone 10.0; Android 4.2.1; Microsoft; Lumia 640 XL LTE) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Mobile Safari/537.36 Edge/12.10166"')

# driver = webdriver.Chrome(chrome_options=chrome_options)

csv_data = pd.read_csv("./data_new.csv")
df = pd.DataFrame(csv_data)

# urls = df['url'] 
urls = ['https://stackoverflow.com/',
    'https://www.yahoo.com/']

print(urls[:5])

def scrap_site(url):
    ############## OPEN THE DRIVER HERE ##############
    driver = webdriver.Chrome(chrome_options=chrome_options)
    ############## OPEN THE DRIVER HERE ##############
    
    print("Recived URL ---> ", url)
    
    driver.get(url)
    time.sleep(5)

    driver.quit()

for url in urls:
    print("URL ---> ", url)
    scrap_site(url)
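
Recreating Chrome for every URL works, but launching a new browser process per page is slow. If per-page isolation is not needed, an alternative sketch (same assumptions about data_new.csv as in the question) is to create the driver once, keep driver.quit() out of the function, and quit a single time after the loop:

from selenium import webdriver
import time
import pandas as pd

chrome_options = webdriver.ChromeOptions()
# add any other options here, e.g. the custom user agent from the question

driver = webdriver.Chrome(chrome_options=chrome_options)

urls = pd.read_csv("./data_new.csv")['url'].tolist()

def scrap_site(url):
    print("Received URL ---> ", url)
    driver.get(url)
    time.sleep(5)

for url in urls:
    print("URL ---> ", url)
    scrap_site(url)

# A single quit after the loop closes the browser and ends the session
driver.quit()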