Python Selenium-BeautifulSoup

paths = ['/html/body/div[2]/main/div[2]/div[2]/div[1]/section/div[2]/div[2]/div/div[1]/a/h3',
'/html/body/div[2]/main/div[2]/div[2]/div[1]/section/div[2]/div[3]/div/div[1]/a/h3',
'/html/body/div[2]/main/div[2]/div[2]/div[1]/section/div[2]/div[4]/div/div[1]/a/h3',]

urls = []
for path in paths:
    element = driver.find_element(By.XPATH, path)
    url = element.get_attribute('href')
    urls.append(url)
    element.click()
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    element = soup.find('p')
    if element:
        element_text = element.get_text()
        print(element_text)
    else:
        print(f"No p tag found in {url}")

driver.get(url)
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 455, in get
self.execute(Command.GET, {"url": url})
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 444, in execute
self.error_handler.check_response(response)
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 249, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: 'url' must be a string

This is the problem. How can I fix it?

CodePudding user response：

Per the (unofficial) documentation, if get_attribute() cannot find the attribute on the element for whatever reason (e.g. wrong XPath, wrong attribute name), it returns None. None is not a string; it is Python's built-in null value, so driver.get(None) fails with "'url' must be a string". That's most likely your issue.

This method will first try to return the value of a property with the given name. If a property with that name doesn’t exist, it returns the value of the attribute with the same name. If there’s no attribute with that name, None is returned.
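
A minimal sketch of checking for that None before the value reaches driver.get(). The helper name require_href and its label parameter are hypothetical, not part of Selenium; the only Selenium call assumed is element.get_attribute(name).

```python
def require_href(element, label):
    """Return the element's href, or raise a readable error if it is missing.

    `element` is assumed to be anything exposing Selenium's
    get_attribute(name) method; `label` only appears in the error message.
    """
    href = element.get_attribute("href")
    # get_attribute() returns None when there is no such property/attribute
    if not isinstance(href, str) or not href:
        raise ValueError(f"{label}: element has no href (got {href!r})")
    return href
```

Calling require_href(element, path) in place of element.get_attribute('href') would make the loop fail at the line that produced the bad value, rather than later inside driver.get().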

CodePudding user response：

The xpath:

/html/body/div[2]/main/div[2]/div[2]/div[1]/section/div[2]/div[2]/div/div[1]/a/h3

refers to the <h3> element, which doesn't have an href attribute. But of course the parent <a> tag has the href attribute set.

So the value assigned to url was None in every iteration:

url = element.get_attribute('href')

Hence you see the error:

selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: 'url' must be a string

Solution

You need to move up to the parent <a> tag and read its href attribute. You can use the following locator strategy:

/html/body/div[2]/main/div[2]/div[2]/div[1]/section/div[2]/div[2]/div/div[1]/a/h3//parent::a[1]
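
Since every XPath in the question ends in /a/h3, an equivalent way to reach the <a> is to drop the final /h3 step. The sketch below does that and also reads all the hrefs before navigating anywhere, because once driver.get(url) loads a detail page the remaining XPaths would be evaluated against that page instead of the listing. The names to_anchor_xpath and collect_hrefs are hypothetical helpers, and driver is assumed to be a Selenium WebDriver.

```python
def to_anchor_xpath(h3_xpath):
    """Drop a trailing '/h3' step so the XPath targets the enclosing <a>."""
    suffix = "/h3"
    return h3_xpath[:-len(suffix)] if h3_xpath.endswith(suffix) else h3_xpath


def collect_hrefs(driver, h3_paths):
    """Read every href while the listing page is still loaded.

    `driver` is assumed to be a Selenium WebDriver. The literal "xpath"
    is the value of By.XPATH, used here so the sketch needs no imports.
    """
    hrefs = []
    for path in h3_paths:
        link = driver.find_element("xpath", to_anchor_xpath(path))
        href = link.get_attribute("href")
        if href:  # skip links without a usable href
            hrefs.append(href)
    return hrefs
```

With the question's driver and paths, looping over collect_hrefs(driver, paths) and calling driver.get(url) on each result would replace both the element.click() and the per-iteration lookup.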