I'm very new to this and have spent hours trying various methods I've read here, so apologies if I'm making some silly mistake. I want to create a database of my LEGO sets, pulling images and info from brickset.com.
I'm using:
anchors = driver.find_elements_by_xpath('//*[@id="ui-tabs-2"]/ul/li[1]/a')
anchors = [a.get_attribute('href') for a in anchors]
print(anchors) isn't returning the href attributes I'm expecting.
What I'm trying to target:
div id="ui-tabs-2" aria-live="polite" aria-labelledby="ui-id-4" role="tabpanel" aria-expanded="true" aria-hidden="false" style="display: block;">
<ul >
<li>
<a href="https://images.brickset.com/sets/AdditionalImages/21054-1/21054_alt10.jpg" onclick="return hs.expand(this)">
<img src="https://images.brickset.com/sets/AdditionalImages/21054-1/tn_21054_alt10_jpg.jpg" title="" one rror="this.src='/assets/images/spacer2.png'" loading="lazy">
</a><div >
I'm losing my mind trying to figure this out.
Update: Still not getting the href attributes. To add more detail, I'm trying to get the images under the "Images" tab on this URL: https://brickset.com/sets/21330-1/Home-Alone. Here is the problematic code:
anchors = driver.find_elements(By.XPATH, '//*[@id="ui-tabs-2"]/ul/li/a')
links = [anchors.get_attribute('href') for a in anchors]
print('Found ' + str(len(anchors)) + ' links to images')
I've also tried:
#anchors = driver.find_elements_by_css_selector("a[href*='21330']")
This only returned one href, even though there should be about a dozen.
Thank you all for the assistance!
CodePudding user response:
You shouldn't be using the same name for multiple variables.
As per the first line of code:
anchors = driver.find_elements_by_xpath('//*[@id="ui-tabs-2"]/ul/li[1]/a')
anchors is the list of WebElements. Ideally, to create another list with the href attributes, you should use a different name, e.g. hrefs.
Effectively, your code block will be:
anchors = driver.find_elements_by_xpath('//*[@id="ui-tabs-2"]/ul/li[1]/a')
hrefs = [a.get_attribute('href') for a in anchors]
print(hrefs)
Using list comprehension in a single line:
print([a.get_attribute('href') for a in driver.find_elements_by_xpath('//*[@id="ui-tabs-2"]/ul/li[1]/a')])
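As a side note, find_elements_by_xpath is deprecated in Selenium 4; a rough equivalent of the block above using the newer By.XPATH style (just a sketch, assuming the same driver and page) would be:
from selenium.webdriver.common.by import By

anchors = driver.find_elements(By.XPATH, '//*[@id="ui-tabs-2"]/ul/li[1]/a')
hrefs = [a.get_attribute('href') for a in anchors]
print(hrefs)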
CodePudding user response:
First thing, driver.find_elements_by_xpath is deprecated; use driver.find_elements(By.XPATH, 'locator') instead.
Now, if you'd like to get all the hrefs of the links on the page:
elements = driver.find_elements(By.XPATH, '//*[@id="ui-tabs-2"]/ul/li/a')
links = [element.get_attribute('href') for element in elements]
Notice that I'm not using [1] to get a single element, but rather matching all of the elements.
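If the anchors under the tab are only added to the DOM after the page finishes loading, an explicit wait may also be needed before collecting them. A minimal end-to-end sketch, assuming Chrome and the XPath from the question (adjust the timeout to taste):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://brickset.com/sets/21330-1/Home-Alone')

# wait up to 10 seconds for the anchors inside the tab panel to appear
elements = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//*[@id="ui-tabs-2"]/ul/li/a'))
)
links = [element.get_attribute('href') for element in elements]
print(f'Found {len(links)} links to images')

driver.quit()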
CodePudding user response:
You might want to try this. NOTE: I'm not using selenium here.
import time
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:99.0) Gecko/20100101 Firefox/99.0",
}

sample_urls = [
    "https://brickset.com/sets/21330-1/Home-Alone",
    "https://brickset.com/sets/21101-1/Hayabusa",
]

with requests.Session() as s:
    for sample_url in sample_urls:
        # grab the first link whose href contains "mainImage"
        ajax_setID = [
            a["href"] for a in
            BeautifulSoup(s.get(sample_url, headers=headers).text, "lxml").find_all("a", href=True)
            if "mainImage" in a["href"]
        ][0]
        # append a millisecond timestamp as a cache-busting query parameter
        image_url = f"https://brickset.com{ajax_setID}&_{int(time.time() * 1000)}"
        # mimic the site's AJAX request headers for the follow-up call
        headers.update(
            {
                "Referer": sample_url,
                "X-Requested-With": "XMLHttpRequest",
            }
        )
        # parse the AJAX response and pull the src of the first <img> in it
        source_image = (
            BeautifulSoup(
                s.get(image_url, headers=headers).text, "lxml"
            ).find("img")["src"]
        )
        print(f"{sample_url.split('/')[-1]} -> {source_image}")
This should output:
Home-Alone -> https://images.brickset.com/sets/images/21330-1.jpg?202109060933
Hayabusa -> https://images.brickset.com/sets/images/21101-1.jpg?201201150457
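If the goal is to save the images locally for the database, here is a small follow-up sketch (the URL is just the Home-Alone result from the output above; the filename handling is my own assumption):
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:99.0) Gecko/20100101 Firefox/99.0",
}
image_url = "https://images.brickset.com/sets/images/21330-1.jpg?202109060933"

response = requests.get(image_url, headers=headers)
response.raise_for_status()

# derive a filename like "21330-1.jpg" from the URL, dropping the query string
filename = image_url.rsplit("/", 1)[-1].split("?")[0]
with open(filename, "wb") as f:
    f.write(response.content)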