I'm very new to this and have spent hours trying various methods I've read here, so apologies if I'm making some silly mistake. I want to create a database of my LEGO sets, pulling images and info from brickset.com.
I'm using:
anchors = driver.find_elements_by_xpath('//*[@id="ui-tabs-2"]/ul/li[1]/a')
anchors = [a.get_attribute('href') for a in anchors]
print(anchors) isn't returning the href attributes I'm expecting.
What I'm trying to target:
div id="ui-tabs-2" aria-live="polite" aria-labelledby="ui-id-4" role="tabpanel" aria-expanded="true" aria-hidden="false" style="display: block;">
<ul >
<li>
<a href="https://images.brickset.com/sets/AdditionalImages/21054-1/21054_alt10.jpg" onclick="return hs.expand(this)">
<img src="https://images.brickset.com/sets/AdditionalImages/21054-1/tn_21054_alt10_jpg.jpg" title="" one rror="this.src='/assets/images/spacer2.png'" loading="lazy">
</a><div >
I'm losing my mind trying to figure this out.
Update: Still not getting the href attributes. To add more detail, I'm trying to get the images under the "Images" tab on this URL: https://brickset.com/sets/21330-1/Home-Alone. Here is the problematic code:
anchors = driver.find_elements(By.XPATH, '//*[@id="ui-tabs-2"]/ul/li/a')
links = [anchors.get_attribute('href') for a in anchors]
print('Found ' + str(len(anchors)) + ' links to images')
I've also tried:
#anchors = driver.find_elements_by_css_selector("a[href*='21330']")
This only returned one href, even though there should be about a dozen.
Thank you all for the assistance!
CodePudding user response:
You shouldn't be using the same name for multiple variables.
As per the first line of code:
anchors = driver.find_elements_by_xpath('//*[@id="ui-tabs-2"]/ul/li[1]/a')
anchors is the list of WebElements. Ideally, to create another list with the href attributes, you should use a different name, e.g. hrefs.
Effectively, your code block will be:
anchors = driver.find_elements_by_xpath('//*[@id="ui-tabs-2"]/ul/li[1]/a')
hrefs = [a.get_attribute('href') for a in anchors]
print(hrefs)
Using list comprehension in a single line:
print([a.get_attribute('href') for a in driver.find_elements_by_xpath('//*[@id="ui-tabs-2"]/ul/li[1]/a')])
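As a side note, find_elements_by_xpath is deprecated in Selenium 4; a rough equivalent of the block above using the newer By.XPATH style (just a sketch, assuming the same driver and page) would be:
from selenium.webdriver.common.by import By

anchors = driver.find_elements(By.XPATH, '//*[@id="ui-tabs-2"]/ul/li[1]/a')
hrefs = [a.get_attribute('href') for a in anchors]
print(hrefs)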
CodePudding user response:
First thing, driver.find_elements_by_xpath is deprecated; use driver.find_elements(By.XPATH, 'locator') instead.
Now, if you'd like to get all the hrefs of the links on the page:
elements = driver.find_elements(By.XPATH, '//*[@id="ui-tabs-2"]/ul/li/a')
links = [element.get_attribute('href') for element in elements]
Notice that I'm not using [1] to get a single element, but rather matching all of the elements.
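If the anchors under the tab are only added to the DOM after the page finishes loading, an explicit wait may also be needed before collecting them. A minimal end-to-end sketch, assuming Chrome and the XPath from the question (adjust the timeout to taste):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://brickset.com/sets/21330-1/Home-Alone')

# wait up to 10 seconds for the anchors inside the tab panel to appear
elements = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//*[@id="ui-tabs-2"]/ul/li/a'))
)
links = [element.get_attribute('href') for element in elements]
print(f'Found {len(links)} links to images')

driver.quit()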
CodePudding user response:
You might want to try this. NOTE: I'm not using selenium here.
import time
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:99.0) Gecko/20100101 Firefox/99.0",
}

sample_urls = [
    "https://brickset.com/sets/21330-1/Home-Alone",
    "https://brickset.com/sets/21101-1/Hayabusa",
]

with requests.Session() as s:
    for sample_url in sample_urls:
        # grab the first link whose href contains "mainImage"
        ajax_setID = [
            a["href"] for a in
            BeautifulSoup(s.get(sample_url, headers=headers).text, "lxml").find_all("a", href=True)
            if "mainImage" in a["href"]
        ][0]
        # append a millisecond timestamp as a cache-busting query parameter
        image_url = f"https://brickset.com{ajax_setID}&_{int(time.time() * 1000)}"
        # mimic the site's AJAX request headers for the follow-up call
        headers.update(
            {
                "Referer": sample_url,
                "X-Requested-With": "XMLHttpRequest",
            }
        )
        # parse the AJAX response and pull the src of the first <img> in it
        source_image = (
            BeautifulSoup(
                s.get(image_url, headers=headers).text, "lxml"
            ).find("img")["src"]
        )
        print(f"{sample_url.split('/')[-1]} -> {source_image}")
This should output:
Home-Alone -> https://images.brickset.com/sets/images/21330-1.jpg?202109060933
Hayabusa -> https://images.brickset.com/sets/images/21101-1.jpg?201201150457
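If the goal is to save the images locally for the database, here is a small follow-up sketch (the URL is just the Home-Alone result from the output above; the filename handling is my own assumption):
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:99.0) Gecko/20100101 Firefox/99.0",
}
image_url = "https://images.brickset.com/sets/images/21330-1.jpg?202109060933"

response = requests.get(image_url, headers=headers)
response.raise_for_status()

# derive a filename like "21330-1.jpg" from the URL, dropping the query string
filename = image_url.rsplit("/", 1)[-1].split("?")[0]
with open(filename, "wb") as f:
    f.write(response.content)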