I am scraping a PrestaShop website and want to get a list of URLs for all images of a product. However, I am getting duplicate values (all links repeat themselves). I have tried creating a dictionary to remove the duplicates, but it does not seem to work. I also can't seem to remove the span tags from the reference number (unwrap does not work): it keeps returning a 'None' attribute error, which is confusing, because all products have a reference number. I have tried turning the result into a string, but that does not work either.
Here is the code:
import requests
from bs4 import BeautifulSoup
testlink = 'https://trgovina.audiopro.si/si/bas-glave/36037-81020104.html'
r = requests.get(testlink)
soup = BeautifulSoup(r.content, 'html.parser')
imagelinks = []
name = soup.find('h1', class_='product_name').text.strip()
reference = soup.find('div', class_='product-reference_top product-reference')
reference_number = reference.find('span')
images = soup.find_all('li', class_='thumb-container')
for item in images:
    image = item.find('img').attrs['src']
    imagelinks.append(image)
print(imagelinks)
CodePudding user response:
Use .text to get the number without the <span> tag:
reference_number = reference.find('span').text
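If find() ever returns None (for example on a page whose markup differs), calling .text on it raises an AttributeError. The sketch below guards against that case. It is only an illustration: the HTML string and the get_reference helper are made up for this example, and the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the selectors used above.
html = '''
<div class="product-reference_top product-reference">
  <label>Reference:</label> <span> 81020104 </span>
</div>
'''

def get_reference(soup):
    """Return the reference number as a string, or None if it is missing."""
    span = soup.select_one('div.product-reference span')
    if span is None:
        return None
    # get_text(strip=True) drops the tag and the surrounding whitespace.
    return span.get_text(strip=True)

print(get_reference(BeautifulSoup(html, 'html.parser')))
print(get_reference(BeautifulSoup('<p>no reference here</p>', 'html.parser')))
```

The second call prints None instead of raising, so a missing reference can be handled explicitly.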
Use set() instead of a list to skip duplicate items:
imagelinks = set()
# ...
imagelinks.add(image)
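Note that a set does not keep the order in which the images were found. If the order matters, one possible alternative is to deduplicate with dict.fromkeys(), which keeps the first occurrence of each item. A minimal sketch, where unique_in_order is a hypothetical helper and the URLs are placeholders rather than values from the real page:

```python
def unique_in_order(urls):
    """Drop repeated URLs while keeping the first occurrence of each."""
    # Dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(urls))

# Made-up URLs standing in for the scraped src values.
scraped = [
    'https://example.com/img/1.jpg',
    'https://example.com/img/2.jpg',
    'https://example.com/img/1.jpg',
    'https://example.com/img/2.jpg',
]
print(unique_in_order(scraped))
```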
Full working code:
import requests
from bs4 import BeautifulSoup
testlink = 'https://trgovina.audiopro.si/si/bas-glave/36037-81020104.html'
r = requests.get(testlink)
soup = BeautifulSoup(r.content, 'html.parser')
imagelinks = set()
name = soup.find('h1', class_='product_name').text.strip()
reference = soup.find('div', class_='product-reference_top product-reference')
reference_number = reference.find('span').text
print(reference_number)
images = soup.find_all('li', class_='thumb-container')
for item in images:
    image = item.find('img').attrs['src']
    imagelinks.add(image)
print(imagelinks)
print('len:', len(imagelinks))
EDIT:
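Alternatively, get the images only from <div id="thumb_box">, which should avoid picking up the same thumbnails from other parts of the page. You can do this using find().find_all():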
images = soup.find('div', {'id':'thumb_box'}).find_all('li', class_='thumb-container')
or using a CSS selector:
images = soup.select('div#thumb_box li.thumb-container')
Full working code using find().find_all():
import requests
from bs4 import BeautifulSoup
testlink = 'https://trgovina.audiopro.si/si/bas-glave/36037-81020104.html'
r = requests.get(testlink)
soup = BeautifulSoup(r.content, 'html.parser')
imagelinks = []
name = soup.find('h1', class_='product_name').text.strip()
reference = soup.find('div', class_='product-reference_top product-reference')
reference_number = reference.find('span').text
print(reference_number)
images = soup.find('div', {'id':'thumb_box'}).find_all('li', class_='thumb-container')
# Alternative with a CSS selector:
# images = soup.select('div#thumb_box li.thumb-container')
for item in images:
    image = item.find('img').attrs['src']
    imagelinks.append(image)
print(imagelinks)
print('len:', len(imagelinks))
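To see why limiting the search to thumb_box helps, here is a small offline sketch. The HTML is invented for illustration (the second container, other_box, is an assumption and not taken from the real page); it only shows how the same thumb-container items in two places lead to duplicates, and how scoping the selector avoids them.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the same thumbnails appear in two containers.
html = '''
<div id="thumb_box">
  <ul>
    <li class="thumb-container"><img src="/img/a.jpg"></li>
    <li class="thumb-container"><img src="/img/b.jpg"></li>
  </ul>
</div>
<div id="other_box">
  <ul>
    <li class="thumb-container"><img src="/img/a.jpg"></li>
    <li class="thumb-container"><img src="/img/b.jpg"></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Searching the whole document finds every thumbnail twice.
everywhere = [li.img['src'] for li in soup.find_all('li', class_='thumb-container')]

# Restricting the search to #thumb_box finds each thumbnail once.
scoped = [img['src'] for img in soup.select('div#thumb_box li.thumb-container img')]

print(len(everywhere))
print(scoped)
```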