Getting duplicate links for images in scraping with BeautifulSoup

Time:12-11

I am scraping a PrestaShop website and want to get a list of URLs for all images of a product. However, I am getting duplicate values (every link is repeated). I have tried creating a dictionary to remove the duplicates, but it does not seem to work. I also cannot remove the span tags from the reference number: unwrap() does not work and I keep getting a 'None' attribute error, which is confusing because all products have a reference number. I have tried converting the result to a string, but that fails as well.

Here is the code:

import requests
from bs4 import BeautifulSoup

testlink = 'https://trgovina.audiopro.si/si/bas-glave/36037-81020104.html'

r = requests.get(testlink)
soup = BeautifulSoup(r.content, 'html.parser')
imagelinks = []
name = soup.find('h1', class_='product_name').text.strip()
reference = soup.find('div', class_='product-reference_top product-reference')
reference_number = reference.find('span')
images = soup.find_all('li', class_='thumb-container')
for item in images:
    image = item.find('img').attrs['src']
    imagelinks.append(image)
print(imagelinks)

Answer:

Use .text to get the reference number without the surrounding <span> tag:

reference_number = reference.find('span').text
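
A 'None' attribute error like the one described in the question usually means that find() matched nothing and returned None, so the next attribute access fails. The following is a minimal sketch of guarding against that. The inline HTML is hypothetical markup standing in for the real product page, and find_text is an illustrative helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real product page.
html = """
<div class="product-reference_top product-reference">
  <label>Reference:</label> <span>81020104</span>
</div>
"""


def find_text(parent, name, **kwargs):
    """Return the stripped text of the first matching tag, or None if absent."""
    if parent is None:
        return None
    tag = parent.find(name, **kwargs)
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(html, 'html.parser')

# The div exists, so the span text is returned.
reference = soup.find('div', class_='product-reference_top product-reference')
reference_number = find_text(reference, 'span')

# This div does not exist; the helper returns None instead of raising.
missing = find_text(soup.find('div', class_='no-such-class'), 'span')

print(reference_number)
print(missing)
```

If the helper returns None on the real page, the selector probably does not match the HTML that requests actually received, which is worth checking before anything else.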

Use a set() instead of a list to skip duplicate items (note that a set does not preserve the original order):

imagelinks = set()

# ... 

    imagelinks.add(image)
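
If the order of the images matters, one alternative to a set is to deduplicate through dictionary keys, which are unique and keep insertion order in Python 3.7+. A minimal sketch, using a hypothetical list of URLs in place of the scraped ones:

```python
# Hypothetical list of scraped URLs containing repeats.
links = [
    'https://example.com/img/1.jpg',
    'https://example.com/img/2.jpg',
    'https://example.com/img/1.jpg',
    'https://example.com/img/3.jpg',
    'https://example.com/img/2.jpg',
]

# dict.fromkeys keeps the first occurrence of each URL, in original order.
unique_links = list(dict.fromkeys(links))

print(unique_links)
```

This keeps the first occurrence of each link, so the resulting list follows the order in which the images appear on the page.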

Full working code:

import requests
from bs4 import BeautifulSoup

testlink = 'https://trgovina.audiopro.si/si/bas-glave/36037-81020104.html'

r = requests.get(testlink)
soup = BeautifulSoup(r.content, 'html.parser')

imagelinks = set()

name = soup.find('h1', class_='product_name').text.strip()

reference = soup.find('div', class_='product-reference_top product-reference')
reference_number = reference.find('span').text
print(reference_number)

images = soup.find_all('li', class_='thumb-container')

for item in images:
    image = item.find('img').attrs['src']
    imagelinks.add(image)

print(imagelinks)
print('len:', len(imagelinks))


EDIT:

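Alternatively, keep the list and get the images only from <div id="thumb_box">, so that any li.thumb-container elements elsewhere on the page are skipped.

Using find().find_all():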

images = soup.find('div', {'id':'thumb_box'}).find_all('li', class_='thumb-container')

Or using a CSS selector:

images = soup.select('div#thumb_box li.thumb-container')
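
To illustrate why scoping the search helps, here is a small sketch run against hypothetical inline HTML in which the same thumbnails appear in two containers. The second container, other_box, is an assumption made for the example; the real page may repeat the thumbnails in some other way.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: the same thumbnails are rendered twice.
html = """
<div id="thumb_box">
  <ul>
    <li class="thumb-container"><img src="/img/a.jpg"></li>
    <li class="thumb-container"><img src="/img/b.jpg"></li>
  </ul>
</div>
<div id="other_box">
  <ul>
    <li class="thumb-container"><img src="/img/a.jpg"></li>
    <li class="thumb-container"><img src="/img/b.jpg"></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Page-wide search: picks up both containers, so every link appears twice.
all_links = [img['src'] for img in soup.select('li.thumb-container img[src]')]

# Scoped search: only thumbnails inside <div id="thumb_box">.
scoped_links = [
    img['src']
    for img in soup.select('div#thumb_box li.thumb-container img[src]')
]

print(len(all_links), all_links)
print(len(scoped_links), scoped_links)
```

The img[src] part of the selector also skips any <img> tag without a src attribute, which would otherwise raise a KeyError.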

Full working code:

import requests
from bs4 import BeautifulSoup

testlink = 'https://trgovina.audiopro.si/si/bas-glave/36037-81020104.html'

r = requests.get(testlink)
soup = BeautifulSoup(r.content, 'html.parser')

imagelinks = []

name = soup.find('h1', class_='product_name').text.strip()

reference = soup.find('div', class_='product-reference_top product-reference')
reference_number = reference.find('span').text
print(reference_number)

images = soup.find('div', {'id':'thumb_box'}).find_all('li', class_='thumb-container')
# Equivalent CSS selector form:
# images = soup.select('div#thumb_box li.thumb-container')

for item in images:
    image = item.find('img').attrs['src']
    imagelinks.append(image)

print(imagelinks)
print('len:', len(imagelinks))