Adapt python web scraper to filter list of image URLs

Time:03-20

I have minimal coding knowledge and am trying, with limited success, to adapt some code previously written for me to changes in the source website's structure.

Expected results: Scrape a list of image URLs, filtered by:

  1. media="(min-width: 740px)"
  2. type="image/jpeg"

Actual results: Current results include duplicate images (jpg/webp) and 3 copies of each image at different resolutions.

Sample URL: https://www.costco.com.au/ORAL-B/Oral-B-Smart-5000-Dual-Handle-Electric-Toothbrush/p/46168

Sample source snippet:

<source srcset="/medias/sys_master/images/h86/h15/52677239210014.jpg" media="(min-width: 740px)" type="image/jpeg" >
<source srcset="/medias/sys_master/images/hc8/h9f/79649061568542.webp" media="(min-width: 350px)" type="image/webp" >
<source srcset="/medias/sys_master/images/h9b/h2c/52677239013406.jpg" media="(min-width: 350px)" type="image/jpeg" >
<source srcset="/medias/sys_master/images/h3f/h5c/79649061830686.webp" media="(min-width: 160px)" type="image/webp" >
<source srcset="/medias/sys_master/images/h2f/hb3/52677239078942.jpg" media="(min-width: 160px)" type="image/jpeg" >
<source srcset="/medias/sys_master/images/h48/h69/79649060978718.webp" media="(min-width: 0px)" type="image/webp"

Segment of code that works:

import pandas as pd
import requests
import csv
from bs4 import BeautifulSoup

box = []

with open('Source_url_180322.csv') as csv_file:
    csv_reader = csv.reader(csv_file)

    for line in csv_reader:
        print(line)

        for i in line:
            
            r = requests.get(i)

...

            # containers that hold the product images
            imgs = soup.find_all('div', class_='image-zoom-container')

            l = []
            for item in imgs:
                for link in item.find_all(class_='ng-star-inserted'):
                    # srcset holds the relative image path
                    b = link.get('srcset')
                    l.append(b)

I've tried various combinations of the below to add the filters, without success. Any assistance you could provide would be much appreciated.

            divs = soup.find_all(media='(min-width: 740px)')

                            for div in divs:
                                 l.append(b)

Current output (this format is perfect once above filtering is added):

https://www.costco.com.au/medias/sys_master/images/he3/hbc/79648528498718.webp,https://www.costco.com.au/medias/sys_master/images/hbd/h1c/51660195430430.jpg,https://www.costco.com.au/medias/sys_master/images/h3e/h52/79648528891934.webp etc.

CodePudding user response:

To get a list of all the expected images from a page, one approach is a list comprehension combined with a CSS selector that matches both attributes. Be aware that each srcset value is a relative path, so you have to concatenate it with the base URL:

['https://www.costco.com.au' + s['srcset'] for s in soup.select('source[media="(min-width: 740px)"][type="image/jpeg"]')]

Or, as mentioned in the comments, in two steps:

# select all relevant source tags
imgl = soup.select('source[media="(min-width: 740px)"][type="image/jpeg"]')

# extract srcset from each source tag in imgl and concatenate it with the base URL
data = ['https://www.costco.com.au' + s['srcset'] for s in imgl]

Example

from bs4 import BeautifulSoup
import requests

url = 'https://www.costco.com.au/ORAL-B/Oral-B-Smart-5000-Dual-Handle-Electric-Toothbrush/p/46168'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

data = ['https://www.costco.com.au' + s['srcset'] for s in soup.select('source[media="(min-width: 740px)"][type="image/jpeg"]')]

print(data)

Output

['https://www.costco.com.au/medias/sys_master/images/h35/hce/17573589188638.jpg',
 'https://www.costco.com.au/medias/sys_master/images/h0b/ha2/17573590204446.jpg',
 'https://www.costco.com.au/medias/sys_master/images/h2a/h1b/17573589352478.jpg',
 'https://www.costco.com.au/medias/sys_master/images/h02/h79/17573590073374.jpg',
 'https://www.costco.com.au/medias/sys_master/images/hb6/h4d/17573589418014.jpg',
 'https://www.costco.com.au/medias/sys_master/images/h1d/h34/17573590138910.jpg']
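To check the selector without requesting the live page, the same filter can be run against part of the sample snippet from the question. This is only a minimal sketch: SAMPLE_HTML, BASE_URL, SELECTOR and filter_sources are names made up for the illustration, the <picture> wrapper is an assumption about the surrounding markup, and urllib.parse.urljoin is swapped in for plain string concatenation.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://www.costco.com.au'  # assumed base for the relative srcset paths

# First three <source> tags from the question's sample snippet,
# wrapped in a <picture> element (the wrapper is an assumption)
SAMPLE_HTML = '''
<picture>
<source srcset="/medias/sys_master/images/h86/h15/52677239210014.jpg" media="(min-width: 740px)" type="image/jpeg">
<source srcset="/medias/sys_master/images/hc8/h9f/79649061568542.webp" media="(min-width: 350px)" type="image/webp">
<source srcset="/medias/sys_master/images/h9b/h2c/52677239013406.jpg" media="(min-width: 350px)" type="image/jpeg">
</picture>
'''

# Both attribute conditions must match on the same <source> tag
SELECTOR = 'source[media="(min-width: 740px)"][type="image/jpeg"]'


def filter_sources(html, base_url=BASE_URL):
    """Return absolute URLs of the JPEG sources with media "(min-width: 740px)"."""
    soup = BeautifulSoup(html, 'html.parser')
    return [urljoin(base_url, s['srcset']) for s in soup.select(SELECTOR)]


print(filter_sources(SAMPLE_HTML))
```

Only the first tag satisfies both conditions, so the webp copy and the lower-resolution jpg are dropped.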

CodePudding user response:

Try this:

results = []
for img in imgs:
    for link in img.find_all(class_='ng-star-inserted'):
        source = link.get('srcset')
        media = link.get("media")
        # named mime_type to avoid shadowing the built-in type()
        mime_type = link.get("type")

        if source and media == "(min-width: 740px)" and mime_type == "image/jpeg":
            results.append("https://www.costco.com.au" + source)

print(results)
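One way this filter might be folded back into the CSV loop from the question is sketched below. It rests on several assumptions: image_urls, scrape_rows, BASE_URL and the fetch parameter are hypothetical names, the elided part of the original script is taken to build the soup with 'html.parser', and the sketch searches every source tag on the page rather than only those inside 'image-zoom-container'.

```python
import csv
import io

from bs4 import BeautifulSoup

BASE_URL = 'https://www.costco.com.au'  # assumed base for the relative srcset paths


def image_urls(html):
    """Return absolute URLs of the <source> tags that pass both filters."""
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    for source in soup.find_all('source'):
        srcset = source.get('srcset')
        if (srcset and source.get('media') == '(min-width: 740px)'
                and source.get('type') == 'image/jpeg'):
            urls.append(BASE_URL + srcset)
    return urls


def scrape_rows(csv_file, fetch):
    """Yield one comma-separated line of image URLs per page URL in the CSV.

    fetch is any callable that maps a page URL to its HTML text.
    """
    for line in csv.reader(csv_file):
        for page_url in line:
            yield ','.join(image_urls(fetch(page_url)))


# Offline demo: a fake fetch stands in for a real HTTP request
fake_page = (
    '<source srcset="/a.jpg" media="(min-width: 740px)" type="image/jpeg">'
    '<source srcset="/a.webp" media="(min-width: 740px)" type="image/webp">'
    '<source srcset="/a-small.jpg" media="(min-width: 350px)" type="image/jpeg">'
)
demo_csv = io.StringIO('https://example.com/p/1\n')
rows = list(scrape_rows(demo_csv, lambda url: fake_page))
print(rows)
```

For real pages, fetch could be something like lambda url: requests.get(url).text, with the open CSV file from the question passed in place of demo_csv.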