How to use bs4 to collect the size available only?

Time:01-10

I'm currently working on a project where my goal is to create a scraper that uses bs4 to check only the available sizes of each item.

Website of interest: https://www.6pm.com/p/gbg-los-angeles-ayvie-slate/product/9479982/color/642?zlfid=192&ref=pd_detail_1_sims_cv

I'm trying to extract only the available sizes, without showing the sizes that are not available.


What I have done:

size = soup.find('div', {"class": "zna-z Ana-z"}).text

print(size)

Returns: 5.56.577.588.5

And when I try this one:

size=soup.find('div', {"class": "dqa-z"}).text

Returns: 5.5

My expected result is only the available sizes, like "6.57" (size 6.5 and size 7), because those are the ones available.
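The fused string from the first attempt is expected behaviour: .text concatenates the text of every descendant with no separator. A minimal sketch, using a simplified stand-in for the size picker (the sample HTML and the id "sizes" are assumptions, not the site's actual markup), shows how get_text() with a separator or stripped_strings keeps the values apart. Note that this still lists every size; it does not filter by availability.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical stand-in for the size picker markup.
html = """
<div id="sizes">
  <div><label>5.5</label></div>
  <div><label>6.5</label></div>
  <div><label>7</label></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", id="sizes")

# .text glues all descendant strings together; whitespace is removed here
# only to make the fused result easy to compare with the question's output.
fused = "".join(container.text.split())

# get_text() accepts a separator and can strip each piece.
separated = container.get_text(separator=",", strip=True)

# stripped_strings yields each non-empty string on its own.
as_list = list(container.stripped_strings)

print(fused)      # 5.56.57
print(separated)  # 5.5,6.5,7
print(as_list)    # ['5.5', '6.5', '7']
```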

CodePudding user response:

You could filter the elements by the value of their aria-label attribute, using a CSS selector with the :not() pseudo-class:

[s.get('data-label') for s in soup.select('input[data-track-label="size"]:not([aria-label*="Out of Stock"])')]

I also recommend selecting elements by more static information such as id, HTML structure, or attributes, rather than by dynamic values like class names.
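To illustrate that recommendation, here is a small sketch. It assumes that class names such as dqa-z are auto-generated and may change between builds of the page (an assumption, not something confirmed by the site); the helper available_sizes and the two sample snippets are hypothetical. The attribute-based selector keeps working when the class names change, while a class-based one does not.

```python
from bs4 import BeautifulSoup

SELECTOR = 'input[data-track-label="size"]:not([aria-label*="Out of Stock"])'


def available_sizes(html):
    """Return data-label values of size inputs not marked Out of Stock."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get("data-label") for tag in soup.select(SELECTOR)]


# Two hypothetical builds of the same page: class names differ, attributes do not.
build_a = '''
<div class="dqa-z eqa-z"><input aria-label="Size 5.5 is Out of Stock" data-label="5.5" data-track-label="size"></div>
<div class="dqa-z"><input aria-label="Size 6.5" data-label="6.5" data-track-label="size"></div>
'''
build_b = build_a.replace("dqa-z", "xk1-q").replace("eqa-z", "p9t-q")

print(available_sizes(build_a))  # ['6.5']
print(available_sizes(build_b))  # ['6.5']

# A class-based selector only matches the build it was written for.
class_selector = ".dqa-z:not(.eqa-z) input"
matches_in_b = BeautifulSoup(build_b, "html.parser").select(class_selector)
print(len(matches_in_b))  # 0
```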

Example

from bs4 import BeautifulSoup
html = '''<div ><div ><input type="radio" id="radio-3100-9479982" aria-label="Size 5.5 is Out of Stock" name="d3" value="3100" data-label="5.5" data-track-label="size"><label for="radio-3100-9479982">5.5</label></div><div ><input type="radio" id="radio-3102-9479982" aria-label="Size 6.5" name="d3" value="3102" data-label="6.5" data-track-label="size"><label for="radio-3102-9479982">6.5</label></div><div ><input type="radio" id="radio-3103-9479982" aria-label="Size 7" name="d3" value="3103" data-label="7" data-track-label="size"><label for="radio-3103-9479982">7</label></div><div ><input type="radio" id="radio-3104-9479982" aria-label="Size 7.5 is Out of Stock" name="d3" value="3104" data-label="7.5" data-track-label="size"><label for="radio-3104-9479982">7.5</label></div><div ><input type="radio" id="radio-3105-9479982" aria-label="Size 8 is Out of Stock" name="d3" value="3105" data-label="8" data-track-label="size"><label for="radio-3105-9479982">8</label></div><div ><input type="radio" id="radio-3106-9479982" aria-label="Size 8.5 is Out of Stock" name="d3" value="3106" data-label="8.5" data-track-label="size"><label for="radio-3106-9479982">8.5</label></div></div>'''
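# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')

# Keep size inputs whose aria-label does not contain "Out of Stock"
print([s.get('data-label') for s in soup.select('input[data-track-label="size"]:not([aria-label*="Out of Stock"])')])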

Output

['6.5', '7']
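If the out-of-stock sizes are also of interest, the same attributes can be read in plain Python instead of being filtered in the selector. The following is only a sketch on a shortened, hypothetical copy of the markup above; the name availability is illustrative.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the size inputs from the example above.
html = '''
<input type="radio" aria-label="Size 5.5 is Out of Stock" data-label="5.5" data-track-label="size">
<input type="radio" aria-label="Size 6.5" data-label="6.5" data-track-label="size">
<input type="radio" aria-label="Size 7" data-label="7" data-track-label="size">
'''
soup = BeautifulSoup(html, "html.parser")

# Map each size to True (in stock) or False (out of stock).
availability = {}
for tag in soup.select('input[data-track-label="size"]'):
    label = tag.get("aria-label", "")
    availability[tag.get("data-label")] = "Out of Stock" not in label

in_stock = [size for size, ok in availability.items() if ok]

print(availability)  # {'5.5': False, '6.5': True, '7': True}
print(in_stock)      # ['6.5', '7']
```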

CodePudding user response:

You can use the CSS :not() pseudo-class (documented on MDN), since all unavailable sizes have the eqa-z class.

from bs4 import BeautifulSoup

# Class names follow those quoted in the question and this answer:
# dqa-z on every size wrapper, eqa-z added on the unavailable ones.
html = """<div><div class="dqa-z eqa-z"><input type="radio" id="radio-3100-9479982" aria-label="Size 5.5 is Out of Stock" name="d3" value="3100" data-label="5.5" data-track-label="size"><label for="radio-3100-9479982">5.5</label></div><div class="dqa-z"><input type="radio" id="radio-3102-9479982" aria-label="Size 6.5" name="d3" value="3102" data-label="6.5" data-track-label="size"><label for="radio-3102-9479982">6.5</label></div><div class="dqa-z"><input type="radio" id="radio-3103-9479982" aria-label="Size 7" name="d3" value="3103" data-label="7" data-track-label="size"><label for="radio-3103-9479982">7</label></div><div class="dqa-z eqa-z"><input type="radio" id="radio-3104-9479982" aria-label="Size 7.5 is Out of Stock" name="d3" value="3104" data-label="7.5" data-track-label="size"><label for="radio-3104-9479982">7.5</label></div><div class="dqa-z eqa-z"><input type="radio" id="radio-3105-9479982" aria-label="Size 8 is Out of Stock" name="d3" value="3105" data-label="8" data-track-label="size"><label for="radio-3105-9479982">8</label></div><div class="dqa-z eqa-z"><input type="radio" id="radio-3106-9479982" aria-label="Size 8.5 is Out of Stock" name="d3" value="3106" data-label="8.5" data-track-label="size"><label for="radio-3106-9479982">8.5</label></div></div>"""
soup = BeautifulSoup(html, "lxml")

# Labels inside size wrappers that lack the eqa-z (unavailable) class
sizes = [tag.text for tag in soup.select(".dqa-z:not(.eqa-z) label")]
print(sizes)  # ['6.5', '7']