I'm working on a project where my goal is to create a scraper with bs4 that checks only the available sizes of each item.
Website of interest: https://www.6pm.com/p/gbg-los-angeles-ayvie-slate/product/9479982/color/642?zlfid=192&ref=pd_detail_1_sims_cv
I'm trying to extract only the available sizes, without showing the sizes that are not available.
What I have done:
size = soup.find('div', {"class": "zna-z Ana-z"}).text
print(size)
Returns: 5.56.577.588.5 (every size from 5.5 through 8.5, run together)
And when I try this one:
size = soup.find('div', {"class": "dqa-z"}).text
Returns: 5.5
My expected output is only the available sizes, like "6.5 7" (size 6.5 and size 7), because those are the ones in stock.
CodePudding user response:
You could filter the elements by the value of their aria-label attribute, using CSS selectors with the :not() pseudo-class:
[s.get('data-label') for s in soup.select('input[data-track-label="size"]:not([aria-label*="Out of Stock"])')]
I also recommend selecting elements by static information such as ids, HTML structure, or attributes, rather than by dynamic values such as class names.
Example
from bs4 import BeautifulSoup
html = '''<div ><div ><input type="radio" id="radio-3100-9479982" aria-label="Size 5.5 is Out of Stock" name="d3" value="3100" data-label="5.5" data-track-label="size"><label for="radio-3100-9479982">5.5</label></div><div ><input type="radio" id="radio-3102-9479982" aria-label="Size 6.5" name="d3" value="3102" data-label="6.5" data-track-label="size"><label for="radio-3102-9479982">6.5</label></div><div ><input type="radio" id="radio-3103-9479982" aria-label="Size 7" name="d3" value="3103" data-label="7" data-track-label="size"><label for="radio-3103-9479982">7</label></div><div ><input type="radio" id="radio-3104-9479982" aria-label="Size 7.5 is Out of Stock" name="d3" value="3104" data-label="7.5" data-track-label="size"><label for="radio-3104-9479982">7.5</label></div><div ><input type="radio" id="radio-3105-9479982" aria-label="Size 8 is Out of Stock" name="d3" value="3105" data-label="8" data-track-label="size"><label for="radio-3105-9479982">8</label></div><div ><input type="radio" id="radio-3106-9479982" aria-label="Size 8.5 is Out of Stock" name="d3" value="3106" data-label="8.5" data-track-label="size"><label for="radio-3106-9479982">8.5</label></div></div>'''
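# Name the parser explicitly to avoid bs4's "no parser was explicitly specified" warning
soup = BeautifulSoup(html, "html.parser")
# Keep size inputs whose aria-label does not contain "Out of Stock"
print([s.get('data-label') for s in soup.select('input[data-track-label="size"]:not([aria-label*="Out of Stock"])')])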
Output
['6.5', '7']
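If you also want to see which sizes are sold out, the same aria-label check can be done in plain Python instead of inside the selector. This is only a minimal sketch: the markup is a trimmed-down, hypothetical copy of the snippet above, and split_sizes is a helper name made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup modelled on the snippet above
html = '''
<div>
  <input type="radio" aria-label="Size 5.5 is Out of Stock" data-label="5.5" data-track-label="size">
  <input type="radio" aria-label="Size 6.5" data-label="6.5" data-track-label="size">
  <input type="radio" aria-label="Size 7" data-label="7" data-track-label="size">
</div>
'''


def split_sizes(markup):
    """Return (available, unavailable) size labels found in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    available, unavailable = [], []
    for tag in soup.select('input[data-track-label="size"]'):
        label = tag.get("aria-label", "")
        # Sold-out sizes carry "Out of Stock" in their aria-label
        target = unavailable if "Out of Stock" in label else available
        target.append(tag.get("data-label"))
    return available, unavailable


available, unavailable = split_sizes(html)
print(available)
print(unavailable)
```

Doing the check in Python is a bit more verbose than :not(), but it makes it easy to report both groups from a single pass over the inputs.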
CodePudding user response:
You can use the CSS :not() pseudo-class (documented on MDN), since all unavailable sizes have the eqa-z class.
html = """<div class="zna-z Ana-z"><div class="dqa-z eqa-z"><input type="radio" id="radio-3100-9479982" aria-label="Size 5.5 is Out of Stock" name="d3" value="3100" data-label="5.5" data-track-label="size"><label for="radio-3100-9479982">5.5</label></div><div class="dqa-z"><input type="radio" id="radio-3102-9479982" aria-label="Size 6.5" name="d3" value="3102" data-label="6.5" data-track-label="size"><label for="radio-3102-9479982">6.5</label></div><div class="dqa-z"><input type="radio" id="radio-3103-9479982" aria-label="Size 7" name="d3" value="3103" data-label="7" data-track-label="size"><label for="radio-3103-9479982">7</label></div><div class="dqa-z eqa-z"><input type="radio" id="radio-3104-9479982" aria-label="Size 7.5 is Out of Stock" name="d3" value="3104" data-label="7.5" data-track-label="size"><label for="radio-3104-9479982">7.5</label></div><div class="dqa-z eqa-z"><input type="radio" id="radio-3105-9479982" aria-label="Size 8 is Out of Stock" name="d3" value="3105" data-label="8" data-track-label="size"><label for="radio-3105-9479982">8</label></div><div class="dqa-z eqa-z"><input type="radio" id="radio-3106-9479982" aria-label="Size 8.5 is Out of Stock" name="d3" value="3106" data-label="8.5" data-track-label="size"><label for="radio-3106-9479982">8.5</label></div></div>"""
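from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
# Labels inside size wrappers (.dqa-z) that are not marked unavailable (.eqa-z)
sizes = [tag.text for tag in soup.select(".dqa-z:not(.eqa-z) label")]
print(sizes)
Output
['6.5', '7']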
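The same class-based filter can also be written without :not(), by checking each wrapper's class list in Python. This is a rough sketch that assumes the page marks every size wrapper with dqa-z and adds eqa-z to the sold-out ones, as described in the question and in this answer. The markup is hypothetical and trimmed down, and these generated class names may change whenever the site is rebuilt.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: class names are assumed from the question and this answer
html = """
<div class="zna-z Ana-z">
  <div class="dqa-z eqa-z"><label>5.5</label></div>
  <div class="dqa-z"><label>6.5</label></div>
  <div class="dqa-z"><label>7</label></div>
  <div class="dqa-z eqa-z"><label>7.5</label></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_="dqa-z" matches any div whose class list contains dqa-z;
# wrappers that also carry eqa-z (sold out) are then skipped
sizes = [
    div.label.get_text(strip=True)
    for div in soup.find_all("div", class_="dqa-z")
    if "eqa-z" not in div.get("class", [])
]
print(sizes)
```

With the markup above this should print only the sizes whose wrapper lacks eqa-z.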