How to scrape part of the name of the class (BS4, Python)

Hello, I want to scrape part of the class name of each li element in a product list:

<li >

I would like to scrape 'product_cat-lozyska', but the part after the prefix varies across the many records, for example 'product_cat-uszczelnienia' (the 'product_cat-' prefix is always there). I want to scrape all of these product_cat- values.

Example of code with scraping other things from HTML:

import re

products = soup.find('ul', {'class': re.compile('^products')}).find_all('li')

for product in products:
    try:
        productName = product.find('span', {'class': 'sku'}).text
    except AttributeError:
        # find() returned None: this product has no span with class "sku"
        productName = 'none'

Could you help please?

CodePudding user response：

If you want to collect all the class names that start with product_cat-, you can do it with the following comprehension. It iterates over your products, takes each product's class attribute as a list, iterates over that list, and keeps only the names for which startswith() matches your prefix.

Note: the comprehension builds a set, so duplicate class names are dropped:

set(c for p in products for c in p.get('class', []) if c.startswith('product_cat-'))

To get only the category itself, without the prefix:

set(c.split('product_cat-')[-1] for p in products for c in p.get('class', []) if c.startswith('product_cat-'))

Example

import requests
from bs4 import BeautifulSoup

r = requests.get('https://specjal.com/sklep/')
soup = BeautifulSoup(r.content, 'html.parser')

products = soup.select('ul.products li')

set(c.split('product_cat-')[-1] for p in products for c in p.get('class', []) if c.startswith('product_cat-'))

Output:

{'lancuchy-i-kola-lancuchowe',
 'lozyska',
 'pasy-napedowe',
 'uncategorized',
 'uszczelnienia-hydrauliki-silowej'}
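The same idea can be tried offline and tied back to the per-product loop from the question. This is only a sketch: the HTML string, the SKU values, and the helper name categories_of are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modelled on the class names mentioned in the question.
HTML = """
<ul class="products columns-4">
  <li class="product type-product product_cat-lozyska"><span class="sku">A-100</span></li>
  <li class="product type-product product_cat-uszczelnienia"><span class="sku">B-200</span></li>
  <li class="product type-product"></li>
</ul>
"""

PREFIX = "product_cat-"


def categories_of(tag):
    """Return the category slugs found in one tag's class list (hypothetical helper)."""
    # get('class', []) avoids a TypeError for tags without a class attribute.
    return [c[len(PREFIX):] for c in tag.get("class", []) if c.startswith(PREFIX)]


soup = BeautifulSoup(HTML, "html.parser")

rows = []
for product in soup.select("ul.products li"):
    sku = product.find("span", {"class": "sku"})
    rows.append({
        "sku": sku.text if sku else "none",
        "categories": categories_of(product),
    })

for row in rows:
    print(row)
```

Keeping the categories next to each product, instead of in one global set, makes it possible to store them in the same record as the SKU.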

CodePudding user response：

You could use a lambda with the startswith() method. Beautiful Soup tests the filter against each class name of a tag. Note that this returns the matching li tags, not the class names themselves:

products = soup.find_all("li", {"class": lambda L: L and L.startswith('product_cat-')})

Alternatively, use a regular expression:

products = soup.find_all("li", {"class": re.compile('^product_cat-')})
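Since find_all returns tags, one more step is needed to get the class names out. A minimal sketch, assuming a small made-up HTML string and an anchored pattern with a capture group (neither comes from the original page):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
HTML = """
<ul class="products">
  <li class="product product_cat-lozyska">bearing</li>
  <li class="product product_cat-pasy-napedowe">belt</li>
  <li class="product">no category</li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")
pattern = re.compile(r"^product_cat-(.+)$")

# find_all returns the matching <li> tags; the pattern is tried
# against each class name of a tag.
matched = soup.find_all("li", class_=pattern)

# Second pass: pull the part after the prefix out of each class name.
slugs = set()
for li in matched:
    for token in li.get("class", []):
        m = pattern.match(token)
        if m:
            slugs.add(m.group(1))

print(sorted(slugs))
```

The third li has no product_cat- class, so it is not in matched and contributes nothing to slugs.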