How to scrape a substring from an element's class names?

Time:04-03

While scraping product info from a page, I need to extract part of each element's class name.

CodePudding user response:

To collect every class name that starts with product_cat-, you could use the following comprehension. It iterates over your products, reads each product's class attribute as a list, iterates over that list, and returns only the names for which startswith() matches your pattern, with the prefix stripped.

Note: the comprehension builds a set, so duplicate class names are dropped:

set(c.split('product_cat-')[-1] for p in products for c in p.get('class') if c.startswith('product_cat-'))
#output
{'pasy-napedowe', 'uszczelnienia-hydrauliki-silowej', 'lozyska', 'lancuchy-i-kola-lancuchowe', 'uncategorized'}
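
The same filtering can be tried without a parsed page. This is a minimal sketch, assuming each product's class attribute has already been read into a list of strings (the shape product.get('class') returns); the sample class lists and the names product_classes and PREFIX are made up for illustration:

```python
# Hypothetical class lists, shaped like the value product.get('class') returns
product_classes = [
    ['product', 'type-product', 'product_cat-lozyska', 'instock'],
    ['product', 'product_cat-pasy-napedowe'],
    ['product', 'product_cat-lozyska'],
    ['product'],  # no category class at all
]

PREFIX = 'product_cat-'

# Set comprehension: keep only prefixed names, strip the prefix, drop duplicates
categories = {
    c[len(PREFIX):]
    for classes in product_classes
    for c in classes
    if c.startswith(PREFIX)
}

print(sorted(categories))  # ['lozyska', 'pasy-napedowe']
```

Slicing by len(PREFIX) does the same job as split('product_cat-')[-1] here, since startswith() has already confirmed the prefix.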

To apply this approach in your process and get the category for each product, you could use next() to take the first matching class name:

cat = next(c.split('product_cat-')[-1] for c in product.get('class') if c.startswith('product_cat-'))
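
A possible variation: next() accepts a second argument that is returned when the generator yields nothing, so the missing-category case can be handled without try/except. The sketch below assumes plain lists of class names as input; first_category and PREFIX are hypothetical names, not part of the original answer:

```python
PREFIX = 'product_cat-'

def first_category(classes, default='none'):
    """Return the first class name after PREFIX, or default if none matches."""
    # `classes or []` covers an element with no class attribute (None)
    return next(
        (c[len(PREFIX):] for c in (classes or []) if c.startswith(PREFIX)),
        default,
    )

print(first_category(['product', 'product_cat-lozyska', 'instock']))  # lozyska
print(first_category(['product', 'instock']))                         # none
print(first_category(None))                                           # none
```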

Example

...
products = soup.find('ul', {'class':re.compile('^products')}).find_all('li')

data = []

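for product in products:
    # SKU text; fall back to 'none' when the span is missing
    try:
        productName = product.find('span',{'class':'sku'}).text
    except AttributeError:
        productName = 'none'

    # first class starting with 'product_cat-', with the prefix removed
    try:
        cat = next(c.split('product_cat-')[-1] for c in product.get('class') if c.startswith('product_cat-'))
    except (StopIteration, TypeError):
        cat = 'none'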

    data.append({
        'productName':productName,
        'cat':cat
    })
df = pd.DataFrame(data)
df.to_csv('products.csv', index=False)

Output

productName       cat
ZZ 901054 VAY     uszczelnienia-hydrauliki-silowej
ZZ 851005 VAY     uszczelnienia-hydrauliki-silowej
ZZ 80954 VAY      uszczelnienia-hydrauliki-silowej
ZZ 75904 VAY      uszczelnienia-hydrauliki-silowej
ZZ 70854 VAY      uszczelnienia-hydrauliki-silowej
ZZ 65805 VAY      uszczelnienia-hydrauliki-silowej
ZZ 65804 VAY      uszczelnienia-hydrauliki-silowej
ZZ 60755 VAY      uszczelnienia-hydrauliki-silowej
ZZ 55654 VAY      uszczelnienia-hydrauliki-silowej
ZZ 50604 VAY      uszczelnienia-hydrauliki-silowej
ZZ 45554 VAY      uszczelnienia-hydrauliki-silowej
ZZ 40504 VAY      uszczelnienia-hydrauliki-silowej
ZZ 35454 VAY      uszczelnienia-hydrauliki-silowej
ZZ 30404 VAY      uszczelnienia-hydrauliki-silowej
XPA 710 CT        pasy-napedowe
UCP 202 KBF       lozyska
U298/U291 SET9    lozyska

CodePudding user response:

You could use a lambda function that calls startswith(), like so:

products = soup.findAll("li", {"class" : lambda L: L and L.startswith('product_cat-')})

or alternatively use a regular expression, like the following:

products = soup.findAll("li", {"class" : re.compile('product_cat-.*')})
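
Both filters above select the matching li elements; the category substring still has to be pulled out of each element's class names. One way to do that with a regular expression is a capture group. This is a sketch under the assumption that the class names are available as a list of strings; CATEGORY_RE and category_from_classes are hypothetical names:

```python
import re

# Capture everything after the prefix within a single class name
CATEGORY_RE = re.compile(r'^product_cat-(.+)$')

def category_from_classes(classes):
    """Return the captured category from the first matching class, else None."""
    for name in classes or []:
        match = CATEGORY_RE.match(name)
        if match:
            return match.group(1)
    return None

print(category_from_classes(['product', 'product_cat-pasy-napedowe']))  # pasy-napedowe
print(category_from_classes(['product', 'instock']))                    # None
```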