While scraping product info from
CodePudding user response:
If you want to collect all the class names that start with product_cat-
you can do that with the following comprehension.
It iterates over your products, takes each product's class value
as a list, iterates over that list, and keeps only the names for which startswith()
matches your pattern.
Note: the comprehension
builds a set, so duplicate class names are dropped:
set(c.split('product_cat-')[-1] for p in products for c in p.get('class') if c.startswith('product_cat-'))
#output
{'pasy-napedowe', 'uszczelnienia-hydrauliki-silowej', 'lozyska', 'lancuchy-i-kola-lancuchowe', 'uncategorized'}
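To see the comprehension in isolation, here is a minimal sketch that runs on plain lists standing in for what p.get('class') returns for each li. The sample class lists and the helper name extract_categories are made up for illustration, and the `or []` guard for elements without a class attribute is an addition, not part of the line above.

```python
PREFIX = 'product_cat-'

def extract_categories(class_lists):
    """Return the unique category slugs found in the given class lists."""
    return {
        c.split(PREFIX)[-1]
        for classes in class_lists
        for c in (classes or [])   # tolerate None (element without a class attribute)
        if c.startswith(PREFIX)
    }

# Hypothetical stand-ins for p.get('class') on three <li> elements
sample = [
    ['product', 'type-product', 'product_cat-lozyska'],
    ['product', 'product_cat-pasy-napedowe', 'product_cat-lozyska'],
    None,
]

categories = extract_categories(sample)
print(sorted(categories))  # ['lozyska', 'pasy-napedowe']
```

Because the result is a set, the category that appears twice in the sample is reported once.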
To use this approach in your loop and get the category for each product, you could use next()
to take the first matching class name:
cat = next(c.split('product_cat-')[-1] for c in product.get('class') if c.startswith('product_cat-'))
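Note that next() raises StopIteration when no class name matches. As a possible alternative to catching that exception, next() accepts a default value as its second argument. The sketch below assumes plain lists in place of product.get('class'); the helper name first_category and the sample values are hypothetical.

```python
PREFIX = 'product_cat-'

def first_category(classes, default='none'):
    """Return the first category slug in classes, or default if none matches."""
    return next(
        (c.split(PREFIX)[-1] for c in (classes or []) if c.startswith(PREFIX)),
        default,  # returned instead of raising StopIteration
    )

print(first_category(['product', 'product_cat-lozyska']))  # lozyska
print(first_category(['product', 'type-product']))         # none
print(first_category(None))                                # none
```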
Example
...
products = soup.find('ul', {'class':re.compile('^products')}).find_all('li')
data = []
for product in products:
    try:
        productName = product.find('span',{'class':'sku'}).text
    except AttributeError:
        # find() returned None: no sku span in this product
        productName = 'none'
    try:
        cat = next(c.split('product_cat-')[-1] for c in product.get('class') if c.startswith('product_cat-'))
    except (StopIteration, TypeError):
        # no matching class name, or no class attribute at all
        cat = 'none'
    data.append({
        'productName':productName,
        'cat':cat
    })
df = pd.DataFrame(data)
df.to_csv('products.csv', index=False)
Output
productName | cat |
---|---|
ZZ 901054 VAY | uszczelnienia-hydrauliki-silowej |
ZZ 851005 VAY | uszczelnienia-hydrauliki-silowej |
ZZ 80954 VAY | uszczelnienia-hydrauliki-silowej |
ZZ 75904 VAY | uszczelnienia-hydrauliki-silowej |
ZZ 70854 VAY | uszczelnienia-hydrauliki-silowej |
ZZ 65805 VAY | uszczelnienia-hydrauliki-silowej |
ZZ 65804 VAY | uszczelnienia-hydrauliki-silowej |
ZZ 60755 VAY | uszczelnienia-hydrauliki-silowej |
ZZ 55654 VAY | uszczelnienia-hydrauliki-silowej |
ZZ 50604 VAY | uszczelnienia-hydrauliki-silowej |
ZZ 45554 VAY | uszczelnienia-hydrauliki-silowej |
ZZ 40504 VAY | uszczelnienia-hydrauliki-silowej |
ZZ 35454 VAY | uszczelnienia-hydrauliki-silowej |
ZZ 30404 VAY | uszczelnienia-hydrauliki-silowej |
XPA 710 CT | pasy-napedowe |
UCP 202 KBF | lozyska |
U298/U291 SET9 | lozyska |
CodePudding user response:
You could use a lambda function with startswith(), like so:
products = soup.findAll("li", {"class" : lambda L: L and L.startswith('product_cat-')})
or alternatively use a regular expression, like the following:
products = soup.findAll("li", {"class" : re.compile('product_cat-.*')})
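One detail worth checking with the regular-expression variant: Beautiful Soup tests a compiled pattern with its search() method, so an unanchored pattern can match the text anywhere inside a class name, and the trailing .* adds nothing. A small sketch with the standard re module, using made-up class names, shows the difference an anchor makes:

```python
import re

unanchored = re.compile('product_cat-.*')   # pattern from the line above
anchored = re.compile('^product_cat-')      # only matches at the start

# Hypothetical class names, for illustration only
classes = ['product_cat-lozyska', 'old-product_cat-lozyska', 'product']

unanchored_hits = [c for c in classes if unanchored.search(c)]
anchored_hits = [c for c in classes if anchored.search(c)]

print(unanchored_hits)  # ['product_cat-lozyska', 'old-product_cat-lozyska']
print(anchored_hits)    # ['product_cat-lozyska']
```

If class names that merely contain product_cat- could occur on the page, the anchored pattern is probably the safer choice.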