I have a list of URLs that all share the same first part. Every URL contains 'ingredient-disclosure', followed by the product category, separated by a /. I want to create a list that contains all the product categories.
So for the given url, I want to grab the text 'commercial-professional' and store it in a list that contains all the product categories.
Here is one of the urls: https://churchdwight.com/ingredient-disclosure/commercial-professional/42000024-ah-trash-can-dumpster-deodorizer.aspx
Thank you for any help!
CodePudding user response:
You can split each URL on the "/" character and pick whatever you need from the resulting list:
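prod_cat_list = []
url = 'https://churchdwight.com/ingredient-disclosure/commercial-professional/42000024-ah-trash-can-dumpster-deodorizer.aspx'
# parts[0] is 'https:' and parts[1] is '' (from the '//'), so the domain is at index 2
parts = url.split('/')
domain = parts[2]         # 'churchdwight.com'
prod_category = parts[4]  # 'commercial-professional'
prod_cat_list.append(prod_category)
print(prod_cat_list)      # ['commercial-professional']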
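Since the question mentions a whole list of URLs, here is a minimal sketch of the same idea applied in a loop. It looks for the segment that follows 'ingredient-disclosure' instead of relying on a fixed index. The function name `extract_categories` and the second URL are made up for illustration; only the first URL comes from the question.

```python
from urllib.parse import urlparse


def extract_categories(urls, marker="ingredient-disclosure"):
    """Return the path segment following `marker` for each URL, without duplicates."""
    categories = []
    for url in urls:
        # Split only the path, dropping empty strings caused by leading/trailing slashes
        segments = [s for s in urlparse(url).path.split("/") if s]
        if marker not in segments:
            continue
        idx = segments.index(marker)
        # Skip URLs where nothing follows the marker (e.g. the index page itself)
        if idx + 1 < len(segments):
            category = segments[idx + 1]
            if category not in categories:
                categories.append(category)
    return categories


urls = [
    "https://churchdwight.com/ingredient-disclosure/commercial-professional/42000024-ah-trash-can-dumpster-deodorizer.aspx",
    # Hypothetical URL, made up for this example
    "https://churchdwight.com/ingredient-disclosure/pet-care/example-product.aspx",
    # Repeated on purpose to show that duplicates are skipped
    "https://churchdwight.com/ingredient-disclosure/commercial-professional/42000024-ah-trash-can-dumpster-deodorizer.aspx",
]

print(extract_categories(urls))  # ['commercial-professional', 'pet-care']
```

This keeps the categories in the order they first appear. If order does not matter, a set would work just as well.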
CodePudding user response:
You might want to consider using a Python set to store the categories so you end up with one of each.
Try the following example that uses their index page to get possible links:
import requests
from bs4 import BeautifulSoup

url = "https://churchdwight.com/ingredient-disclosure/"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")
categories = set()

for a_tag in soup.find_all("a", href=True):
    # Drop empty strings produced by leading, trailing, or doubled slashes
    url_parts = [p for p in a_tag["href"].split('/') if p]
    # Keep links of the form /ingredient-disclosure/<category>/<product page>
    if len(url_parts) > 2 and url_parts[0] == "ingredient-disclosure":
        categories.add(url_parts[1])

print("\n".join(sorted(categories)))
This would give you the following categories:
Nausea-Relief
antiperspirant-deodorant
cleaning-products
commercial-professional
cough-allergy
dental-care
depilatories
fabric-softener-sheets
feminine-hygiene
hair-care
hand-sanitizer
hemorrhoid-relief
laundry-fabric-care
nasal-care
oral-care
pain-relief
pet-care
pool-products
sexual-health
skin-care
wound-care
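Note that Nausea-Relief comes first in that output because Python's default string sort puts uppercase letters before lowercase ones. If you would rather have a case-insensitive order, something like this small sketch should do it (the set below is just a handful of the categories from the list above):

```python
# A few of the categories from the output above, used as sample data
categories = {
    "Nausea-Relief",
    "antiperspirant-deodorant",
    "cleaning-products",
    "nasal-care",
    "oral-care",
}

# key=str.lower compares the names as if they were all lowercase,
# while leaving the original spelling untouched
ordered = sorted(categories, key=str.lower)
print("\n".join(ordered))
```

With this, Nausea-Relief ends up between nasal-care and oral-care instead of at the top.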