Using BeautifulSoup to grab the second part of a URL and store that text in a variable

I have a list of URLs that all share the same first part. Each URL contains 'ingredient-disclosure', followed by the product category, separated by a /. I want to create a list that contains all the product categories.

So for the URL below, I want to grab the text 'commercial-professional' and add it to that list of product categories.

Here is one of the urls: https://churchdwight.com/ingredient-disclosure/commercial-professional/42000024-ah-trash-can-dumpster-deodorizer.aspx

Thank you for any help!

CodePudding user response：

You can split the URL on the "/" character and pick whatever you need from the resulting list:

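prod_cat_list = []
url = 'https://churchdwight.com/ingredient-disclosure/commercial-professional/42000024-ah-trash-can-dumpster-deodorizer.aspx'

# parts == ['https:', '', 'churchdwight.com', 'ingredient-disclosure',
#           'commercial-professional', '42000024-ah-trash-can-dumpster-deodorizer.aspx']
parts = url.split('/')
domain = parts[2]          # 'churchdwight.com'
prod_category = parts[4]   # 'commercial-professional'

prod_cat_list.append(prod_category)

print(prod_cat_list)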
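
Since you have a whole list of URLs, the same idea can be wrapped in a small helper. This is only a sketch: the function name product_category and its marker parameter are my own, and it assumes the category is always the path segment directly after 'ingredient-disclosure'. It uses urllib.parse.urlsplit so the position of the category does not depend on the scheme or domain.

```python
from urllib.parse import urlsplit


def product_category(url, marker="ingredient-disclosure"):
    """Return the path segment that follows `marker`, or None if there is none.

    Hypothetical helper; assumes the category comes directly after `marker`.
    """
    # Non-empty path segments, e.g. ['ingredient-disclosure', 'commercial-professional', '...aspx']
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if marker in segments:
        idx = segments.index(marker)
        if idx + 1 < len(segments):
            return segments[idx + 1]
    return None


urls = [
    "https://churchdwight.com/ingredient-disclosure/commercial-professional/42000024-ah-trash-can-dumpster-deodorizer.aspx",
]

# Skip URLs where no category could be found
prod_cat_list = [c for c in map(product_category, urls) if c is not None]
print(prod_cat_list)  # ['commercial-professional']
```

URLs that do not contain the marker, or that end right after it, return None and are left out of the list.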

CodePudding user response：

You might want to consider using a Python set to store the categories so you end up with one of each.

Try the following example that uses their index page to get possible links:

import requests
from bs4 import BeautifulSoup

url = "https://churchdwight.com/ingredient-disclosure/"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")

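categories = set()

for a_tag in soup.find_all("a", href=True):
    # Split the href into its non-empty path segments
    url_parts = [p for p in a_tag["href"].split('/') if p]

    # Keep only links that start with ingredient-disclosure and have
    # at least a category and a page after it
    if len(url_parts) > 2 and url_parts[0] == "ingredient-disclosure":
        categories.add(url_parts[1])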

print("\n".join(sorted(categories)))

This would give you the following categories:

Nausea-Relief
antiperspirant-deodorant
cleaning-products
commercial-professional
cough-allergy
dental-care
depilatories
fabric-softener-sheets
feminine-hygiene
hair-care
hand-sanitizer
hemorrhoid-relief
laundry-fabric-care
nasal-care
oral-care
pain-relief
pet-care
pool-products
sexual-health
skin-care
wound-care
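
One caveat: the loop above only matches hrefs that are written relative to the site root. If the page also contains absolute links (I have not checked whether it does), they would be skipped, because the first segment would be 'https:'. A possible way around that, sketched here without any network access, is to normalize each href with urllib.parse.urljoin first. The function name categories_from_hrefs and the sample hrefs marked as made up are assumptions for illustration only.

```python
from urllib.parse import urljoin, urlsplit

BASE = "https://churchdwight.com/ingredient-disclosure/"


def categories_from_hrefs(hrefs, base=BASE, marker="ingredient-disclosure"):
    """Collect the unique categories from relative or absolute hrefs.

    Hypothetical helper; applies the same rule as the loop above.
    """
    found = set()
    for href in hrefs:
        # Resolve relative links against the index page, then keep only the path
        path = urlsplit(urljoin(base, href)).path
        parts = [p for p in path.split("/") if p]
        if len(parts) > 2 and parts[0] == marker:
            found.add(parts[1])
    return found


sample_hrefs = [
    "/ingredient-disclosure/commercial-professional/42000024-ah-trash-can-dumpster-deodorizer.aspx",
    "https://churchdwight.com/ingredient-disclosure/pet-care/example-product.aspx",  # made-up link
    "/ingredient-disclosure/",
    "/contact.aspx",  # made-up link
]

print(sorted(categories_from_hrefs(sample_hrefs)))
# ['commercial-professional', 'pet-care']
```

With the real page you would pass in the hrefs collected by soup.find_all("a", href=True) instead of sample_hrefs.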