Could anyone assist me with my code? I am trying to scrape products and prices from a patisserie website, but it only retrieves the products on the main page. The remaining products, which are organised into categories, use the same tags and attributes, yet when I run my code only the products on the main page appear. Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
cakes = []
url = "https://mrbakeregypt.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
articles = soup.find_all("div", class_="grid-view-item product-card")
for item in articles:
    product = item.find("div", class_="h4 grid-view-item__title product-card__title").text
    price_regular = item.find("div", class_="price__regular").text.strip().replace('\n', '')
    item_cost = {"name": product,
                 "cost": price_regular
                 }
    cakes.append(item_cost)
CodePudding user response:
As mentioned, you have to process all collections / categories, and one approach is to collect their links from your baseUrl - note that a set comprehension is used
to keep only unique URLs and avoid iterating the same category more than once:
urlList = list(set(baseUrl + a['href'] for a in soup.select('a[href*="collection"]')))
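If some hrefs on the page happen to be absolute already, `urllib.parse.urljoin` is a safer way to build the list than plain concatenation - a small sketch on made-up hrefs (the sample values are illustrative, not taken from the site):

```python
from urllib.parse import urljoin

baseUrl = "https://mrbakeregypt.com"

# hypothetical hrefs as they might appear in the page's anchor tags,
# including one absolute URL and one duplicate
hrefs = ["/collections/sandwiches", "/collections/cakes",
         "https://mrbakeregypt.com/collections/cakes", "/collections/sandwiches"]

# urljoin resolves relative paths against baseUrl and leaves absolute
# URLs untouched; the set still removes duplicates
urlList = sorted(set(urljoin(baseUrl, h) for h in hrefs))
print(urlList)
```

This keeps the deduplication behaviour of the set comprehension while tolerating mixed relative and absolute links.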
Now you can iterate this urlList to scrape your information:
...
for url in urlList:
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    articles = soup.find_all("div", class_="grid-view-item product-card")
...
Example
Take a look: it also handles the type / category of each product and both prices, so you can filter on these in your dataframe.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
baseUrl = "https://mrbakeregypt.com"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
r = requests.get(baseUrl, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
urlList = list(set(baseUrl + a['href'] for a in soup.select('a[href*="collection"]')))
data = []
for url in urlList:
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    articles = soup.find_all("div", class_="grid-view-item product-card")
    for item in articles:
        data.append({
            'name': item.a.text.strip(),
            'price_regular': item.find("div", class_="price__regular").dd.text.split()[-1].strip(),
            'price_sale': item.find("div", class_="price__sale").dd.text.split()[-1].strip(),
            'type': url.split('/')[-1],
            'url': baseUrl + item.a.get('href')
        })
    time.sleep(1)  # be polite between requests
df = pd.DataFrame(data)
Output
|   | name | price_regular | price_sale | type | url |
|---|---|---|---|---|---|
| 0 | Mini Sandwiches Mix - 20 Pieces Bread Basket | 402 | 402 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/mini-sandwiches-mix-bread-basket |
| 1 | Spiced Aubergine Mini Sandwiches - Box 2 Pieces | 35 | 35 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/spiced-aubergine-mini-sandwich |
| 2 | Tuna Mini Sandwiches - Box 2 Pieces | 49 | 49 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/tuna-mini-sandwich |
| 3 | Turkey Coleslaw Mini Sandwiches - Box 2 Pieces | 45 | 45 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/turkey-coleslaw-mini-sandwich |
| 4 | Roast Beef Mini Sandwiches - Box 2 Pieces | 49 | 49 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/roast-beef-mini-sandwich |
...
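Once the dataframe exists you can filter on the `type` and price columns as mentioned - a small sketch on made-up rows mirroring the structure above (the names and prices are illustrative; scraped prices arrive as strings and would first need `pd.to_numeric`):

```python
import pandas as pd

# illustrative rows with the same columns the scraper produces,
# with prices already converted to numbers for the sketch
data = [
    {'name': 'Tuna Mini Sandwiches', 'price_regular': 49, 'price_sale': 49, 'type': 'sandwiches'},
    {'name': 'Chocolate Cake', 'price_regular': 300, 'price_sale': 250, 'type': 'cakes'},
    {'name': 'Roast Beef Mini Sandwiches', 'price_regular': 49, 'price_sale': 49, 'type': 'sandwiches'},
]
df = pd.DataFrame(data)

# keep only one category
sandwiches = df[df['type'] == 'sandwiches']

# items currently discounted (sale price below regular price)
on_sale = df[df['price_sale'] < df['price_regular']]

print(len(sandwiches), len(on_sale))
```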