Could anyone assist me with my code? I am trying to scrape products and prices from a patisserie website, but it only retrieves the products on the main page. The remaining products, which are organised into categories, use the same tags and attributes, yet when I run my code only the products on the main page appear. Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
cakes = []
url = "https://mrbakeregypt.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
articles = soup.find_all("div", class_="grid-view-item product-card")
for item in articles:
    product = item.find("div", class_="h4 grid-view-item__title product-card__title").text
    price_regular = item.find("div", class_="price__regular").text.strip().replace('\n', '')
    item_cost = {"name": product,
                 "cost": price_regular
                 }
    cakes.append(item_cost)
CodePudding user response:
As mentioned, you have to process all collections / categories, and one approach is to collect their links from your baseUrl - note that a set comprehension is used
to keep only unique URLs and avoid iterating the same category more than once:
urlList = list(set(baseUrl + a['href'] for a in soup.select('a[href*="collection"]')))
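If some hrefs on the page happen to be absolute already, `urllib.parse.urljoin` is a safer way to build the list than plain concatenation - a small sketch on made-up hrefs (the sample values are illustrative, not taken from the site):

```python
from urllib.parse import urljoin

baseUrl = "https://mrbakeregypt.com"

# hypothetical hrefs as they might appear in the page's anchor tags,
# including one absolute URL and one duplicate
hrefs = ["/collections/sandwiches", "/collections/cakes",
         "https://mrbakeregypt.com/collections/cakes", "/collections/sandwiches"]

# urljoin resolves relative paths against baseUrl and leaves absolute
# URLs untouched; the set still removes duplicates
urlList = sorted(set(urljoin(baseUrl, h) for h in hrefs))
print(urlList)
```

This keeps the deduplication behaviour of the set comprehension while tolerating mixed relative and absolute links.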
Now you can iterate this urlList to scrape your information:
...
for url in urlList:
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    articles = soup.find_all("div", class_="grid-view-item product-card")
...
Example
Take a look: it also handles the type / category of each product and both prices, so you can filter on these in your dataframe.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
baseUrl = "https://mrbakeregypt.com"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
r = requests.get(baseUrl, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
urlList = list(set(baseUrl + a['href'] for a in soup.select('a[href*="collection"]')))
data = []
for url in urlList:
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    articles = soup.find_all("div", class_="grid-view-item product-card")
    for item in articles:
        data.append({
            'name': item.a.text.strip(),
            'price_regular': item.find("div", class_="price__regular").dd.text.split()[-1].strip(),
            'price_sale': item.find("div", class_="price__sale").dd.text.split()[-1].strip(),
            'type': url.split('/')[-1],
            'url': baseUrl + item.a.get('href')
        })
    time.sleep(1)  # be polite between requests
df = pd.DataFrame(data)
Output
|   | name | price_regular | price_sale | type | url |
|---|---|---|---|---|---|
| 0 | Mini Sandwiches Mix - 20 Pieces Bread Basket | 402 | 402 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/mini-sandwiches-mix-bread-basket |
| 1 | Spiced Aubergine Mini Sandwiches - Box 2 Pieces | 35 | 35 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/spiced-aubergine-mini-sandwich |
| 2 | Tuna Mini Sandwiches - Box 2 Pieces | 49 | 49 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/tuna-mini-sandwich |
| 3 | Turkey Coleslaw Mini Sandwiches - Box 2 Pieces | 45 | 45 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/turkey-coleslaw-mini-sandwich |
| 4 | Roast Beef Mini Sandwiches - Box 2 Pieces | 49 | 49 | sandwiches | https://mrbakeregypt.com/collections/sandwiches/products/roast-beef-mini-sandwich |
...
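Once the dataframe exists you can filter on the `type` and price columns as mentioned - a small sketch on made-up rows mirroring the structure above (the names and prices are illustrative; scraped prices arrive as strings and would first need `pd.to_numeric`):

```python
import pandas as pd

# illustrative rows with the same columns the scraper produces,
# with prices already converted to numbers for the sketch
data = [
    {'name': 'Tuna Mini Sandwiches', 'price_regular': 49, 'price_sale': 49, 'type': 'sandwiches'},
    {'name': 'Chocolate Cake', 'price_regular': 300, 'price_sale': 250, 'type': 'cakes'},
    {'name': 'Roast Beef Mini Sandwiches', 'price_regular': 49, 'price_sale': 49, 'type': 'sandwiches'},
]
df = pd.DataFrame(data)

# keep only one category
sandwiches = df[df['type'] == 'sandwiches']

# items currently discounted (sale price below regular price)
on_sale = df[df['price_sale'] < df['price_regular']]

print(len(sandwiches), len(on_sale))
```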