I am trying to scrape some data from a website. However, the data I am interested in is stored on one-page landing pages, where the URL changes based on the company name.
I first created a loop that scrapes all company names from the "front page" and collects them in a list, url_list:
url_list = []
for page in range(1, 76):  # 94 is max; though I suspect you might get blocked by host
    req = requests.get("https://proteindirectory.com/alt-protein-database/?_protein_category=plant-based&_load_more=" + str(page), headers=headers)
    soup = BeautifulSoup(req.text, 'html.parser')
    for span in soup.find_all(id='span-1117-390'):
        url_list.append(span.text)
url_list = [e.replace(" ", "-") for e in url_list]
url_list = [a.replace("&", "") for a in url_list]
Afterwards, I tried to build a second list by inserting each company name from url_list into the target URL. However, I get an empty list, so something is wrong with my code:
companyList = []
def getCompanies(url_list):
    url = f'https://proteindirectory.com/company/[url_list]'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    company = soup.find_all('div', {'id': 'div_block-6-1850', 'class': 'ct-div-block small-text'})
    companyName = soup.find_all('section', {'class': ' ct-section', 'id': 'section-2-1850'})
    for item in company or companyName:
        companies = {
            'name': item.find('span', {'class': 'ct-span', 'id': 'span-11-1850'}).text,
            'primaryFocus': item.find('span', {'class': 'ct-span', 'id': 'span-1554-1850'}).text,
            'location': item.find('span', {'class': 'ct-span', 'id': 'span-41-1850'}).text,
            'founded': item.find('span', {'class': 'ct-span', 'id': 'span-1532-1850'}).text,
            'website': item.find('span', {'class': 'ct-span', 'id': 'span-61-1850'}).text,
            'businessModel': item.find('span', {'class': 'ct-span', 'id': 'span-44-1850'}).text,
            'proteinCategory': item.find('span', {'class': 'ct-span', 'id': 'span-1625-1850'}).text,
            'ingredients': item.find('span', {'class': 'ct-span', 'id': 'span-1664-1850'}).text,
            'endProductApplication': item.find('span', {'class': 'ct-span', 'id': 'span-1621-1850'}).text,
        }
        companyList.append(companies)
    return
getCompanies(url_list)
print(companyList)
Hope someone can help a newbie out :-)
Answer:
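The URL you request, https://proteindirectory.com/company/[url_list], is not a valid site address: square brackets are not f-string placeholders, so the literal text [url_list] is sent instead of a company name. Also, you should go after the actual href in the <a> tag, as opposed to trying to hard-code the URL pattern from the span-1117-390 elements you are pulling.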
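To illustrate why the bracketed URL never reaches a company page, here is a minimal sketch of how f-strings treat brackets versus braces. The slugs in `names` are made up for the example, and the exact URL shape the site expects is an assumption, which is one more reason to prefer the real href values.

```python
BASE = "https://proteindirectory.com/company/"

# Hypothetical slugs, used only for illustration.
names = ["example-company", "another-company"]

# Square brackets are plain text inside an f-string: nothing is substituted.
literal = f"{BASE}[names]"

# Curly braces do interpolate, but a whole list is inserted as its repr.
whole_list = f"{BASE}{names}"

# One URL per element needs a loop (or a comprehension) over the list.
urls = [f"{BASE}{name}" for name in names]

print(literal)
print(whole_list)
print(urls)
```

Only the last form yields one request target per company, which is why the function has to be called once per URL rather than once with the whole list.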
Next, you're going to want to iterate through the list of URLs in a for loop, like you did with the pages. I only went through the first 2 pages, but try this:
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
url_list = []
for page in range(1, 76):  # 94 is max; though I suspect you might get blocked by host
    print(page)
    req = requests.get("https://proteindirectory.com/alt-protein-database/?_protein_category=plant-based&_load_more=" + str(page), headers=headers)
    soup = BeautifulSoup(req.text, 'html.parser')
    # Collect the real company links instead of rebuilding them from names
    for a in soup.find_all('a', id='div_block-7-390', href=True):
        url_list.append(a['href'])
companyList = []
def getCompanies(url):
    print(url)
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    company = soup.find_all('div', {'id': 'div_block-6-1850', 'class': 'ct-div-block small-text'})
    for item in company:
        # find() returns None for a missing span, so .text raises AttributeError
        try:
            # The name is looked up on the whole page, not inside the details block
            name = soup.find('span', {'class': 'ct-span', 'id': 'span-11-1850'}).text
        except AttributeError:
            name = 'N/A'
        try:
            primaryFocus = item.find('span', {'class': 'ct-span', 'id': 'span-1554-1850'}).text
        except AttributeError:
            primaryFocus = 'N/A'
        try:
            location = item.find('span', {'class': 'ct-span', 'id': 'span-41-1850'}).text
        except AttributeError:
            location = 'N/A'
        try:
            founded = item.find('span', {'class': 'ct-span', 'id': 'span-1532-1850'}).text
        except AttributeError:
            founded = 'N/A'
        try:
            website = item.find('span', {'class': 'ct-span', 'id': 'span-61-1850'}).text
        except AttributeError:
            website = 'N/A'
        try:
            businessModel = item.find('span', {'class': 'ct-span', 'id': 'span-44-1850'}).text
        except AttributeError:
            businessModel = 'N/A'
        try:
            proteinCategory = item.find('span', {'class': 'ct-span', 'id': 'span-1625-1850'}).text
        except AttributeError:
            proteinCategory = 'N/A'
        try:
            ingredients = item.find('span', {'class': 'ct-span', 'id': 'span-1664-1850'}).text
        except AttributeError:
            ingredients = 'N/A'
        try:
            endProductApplication = item.find('span', {'class': 'ct-span', 'id': 'span-1621-1850'}).text
        except AttributeError:
            endProductApplication = 'N/A'
        companies = {
            'name': name,
            'primaryFocus': primaryFocus,
            'location': location,
            'founded': founded,
            'website': website,
            'businessModel': businessModel,
            'proteinCategory': proteinCategory,
            'ingredients': ingredients,
            'endProductApplication': endProductApplication
        }
        companyList.append(companies)
# Call the function once per company URL
for url in url_list:
    getCompanies(url)
print(companyList)
df = pd.DataFrame(companyList)
Output:
print(df.to_string())
   | name | primaryFocus | location | founded | website | businessModel | proteinCategory | ingredients | endProductApplication
0 | New Barn Organics | Food and beverages | United States | 2015 | newbarnorganics.com | End-consumer brands & products | Plant-based | Almond, Coconut | Dairy, Milk
1 | Plantstrong | Food and beverages | United States | 2007 | plantstrongfoods.com | End-consumer brands & products | Plant-based | N/A | Ready-to-eat meals & snacks
2 | Pop & Bottle | Food and beverages | United States | 2015 | popandbottle.com | End-consumer brands & products | Plant-based | Oat | Dairy, Milk
3 | Friedas | Food and beverages | United States | 1962 | friedas.com | End-consumer brands & products | Plant-based | Soy | Meat & fish, Sausage
4 | Creations Foods | Food and beverages | United States | 2019 | creationsfoods.com | End-consumer brands & products | Plant-based | N/A | Ice-cream and desserts
5 | Biocatalysts Ltd | Food and beverages | United Kingdom | 1986 | biocatalysts.com | Ingredients & inputs | Fermentation, Plant-based | N/A | N/A
6 | Oterra | Food and beverages | Denmark |  | oterra.com | Ingredients & inputs | Plant-based | N/A | N/A
7 | Sydsel Africa | Food and beverages | Kenya | 2015 | sydselafrica.com | Ingredients & inputs | Plant-based | Mushroom, Soy, Wheat, Yeast | N/A
8 | PhycoSystems | Food and beverages | Germany | 2021 | phycosystems.de | Ingredients & inputs | Plant-based | Algae, Microalgae | N/A
9 | Meta Burger | Food and beverages | United States | 2018 | metaburger.com | End-consumer brands & products | Plant-based | N/A | Burger, Meat & fish
10 | C-Merak | Food and beverages | Canada | 2018 | c-merak.ca | Ingredients & inputs | Plant-based | Fava bean | N/A
11 | New Protein Global | Food and beverages | Canada |  | newproteinglobal.com | Ingredients & inputs | Plant-based | Soy | N/A
12 | Kagome | Food and beverages | United States | 1989 | kagomeusa.com | Contract manufacturing, End-consumer brands & products | Plant-based | Sunflower | Oils and fats
13 | GK Foods | Food and beverages | United States | 2020 | gkfoods.co | Contract manufacturing | Plant-based | N/A | N/A
14 | Global Food and Ingredients Inc. | Food and beverages | Canada | 2018 | gfiglobalfood.com | Ingredients & inputs | Plant-based | Beans, Chickpea, Lentils, Pea | N/A
15 | CP Kelco | Food and beverages | United States | 1929 | cpkelco.com | Ingredients & inputs | Fermentation, Plant-based | N/A | N/A
16 | Greenest | Food and beverages | India | 2017 | greenestfoods.com | End-consumer brands & products | Plant-based | N/A | Meat & fish
17 | Montana Pure Protein | Food and beverages | United States | 2020 | montanapure.us | Ingredients & inputs | Plant-based | Pulses | N/A
18 | Alghética | Food and beverages | Italy | 2021 | alghetica.com | Ingredients & inputs | Fermentation, Plant-based | Algae | N/A
19 | Charoen Pokphand Foods | Animal feed and pet food, Food and beverages | Thailand |  | cpfworldwide.com | End-consumer brands & products, Ingredients & inputs | Plant-based | N/A | Meat & fish
20 | Dahmes Stainless, Inc. | Food and beverages | United States | 1994 | dahmes.com | Infrastructure & equipment | Plant-based | N/A | N/A
21 | Shandong Wonderful Industrial Group Co., Ltd. | Food and beverages | China | 2001 | wandefugroup.com | Ingredients & inputs | Plant-based | Soy | N/A
22 | Benson Hill | Food and beverages | United States | 2012 | bensonhill.com | Ingredients & inputs | Plant-based | Pea, Soy | N/A
23 | Brookside Flavors & Ingredients | Food and beverages | United States | 2015 | brooksideflavors.com | Ingredients & inputs | Plant-based | N/A | N/A
24 | Cereal Ingredients (CII) | Food and beverages | United States | 1984 | ciifoods.com | Ingredients & inputs | Plant-based | Chickpea, Fava bean, Pea, Rice, Soy, Wheat | N/A
25 | Yantai T.Full Biotech Co. Ltd. | Food and beverages | China | 2011 | en.tfull.com | Ingredients & inputs | Plant-based | Chickpea, Fava bean, Mung Bean, Pea | N/A
26 | Devigere biosolutions Pvt Ltd | Food and beverages | India | 2020 | devigerebiosolutions.in | Ingredients & inputs | Plant-based | Pulses | N/A
27 | CHKP Foods | Food and beverages | Israel | 2019 | chkpfoods.com | End-consumer brands & products | Plant-based | Chickpea | Dairy, Yogurt
28 | Ingredient Alliance | Food and beverages | United States | 2017 | linkedin.com | Ingredients & inputs | Plant-based | N/A | N/A
29 | Yantai Shuangta Food co., LTD | Food and beverages | China | 1992 | shuangtafood.com | Ingredients & inputs | Plant-based | Mushroom, Pea | N/A
30 | Shandong Jianyuan Bioengineering Co.,Ltd | Food and beverages | China | 2003 | jianyuangroup.com | Ingredients & inputs | Plant-based | Pea | N/A
31 | Harvest B | Food and beverages | Australia | 2020 | harvestb.io | Ingredients & inputs | Plant-based | N/A | N/A
32 | Ergo Bioscience | Food and beverages | Argentina | 2020 | ergofoods.com | Ingredients & inputs | Cultivated, Plant-based | Carrots | Dairy, Meat & fish
33 | Living Jin | Food and beverages | United States | 2016 | livingjin.com | End-consumer brands & products, Ingredients & inputs | Plant-based | Agar | N/A
34 | Vitmark | Food and beverages | Ukraine | 1994 | int.vitmark.com | End-consumer brands & products | Plant-based | Almond, Oat, Rice | Dairy, Milk
35 | Its Veego | Food and beverages | Australia |  | itsveego.com | End-consumer brands & products | Plant-based | Coconut, Hemp, Pea | Ready-to-eat meals & snacks
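The repeated try/except blocks in getCompanies could be folded into one small helper. The following is only a sketch of that idea, not part of the original answer: `FIELD_IDS`, `text_or_default` and `extract_fields` are hypothetical names, the span ids are copied from the code above, and looking spans up by id alone (without the ct-span class) assumes the ids are unique on the page. It also strips surrounding whitespace, which the original `.text` calls do not.

```python
# Field name -> span id, copied from the answer's code (name is handled separately).
FIELD_IDS = {
    "primaryFocus": "span-1554-1850",
    "location": "span-41-1850",
    "founded": "span-1532-1850",
    "website": "span-61-1850",
    "businessModel": "span-44-1850",
    "proteinCategory": "span-1625-1850",
    "ingredients": "span-1664-1850",
    "endProductApplication": "span-1621-1850",
}


def text_or_default(tag, default="N/A"):
    """Return the stripped text of a tag, or `default` when find() gave None."""
    return tag.get_text(strip=True) if tag is not None else default


def extract_fields(item, field_ids=FIELD_IDS):
    """Build one record from `item`, e.g. a BeautifulSoup Tag with a find() method."""
    return {
        field: text_or_default(item.find("span", id=span_id))
        for field, span_id in field_ids.items()
    }
```

Inside the loop, the record could then be built with `companies = extract_fields(item)`, adding the name via `text_or_default(soup.find('span', id='span-11-1850'))`, since the answer looks the name up on the whole page rather than inside `item`.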