How to add a list to another list in order to extract data through data scraping

Time:03-11

I am trying to scrape some data from a website. However, the data that I am interested in is stored on one-page landing pages, where the URL changes based on the company name.

I first created a loop where I scraped all company names from the "front page", and then assigned them to a list, url_list:

    url_list= []
    
    for page in range(1,76): # 94 is max; though I suspect you might get blocked by host
        req = requests.get("https://proteindirectory.com/alt-protein-database/?_protein_category=plant-based&_load_more=" + str(page), headers=headers)
        soup = BeautifulSoup(req.text, 'html.parser')
        for span in soup.find_all(id='span-1117-390'):
          url_list.append(span.text)
    
          url_list = [e.replace(" ", "-") for e in url_list]
          url_list = [a.replace("&", "") for a in url_list]

Afterwards, I tried to build a second list by inserting each company name from url_list into the target URL. However, I get an empty list, so something is wrong with my code:

companyList = []

def getCompanies(url_list):
    url= f'https://proteindirectory.com/company/[url_list]'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    
    company = soup.find_all('div', {'id': 'div_block-6-1850', 'class': 'ct-div-block small-text'})
    companyName = soup.find_all('section', {'class': ' ct-section', 'id': 'section-2-1850'})
    
    
    for item in company or companyName:
        companies = {
        'name': item.find('span', {'class': 'ct-span', 'id': 'span-11-1850'}).text,
        'primaryFocus': item.find('span', {'class': 'ct-span', 'id': 'span-1554-1850'}).text,
        'location': item.find('span', {'class': 'ct-span', 'id': 'span-41-1850'}).text,
        'founded': item.find('span', {'class': 'ct-span', 'id': 'span-1532-1850'}).text,
        'website': item.find('span', {'class': 'ct-span', 'id': 'span-61-1850'}).text,
        'businessModel': item.find('span', {'class': 'ct-span', 'id': 'span-44-1850'}).text,
        'proteinCategory': item.find('span', {'class': 'ct-span', 'id': 'span-1625-1850'}).text,
        'ingredients': item.find('span', {'class': 'ct-span', 'id': 'span-1664-1850'}).text,
        'endProductApplication': item.find('span', {'class': 'ct-span', 'id': 'span-1621-1850'}).text,
        }
        companyList.append(companies)
        
    return 
getCompanies(url_list)
print(companyList)

Hope someone can help a newbie out :-)

Answer:

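https://proteindirectory.com/company/[url_list] is not a valid address: square brackets inside an f-string are literal text, so the request goes to that exact string rather than to any company page. Also, instead of hard-coding a URL pattern from the names in the span-1117-390 elements you are pulling, you should take the actual href from the <a> tag.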
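To make the f-string point concrete, here is a small sketch that runs without any network access. The two slugs are made-up stand-ins for entries in url_list, and the URL pattern is simply the one assumed in the question; the site's real company URLs may differ, which is why reading the href is the safer route.

```python
# Hypothetical slugs, standing in for the cleaned company names in url_list.
url_list = ["New-Barn-Organics", "Plantstrong"]

# Square brackets are ordinary characters in an f-string: nothing is substituted.
literal = f'https://proteindirectory.com/company/[url_list]'
print(literal)

# Curly braces interpolate a value, and the loop yields one URL per company.
company_urls = [f'https://proteindirectory.com/company/{slug}' for slug in url_list]
print(company_urls)
```

The first print shows the URL exactly as typed, brackets included, which is the address the original getCompanies requested.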

Next, you're going to want to iterate through the list of urls in a for loop like you did with the pages. I only went through the first 2 pages, but try this:

Code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

url_list= []
for page in range(1,76): # 94 is max; though I suspect you might get blocked by host
    print(page)    
    req = requests.get("https://proteindirectory.com/alt-protein-database/?_protein_category=plant-based&_load_more=" + str(page), headers=headers)
    soup = BeautifulSoup(req.text, 'html.parser')
    for a in soup.find_all('a',id='div_block-7-390', href=True):
      url_list.append(a['href'])

              
          
companyList = []

def getCompanies(url):
    print(url)
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    
    company = soup.find_all('div', {'id': 'div_block-6-1850', 'class': 'ct-div-block small-text'})
       
    for item in company:
        try:
            name = soup.find('span', {'class': 'ct-span', 'id': 'span-11-1850'}).text
        except:
            name = 'N/A'
        try:
            primaryFocus = item.find('span', {'class': 'ct-span', 'id': 'span-1554-1850'}).text
        except:
            primaryFocus = 'N/A'
        try:
            location = item.find('span', {'class': 'ct-span', 'id': 'span-41-1850'}).text
        except:
            location = 'N/A'
        try:
            founded = item.find('span', {'class': 'ct-span', 'id': 'span-1532-1850'}).text
        except:
            founded = 'N/A'
        try:
            website = item.find('span', {'class': 'ct-span', 'id': 'span-61-1850'}).text
        except:
            website = 'N/A'
        try:
            businessModel = item.find('span', {'class': 'ct-span', 'id': 'span-44-1850'}).text
        except:
            businessModel = 'N/A'
        try:
            proteinCategory = item.find('span', {'class': 'ct-span', 'id': 'span-1625-1850'}).text
        except:
            proteinCategory = 'N/A'
        try:
            ingredients = item.find('span', {'class': 'ct-span', 'id': 'span-1664-1850'}).text
        except:
            ingredients = 'N/A'
        try:
            endProductApplication = item.find('span', {'class': 'ct-span', 'id': 'span-1621-1850'}).text
        except:
            endProductApplication = 'N/A'
        
        companies = {
            'name': name,
            'primaryFocus': primaryFocus,
            'location': location,
            'founded': founded,
            'website': website,
            'businessModel': businessModel,
            'proteinCategory': proteinCategory,
            'ingredients': ingredients,
            'endProductApplication': endProductApplication
            }
        
        companyList.append(companies)
        

for url in url_list:
    getCompanies(url)
print(companyList)
df = pd.DataFrame(companyList)

Output:

print(df.to_string())
                                             name                                  primaryFocus        location founded                  website                                           businessModel            proteinCategory                                 ingredients        endProductApplication
0                               New Barn Organics                            Food and beverages   United States    2015      newbarnorganics.com                          End-consumer brands & products                Plant-based                             Almond, Coconut                  Dairy, Milk
1                                     Plantstrong                            Food and beverages   United States    2007     plantstrongfoods.com                          End-consumer brands & products                Plant-based                                         N/A  Ready-to-eat meals & snacks
2                                    Pop & Bottle                            Food and beverages   United States    2015         popandbottle.com                          End-consumer brands & products                Plant-based                                         Oat                  Dairy, Milk
3                                         Friedas                            Food and beverages   United States    1962              friedas.com                          End-consumer brands & products                Plant-based                                         Soy         Meat & fish, Sausage
4                                 Creations Foods                            Food and beverages   United States    2019       creationsfoods.com                          End-consumer brands & products                Plant-based                                         N/A       Ice-cream and desserts
5                                Biocatalysts Ltd                            Food and beverages  United Kingdom    1986         biocatalysts.com                                    Ingredients & inputs  Fermentation, Plant-based                                         N/A                          N/A
6                                          Oterra                            Food and beverages         Denmark                       oterra.com                                    Ingredients & inputs                Plant-based                                         N/A                          N/A
7                                   Sydsel Africa                            Food and beverages           Kenya    2015         sydselafrica.com                                    Ingredients & inputs                Plant-based                 Mushroom, Soy, Wheat, Yeast                          N/A
8                                    PhycoSystems                            Food and beverages         Germany    2021          phycosystems.de                                    Ingredients & inputs                Plant-based                           Algae, Microalgae                          N/A
9                                     Meta Burger                            Food and beverages   United States    2018           metaburger.com                          End-consumer brands & products                Plant-based                                         N/A          Burger, Meat & fish
10                                        C-Merak                            Food and beverages          Canada    2018               c-merak.ca                                    Ingredients & inputs                Plant-based                                   Fava bean                          N/A
11                             New Protein Global                            Food and beverages          Canada             newproteinglobal.com                                    Ingredients & inputs                Plant-based                                         Soy                          N/A
12                                         Kagome                            Food and beverages   United States    1989            kagomeusa.com  Contract manufacturing, End-consumer brands & products                Plant-based                                   Sunflower                Oils and fats
13                                       GK Foods                            Food and beverages   United States    2020               gkfoods.co                                  Contract manufacturing                Plant-based                                         N/A                          N/A
14               Global Food and Ingredients Inc.                            Food and beverages          Canada    2018        gfiglobalfood.com                                    Ingredients & inputs                Plant-based               Beans, Chickpea, Lentils, Pea                          N/A
15                                       CP Kelco                            Food and beverages   United States    1929              cpkelco.com                                    Ingredients & inputs  Fermentation, Plant-based                                         N/A                          N/A
16                                       Greenest                            Food and beverages           India    2017        greenestfoods.com                          End-consumer brands & products                Plant-based                                         N/A                  Meat & fish
17                           Montana Pure Protein                            Food and beverages   United States    2020           montanapure.us                                    Ingredients & inputs                Plant-based                                      Pulses                          N/A
18                                      Alghética                            Food and beverages           Italy    2021            alghetica.com                                    Ingredients & inputs  Fermentation, Plant-based                                       Algae                          N/A
19                         Charoen Pokphand Foods  Animal feed and pet food, Food and beverages        Thailand                 cpfworldwide.com    End-consumer brands & products, Ingredients & inputs                Plant-based                                         N/A                  Meat & fish
20                         Dahmes Stainless, Inc.                            Food and beverages   United States    1994               dahmes.com                              Infrastructure & equipment                Plant-based                                         N/A                          N/A
21  Shandong Wonderful Industrial Group Co., Ltd.                            Food and beverages           China    2001         wandefugroup.com                                    Ingredients & inputs                Plant-based                                         Soy                          N/A
22                                    Benson Hill                            Food and beverages   United States    2012           bensonhill.com                                    Ingredients & inputs                Plant-based                                    Pea, Soy                          N/A
23                Brookside Flavors & Ingredients                            Food and beverages   United States    2015     brooksideflavors.com                                    Ingredients & inputs                Plant-based                                         N/A                          N/A
24                       Cereal Ingredients (CII)                            Food and beverages   United States    1984             ciifoods.com                                    Ingredients & inputs                Plant-based  Chickpea, Fava bean, Pea, Rice, Soy, Wheat                          N/A
25                 Yantai T.Full Biotech Co. Ltd.                            Food and beverages           China    2011             en.tfull.com                                    Ingredients & inputs                Plant-based         Chickpea, Fava bean, Mung Bean, Pea                          N/A
26                  Devigere biosolutions Pvt Ltd                            Food and beverages           India    2020  devigerebiosolutions.in                                    Ingredients & inputs                Plant-based                                      Pulses                          N/A
27                                     CHKP Foods                            Food and beverages          Israel    2019            chkpfoods.com                          End-consumer brands & products                Plant-based                                    Chickpea                Dairy, Yogurt
28                            Ingredient Alliance                            Food and beverages   United States    2017             linkedin.com                                    Ingredients & inputs                Plant-based                                         N/A                          N/A
29                  Yantai Shuangta Food co., LTD                            Food and beverages           China    1992         shuangtafood.com                                    Ingredients & inputs                Plant-based                               Mushroom, Pea                          N/A
30       Shandong Jianyuan Bioengineering Co.,Ltd                            Food and beverages           China    2003        jianyuangroup.com                                    Ingredients & inputs                Plant-based                                         Pea                          N/A
31                                      Harvest B                            Food and beverages       Australia    2020              harvestb.io                                    Ingredients & inputs                Plant-based                                         N/A                          N/A
32                                Ergo Bioscience                            Food and beverages       Argentina    2020            ergofoods.com                                    Ingredients & inputs    Cultivated, Plant-based                                     Carrots           Dairy, Meat & fish
33                                     Living Jin                            Food and beverages   United States    2016            livingjin.com    End-consumer brands & products, Ingredients & inputs                Plant-based                                        Agar                          N/A
34                                        Vitmark                            Food and beverages         Ukraine    1994          int.vitmark.com                          End-consumer brands & products                Plant-based                           Almond, Oat, Rice                  Dairy, Milk
35                                      Its Veego                            Food and beverages       Australia                     itsveego.com                          End-consumer brands & products                Plant-based                          Coconut, Hemp, Pea  Ready-to-eat meals & snacks
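
The repeated try/except blocks in getCompanies can be folded into a single lookup helper. The following is a minimal sketch rather than the answerer's code: span_text, FIELD_IDS, and parse_company are hypothetical names, while the span ids are copied from the answer. It relies only on find() returning None when nothing matches, which is how BeautifulSoup behaves.

```python
# Span ids copied from the answer's code. 'name' is handled separately
# because the answer looks it up on the whole page rather than on the item.
FIELD_IDS = {
    'primaryFocus': 'span-1554-1850',
    'location': 'span-41-1850',
    'founded': 'span-1532-1850',
    'website': 'span-61-1850',
    'businessModel': 'span-44-1850',
    'proteinCategory': 'span-1625-1850',
    'ingredients': 'span-1664-1850',
    'endProductApplication': 'span-1621-1850',
}


def span_text(parent, span_id, default='N/A'):
    """Return the text of the ct-span with the given id, or default if absent."""
    node = parent.find('span', {'class': 'ct-span', 'id': span_id})
    return node.text if node is not None else default


def parse_company(soup, item):
    """Build one row dict from the page (soup) and its details block (item)."""
    row = {'name': span_text(soup, 'span-11-1850')}
    for field, span_id in FIELD_IDS.items():
        row[field] = span_text(item, span_id)
    return row
```

With this in place, the loop body in getCompanies could shrink to companyList.append(parse_company(soup, item)). Checking for None explicitly also avoids the bare except, which would otherwise hide unrelated errors.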