Extract keys from a messy website with Beautiful Soup

Time:03-10

I'm new to web scraping with Beautiful Soup and I have run into a problem.

Here is my code:

from random import randint
from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver


page = "https://www.acheteralasource.com/producteurs-en-france/all/departement/75/page/1"
driver = webdriver.Chrome()
driver.get(page)
sleep(randint(2, 10))  # random pause to avoid being blocked by IP
soup = BeautifulSoup(driver.page_source, 'html.parser')
# Match any element that carries one of these classes
my_table = soup.find_all(class_=['companyName', 'presentation',
                                 'addressCity', 'addressPostalCode'])

I want to extract several pieces of information, stored in the elements with the classes listed in the code, but when I print my_table it returns an empty list.

Unfortunately, there is no API available for this website.

Any help?

CodePudding user response:

There is no API, but the data is embedded in a <script> tag as JSON:
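To illustrate the idea on a small scale, here is a minimal sketch that pulls a JSON object out of a script body using only the standard library. The sample page, the function name extract_embedded_json and its marker parameter are made up for this illustration; only the __APOLLO_STATE__ name comes from the real page.

```python
import json
import re

# Made-up page for illustration; the real page is much larger.
SAMPLE_HTML = """
<html><body>
<script>window.__APOLLO_STATE__={"Producer:1": {"id": 1, "companyName": "Demo {Farm}"}};</script>
</body></html>
"""


def extract_embedded_json(html, marker="__APOLLO_STATE__"):
    """Return the JSON value assigned to `marker` in the page, or None."""
    match = re.search(re.escape(marker) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index
    # and ignores whatever follows it (here: ";</script>...").
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


state = extract_embedded_json(SAMPLE_HTML)
print(state["Producer:1"]["companyName"])
```

Letting the JSON decoder find the end of the object avoids relying on a greedy regular expression or on the position of the script in the page, both of which can break when the site changes.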

Code:

import json
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.acheteralasource.com/producteurs-en-france/all/departement/75/page/1"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

r = requests.get(url, headers=headers)
page_html = r.text
soup = BeautifulSoup(page_html, 'html.parser')

scripts = soup.find_all('script')

# The fifth <script> held the data when this was written; the index may change
script = str(scripts[4])

# Capture the JSON object assigned to __APOLLO_STATE__
jsonStr = re.match(r'.*__APOLLO_STATE__=({.*})', script).group(1)
jsonData = json.loads(jsonStr)

results = []
for k, v in jsonData.items():
    # Keep only the entries that describe a producer
    if 'Producer' in k:
        # These two fields keep their payload under a 'json' key; unwrap them
        categories = v.pop('categories')['json']
        addressCoordinates = v.pop('addressCoordinates')['json']

        alpha = v.copy()
        alpha.update({'categories': categories,
                      'addressCoordinates': addressCoordinates})

        results.append(alpha)

df = pd.DataFrame(results)
# The values show each pair is (longitude, latitude): Paris lies near 48.86 N, 2.35 E
df['longitude'], df['latitude'] = zip(*list(df['addressCoordinates'].values))

Output:

print(df)
        id                                 companyName  ... longitude   latitude
0   168704                                   La SAUGE   ...  2.376545  48.880406
1   168702                                   La SAUGE   ...  2.376545  48.880406
2   164858                                cfraisclivre  ...  2.256519  48.822534
3   144464                   Le Rucher de Sainte Aulde  ...  2.434072  48.871433
4   169286                           Cultures en ville  ...  2.269878  48.830487
5   136834                                Bell'Abeille  ...  2.349333  48.883017
6   238911                    Les Ruchers de Montreuil  ...  2.442508  48.862581
7   238678                                   Miel OTT   ...  2.357560  48.870498
8   168791  BienElevées - Maison d'agriculture urbaine  ...  2.325042  48.858802
9   114454                             Dobreiu Nicolae  ...  2.345627  48.888822
10  169233                 Famille Herbelin Apiculture  ...  2.306824  48.822076
11  169495                                   APIS CIVI  ...  2.352293  48.887543
12  169394                  Association Les Ruches POP  ...  2.371658  48.879516
13   38919                                  Télé Sapin  ...  2.341048  48.862797
14   28430                         La Ferme Parisienne  ...  2.354042  48.887165
15   28428   I.T.A.V.I (Institut Technique Aviculture)  ...  2.322347  48.876362
16   18815              Maryse Gaitelli Duc de Brabant  ...  2.330939  48.897045
17   18810                     LES VIGNERONS DE CARNAS  ...  2.351173  48.835880
18   18808                    Les Domaines Qui Montent  ...  2.303042  48.881824
19   18807                                Le Nez Rouge  ...  2.302591  48.847828
20   18806           La Maison du Vin et des Vignobles  ...  2.293603  48.886036
21   18803                        Jambon-Chanrion Paul  ...  2.321920  48.859814
22   18798                  Domaine Les Roques De Cana  ...  2.468187  48.831852
23   18797                Domaine Clarence Dillon (SA)  ...  2.300900  48.870148
24   18795                             Château Margaux  ...  2.303760  48.865993
25   18792                    Champagne Louis Roederer  ...  2.322103  48.871593
26   18788                                     Bristol  ...  2.289587  48.871471
27   18787                                Borie-Manoux  ...  2.294968  48.880180
28   18785                             BOCQUILLON (SA)  ...  2.303881  48.885979
29   18780                    Vignerons de Paris (Les)  ...  2.386668  48.855598
30   18779                  Versein et Minvielle (Sté)  ...  2.336202  48.867680
31   18778                                         V 3  ...  2.339331  48.856140
32   18777                               Travers Marie  ...  2.321962  48.888638
33   18776                      Tour des Chênes (SARL)  ...  2.340296  48.839760
34   18775                        Société Des Domaines  ...  2.345637  48.838497
35   18772                                      RDVINS  ...  2.375235  48.857235
36   18771                            Quié Jean-Michel  ...  2.407137  48.825703
37   18770                           Pavillon des Vins  ...  2.392453  48.826790
38    5742                             Fromageries Bel  ...  2.320089  48.871593
39    4994                                Matines (SA)  ...  2.414774  48.867092
40    2744                          Damolini Bonduelle  ...  2.292681  48.894718
41     951                                     Kanabou  ...  2.241684  48.832230
42  154464                               Nuage Sauvage  ...  2.385420  48.870101
43   76256                                LES ABEILLES  ...  2.349466  48.827641
44   76253                         L'Abeille de France  ...  2.320717  48.880245
45   76247                                     Au Miel  ...  2.317932  48.879715
46   76244             Un apiculteur pres de chez vous  ...  2.406641  48.859982
47   76231                              Maison du Miel  ...  2.326399  48.871696

[48 rows x 11 columns]
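The coordinate pairs in the output above have a value near 2.3 first and a value near 48.8 second. Paris sits at roughly 48.86° N, 2.35° E, so the pairs look like (longitude, latitude). Below is a minimal sketch for splitting such pairs with a sanity check, assuming that order holds for every record. The function name split_lon_lat and the rough bounding box for metropolitan France are assumptions made for this illustration, not something the site documents.

```python
# Rough bounding box for metropolitan France (an assumption for this sketch).
LON_RANGE = (-6.0, 10.0)
LAT_RANGE = (41.0, 52.0)


def split_lon_lat(pair):
    """Split a [longitude, latitude] pair into a dict with named keys."""
    lon, lat = pair
    lon_ok = LON_RANGE[0] <= lon <= LON_RANGE[1]
    lat_ok = LAT_RANGE[0] <= lat <= LAT_RANGE[1]
    if not (lon_ok and lat_ok):
        # Most likely the pair is in the other order or outside France
        raise ValueError(f"unexpected coordinate pair: {pair!r}")
    return {"longitude": lon, "latitude": lat}


# First two coordinate pairs from the output above
pairs = [[2.376545, 48.880406], [2.256519, 48.822534]]
rows = [split_lon_lat(p) for p in pairs]
print(rows[0])
```

A check like this fails loudly if the order of the values is ever reversed, instead of silently producing swapped columns.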