Why does my Python "for" loop only work on the last table when I want to scrape tables from

Time:07-27

I'm having trouble combining the tables from a list of URLs into one large DataFrame that holds the rows from every URL. My code runs without errors, but when I export to CSV the file only contains the last 10 rows, from the last URL, instead of the rows from each URL. Does anyone know why?

PS: I searched Stack Overflow for an answer but did not find one.

import pandas as pd
from bs4 import BeautifulSoup
import requests
#Pandas/numpy for data manipulation
import numpy as np

# URL 0 - 10 SCRAPE


BASE_URL = [
    'https://datan.fr/groupes/legislature-16/re',
    'https://datan.fr/groupes/legislature-16/rn',
    'https://datan.fr/groupes/legislature-16/lfi-nupes',
    'https://datan.fr/groupes/legislature-16/lr',
    'https://datan.fr/groupes/legislature-16/dem',
    'https://datan.fr/groupes/legislature-16/soc',
    'https://datan.fr/groupes/legislature-16/hor',
    'https://datan.fr/groupes/legislature-16/ecolo',
    'https://datan.fr/groupes/legislature-16/gdr-nupes',
    'https://datan.fr/groupes/legislature-16/liot',
]

Tous_les_groupes = []
b=0
for b in BASE_URL:

    html = requests.get(b).text
    soup = BeautifulSoup(html, "html.parser")
    #identify table we want to scrape
    Tableau_groupe = soup.find('table', {"class" : "table"})
    print(Tableau_groupe)


try:

    for row in Tableau_groupe.find_all('tr'):
        cols = row.find_all('td')
        print(cols)

        if len(cols) == 4:
            Tous_les_groupes.append((b, cols[0].text.strip(), cols[1].text.strip(), cols[2].text.strip(), cols[3].text.strip()))
            #print(Tous_les_groupes)
except:
    pass
Groupes_DF = np.asarray(Tous_les_groupes)
#print(Groupes_DF)
#print(len(Groupes_DF))

df = pd.DataFrame(Groupes_DF)
df.columns = ['url','G', 'Tx', 'note ','Number']
#print(df.head(10))

df.to_csv('output.csv')

Thanks for your help, and all have a great day.

CodePudding user response:

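In the first loop you assign the result of soup.find to Tableau_groupe, but each iteration overwrites the previous value, so only the table from the last URL is left when the loop ends. The second loop, which extracts the rows, runs once after the first loop has finished, so it only ever sees that last table.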
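
The same effect can be reproduced without any network access. The following is a minimal sketch, not the original code: the names `pages`, `current`, `collected_after` and `collected_nested` are made up for illustration, and plain lists stand in for the scraped tables.

```python
# Hypothetical stand-ins for the scraped tables: one list of rows per URL.
pages = {
    "url_a": ["a1", "a2"],
    "url_b": ["b1", "b2"],
    "url_c": ["c1", "c2"],
}

# Pattern from the question: the loop body only assigns,
# and the rows are processed after the loop has ended.
for url in pages:
    current = pages[url]          # overwritten on every iteration

collected_after = []
for row in current:               # 'current' is only the last page here
    collected_after.append((url, row))

# Nested pattern: the row loop runs once per page.
collected_nested = []
for url in pages:
    current = pages[url]
    for row in current:
        collected_nested.append((url, row))

print(len(collected_after))       # rows from the last page only
print(len(collected_nested))      # rows from every page
```

In the first pattern both `url` and `current` still hold their values from the final iteration, which is why only the last URL shows up in the output.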

Try moving the second for loop inside the first one, so that the rows of each table are collected before the next URL is fetched:

for b in BASE_URL:

    html = requests.get(b).text
    soup = BeautifulSoup(html, "html.parser")
    # identify the table we want to scrape on this page
    Tableau_groupe = soup.find('table', {"class" : "table"})
    print(Tableau_groupe)

    try:
        # still inside the outer loop: runs once per URL, on that URL's table
        for row in Tableau_groupe.find_all('tr'):
            cols = row.find_all('td')
            print(cols)

            if len(cols) == 4:
                Tous_les_groupes.append((b, cols[0].text.strip(), cols[1].text.strip(), cols[2].text.strip(), cols[3].text.strip()))

    except AttributeError:
        # soup.find returned None (no matching table on this page): skip it
        pass
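
Once the loops are nested, the collected tuples can go straight into a DataFrame. Here is a hedged sketch of the whole flow that runs offline: the helper `extract_rows` and the inline HTML in `SAMPLE_PAGES` are invented for illustration, and the sketch assumes the real pages contain a table with class "table" and four td cells per data row. With the real URLs you would pass `requests.get(url).text` as the HTML instead. The column names are borrowed from the question.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented HTML standing in for downloaded pages (assumption, not real data).
SAMPLE_PAGES = {
    "page-1": '<table class="table"><tr><th>G</th></tr>'
              '<tr><td>A</td><td>1</td><td>x</td><td>10</td></tr></table>',
    "page-2": '<table class="table">'
              '<tr><td>B</td><td>2</td><td>y</td><td>20</td></tr>'
              '<tr><td>C</td><td>3</td><td>z</td><td>30</td></tr></table>',
    "page-3": "<p>no table here</p>",
}


def extract_rows(url, html):
    """Return one (url, cell, cell, cell, cell) tuple per four-cell row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "table"})
    if table is None:              # page without the expected table
        return []
    rows = []
    for tr in table.find_all("tr"):
        cols = tr.find_all("td")   # header rows use <th>, so they yield no <td>
        if len(cols) == 4:
            rows.append((url, *(c.text.strip() for c in cols)))
    return rows


all_rows = []
for url, html in SAMPLE_PAGES.items():
    all_rows.extend(extract_rows(url, html))   # accumulate across pages

# A list of tuples can be passed to DataFrame directly, without np.asarray.
df = pd.DataFrame(all_rows, columns=["url", "G", "Tx", "note", "Number"])
print(df)
```

Returning an empty list for a page without a table keeps the main loop free of try/except, and calling `df.to_csv('output.csv')` afterwards would write the rows from every page.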
