I'm having trouble converting tables from a list of URLs into one large DataFrame containing the rows from every URL. My code seems to run fine, but when I export a new CSV it only contains the last 10 rows from the last URL instead of the rows from each URL. Does anyone know why?

PS: I tried to find the answer by browsing Stack Overflow, but I didn't find one.
import pandas as pd
from bs4 import BeautifulSoup
import requests
# numpy for data manipulation
import numpy as np
# URL 0 - 10 SCRAPE
BASE_URL = [
'https://datan.fr/groupes/legislature-16/re',
'https://datan.fr/groupes/legislature-16/rn',
'https://datan.fr/groupes/legislature-16/lfi-nupes',
'https://datan.fr/groupes/legislature-16/lr',
'https://datan.fr/groupes/legislature-16/dem',
'https://datan.fr/groupes/legislature-16/soc',
'https://datan.fr/groupes/legislature-16/hor',
'https://datan.fr/groupes/legislature-16/ecolo',
'https://datan.fr/groupes/legislature-16/gdr-nupes',
'https://datan.fr/groupes/legislature-16/liot',
]
Tous_les_groupes = []

for b in BASE_URL:
    html = requests.get(b).text
    soup = BeautifulSoup(html, "html.parser")
    # identify the table we want to scrape
    Tableau_groupe = soup.find('table', {"class": "table"})
    print(Tableau_groupe)

try:
    for row in Tableau_groupe.find_all('tr'):
        cols = row.find_all('td')
        print(cols)
        if len(cols) == 4:
            Tous_les_groupes.append((b, cols[0].text.strip(), cols[1].text.strip(),
                                     cols[2].text.strip(), cols[3].text.strip()))
            #print(Tous_les_groupes)
except:
    pass
Groupes_DF = np.asarray(Tous_les_groupes)
#print(Groupes_DF)
#print(len(Groupes_DF))
df = pd.DataFrame(Groupes_DF)
df.columns = ['url', 'G', 'Tx', 'note', 'Number']
#print(df.head(10))
df.to_csv('output.csv')
Thanks for your help, and have a great day, everyone.
CodePudding user response:
In the first loop you assign the result of `soup.find` to `Tableau_groupe`, but each iteration overwrites the previous value. Your second loop only runs after the first loop has finished, so it only ever sees the table from the last URL. Try moving the second loop inside the first one:
for b in BASE_URL:
    html = requests.get(b).text
    soup = BeautifulSoup(html, "html.parser")
    # identify the table we want to scrape
    Tableau_groupe = soup.find('table', {"class": "table"})
    print(Tableau_groupe)
    try:
        for row in Tableau_groupe.find_all('tr'):
            cols = row.find_all('td')
            print(cols)
            if len(cols) == 4:
                Tous_les_groupes.append((b, cols[0].text.strip(), cols[1].text.strip(),
                                         cols[2].text.strip(), cols[3].text.strip()))
    except:
        pass
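To see why this fixes the CSV, here is a minimal, self-contained sketch of the same accumulate-then-build pattern. It uses two made-up HTML snippets in place of the pages fetched with `requests.get` (the URLs and table contents here are hypothetical, not the real datan.fr data); the point is that every page appends to the same list, and the DataFrame is built once at the end:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the HTML fetched from each URL.
pages = {
    'url-a': '<table class="table"><tr><td>A</td><td>1</td><td>2</td><td>3</td></tr></table>',
    'url-b': '<table class="table"><tr><td>B</td><td>4</td><td>5</td><td>6</td></tr></table>',
}

rows = []
for url, html in pages.items():
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find('table', {"class": "table"})
    if table is None:   # skip pages without the expected table
        continue
    # the inner loop runs INSIDE the outer one, so rows from every
    # page accumulate in the same list
    for tr in table.find_all('tr'):
        cols = tr.find_all('td')
        if len(cols) == 4:
            rows.append((url, *(c.text.strip() for c in cols)))

# Build the DataFrame once, after all pages have been scraped.
df = pd.DataFrame(rows, columns=['url', 'G', 'Tx', 'note', 'Number'])
df.to_csv('output.csv', index=False)
```

With the inner loop indented into the outer one, `rows` ends up holding one entry per table row per page, so the exported CSV contains every URL's rows rather than only the last one. An explicit `if table is None` check is also safer than a bare `try/except: pass`, which silently hides real errors.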