I am trying to scrape data from a website with the following code (reference: https://towardsdatascience.com/web-scraping-scraping-table-data-1665b6b2271c):
df = pd.DataFrame(columns=headings)
for i in range(102, 158):
    URL = 'http://bulibox.de/abschlusstabellen/'
    URL_ = URL + 'B100' + str(i + 1) + '.html'
    r = urllib.request.urlopen(URL_).read()
    soup = BeautifulSoup(r, 'lxml')
    table = soup.find('table', attrs={'class': 'abschluss'})
    body = table.find_all("tr")
    head = body[0]
    body_rows = body[1:]
    headings = []
    for item in head.find_all('th'):
        item = (item.text).rstrip('\n')
        headings.append(item)
    all_rows = []
    for row_num in range(len(body_rows)):
        row = []
        for row_item in body_rows[row_num].find_all("td"):
            aa = re.sub("(\xa0)|(\n)|,", "", row_item.text)
            row.append(aa)
        all_rows.append(row)
    df1 = pd.DataFrame(data=all_rows, columns=headings)
    df.append(df1, ignore_index=True)
I 'initialized' the DataFrame as an empty DataFrame with only the correct column names, and then tried to loop over the pages of the website. It seems to work partially, because df1 holds the data from the last page. But df is still the empty DataFrame I initialized. What did I do wrong here?
CodePudding user response:
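df.append does not modify df in place. It returns a new DataFrame, and your loop discards that result, so df stays empty. Reassigning the result (df = df.append(...)) would work on older pandas versions, but it copies all the data on every iteration, and DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. Instead, collect the per-page DataFrames in a plain Python data structure such as a list, and concatenate them once after the loop to get your DataFrame: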
data = []
for i in range(102, 158):
    # do stuff here
    df1 = ...
    data.append(df1)
df = pd.concat(data, ignore_index=True)
Output:
>>> df
      Platz             Mannschaft  Spiele    S-U-N        Tore  Pkt.         Statistik
0        1.       TSV 1860 München      34  20-10-4  80:40( 40)    50  Saison 1965/1966
1        2.      Borussia Dortmund      34   19-9-6  70:36( 34)    47  Saison 1965/1966
2        3.         Bayern München      34   20-7-7  71:38( 33)    47  Saison 1965/1966
3        4.          Werder Bremen      34  21-3-10  76:40( 36)    45  Saison 1965/1966
4        5.             1. FC Köln      34   19-6-9  74:41( 33)    44  Saison 1965/1966
...     ...                    ...     ...      ...         ...   ...               ...
1005    14.             Hertha BSC      34  8-11-15  41:52(-11)    35  Saison 2020/2021
1006    15.  DSC Arminia Bielefeld      34   9-8-17  26:52(-26)    35  Saison 2020/2021
1007    16.             1. FC Köln      34   8-9-17  34:60(-26)    33  Saison 2020/2021
1008    17.       SV Werder Bremen      34  7-10-17  36:57(-21)    31  Saison 2020/2021
1009    18.          FC Schalke 04      34   3-7-24  25:86(-61)    16  Saison 2020/2021

[1010 rows x 7 columns]
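
If you want to try the collect-then-concat pattern without hitting the website, here is a minimal sketch. Note that fetch_season_table is a hypothetical stand-in for the scraping step (it is not part of the original code), and the team names and the three seasons are made-up sample data:

```python
import pandas as pd


def fetch_season_table(season):
    """Hypothetical placeholder for the scraping step.

    Returns a small made-up table for one season instead of
    downloading and parsing a page.
    """
    rows = [
        ["1.", "Team A", season],
        ["2.", "Team B", season],
    ]
    return pd.DataFrame(rows, columns=["Platz", "Mannschaft", "Statistik"])


# Collect one DataFrame per season in a plain Python list.
frames = []
for season in ["1965/1966", "1966/1967", "1967/1968"]:
    frames.append(fetch_season_table(season))

# pd.concat returns a new DataFrame, so the result must be assigned.
# ignore_index=True renumbers the rows 0..n-1 across all seasons.
df = pd.concat(frames, ignore_index=True)

print(df.shape)  # (6, 3): 3 seasons x 2 rows, 3 columns
```

In the real loop, the body of fetch_season_table would be the urlopen / BeautifulSoup code from the question, returning df1 for a single page.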