I am trying to append scraped data to a dataframe:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import requests
import csv
url="https://en.wikipedia.org/wiki/List_of_German_football_champions"
page=requests.get(url).content
soup=BeautifulSoup(page,"html.parser")
seasons=[]
first_places=[]
runner_ups=[]
third_places=[]
top_scorrers=[]
tbody=soup.find_all("tbody")[7]
trs=tbody.find_all("tr")
for tr in trs:
season = tr.find_all("a")[0].text
first_place = tr.find_all("a")[1].text
runner_up = tr.find_all("a")[2].text
third_place = tr.find_all("a")[3].text
top_scorer = tr.find_all("a")[4].text
seasons.append(season)
first_places.append(first_place)
runner_ups.append(runner_up)
third_places.append(third_place)
top_scorrers.append(top_scorer)
tuples=list(zip(seasons,first_places,runner_ups,third_places,top_scorrers))
df=pd.DataFrame(tuples,columns=["Season","FirstPlace","RunnerUp","ThirdPlace","TopScorrer"])
df
Is there an easier way to append data directly to an empty dataframe without creating lists and then zipping them?
CodePudding user response:
The easiest way is to use pd.read_html
. If you want to get the table from "Bundesliga (since 1963)", extract the 8th table:
# Extract all <table> from url
dfs = pd.read_html(url)
# Bundeliga (since 1963)
df = dfs[7]
# Cleanup your dataframe
df.columns = df.columns.str.split('[').str[0]
df['Champions'] = df['Champions'].str.extract(r'([^(] )')
Output
>>> df
Season Champions Runners-up Third place Top scorer(s) Goals
0 1963–64 1. FC Köln Meidericher SV Eintracht Frankfurt Uwe Seeler 30
1 1964–65 Werder Bremen 1. FC Köln Borussia Dortmund Rudi Brunnenmeier 24
2 1965–66 TSV 1860 Munich Borussia Dortmund Bayern Munich Friedhelm Konietzka 26
3 1966–67 Eintracht Braunschweig TSV 1860 Munich Borussia Dortmund Lothar Emmerich, Gerd Müller 28
4 1967–68 1. FC Nürnberg Werder Bremen Borussia Mönchengladbach Hannes Löhr 27
5 1968–69 Bayern Munich Alemannia Aachen Borussia Mönchengladbach Gerd Müller 30
6 1969–70 Borussia Mönchengladbach Bayern Munich Hertha BSC Gerd Müller 38
7 1970–71 Borussia Mönchengladbach Bayern Munich Hertha BSC Lothar Kobluhn 24
8 1971–72 Bayern Munich Schalke 04 Borussia Mönchengladbach Gerd Müller 40
9 1972–73 Bayern Munich 1. FC Köln Fortuna Düsseldorf Gerd Müller 36
10 1973–74 Bayern Munich Borussia Mönchengladbach Fortuna Düsseldorf Josef Heynckes, Gerd Müller 30
11 1974–75 Borussia Mönchengladbach Hertha BSC Hamburger SV Josef Heynckes 27
12 1975–76 Borussia Mönchengladbach Hamburger SV Bayern Munich Klaus Fischer 29
13 1976–77 Borussia Mönchengladbach Schalke 04 Eintracht Braunschweig Dieter Müller 34
14 1977–78 1. FC Köln Borussia Mönchengladbach Hertha BSC Dieter Müller, Gerd Müller 24
15 1978–79 Hamburger SV VfB Stuttgart 1. FC Kaiserslautern Klaus Allofs 22
16 1979–80 Bayern Munich Hamburger SV VfB Stuttgart Karl-Heinz Rummenigge 26
17 1980–81 Bayern Munich Hamburger SV VfB Stuttgart Karl-Heinz Rummenigge 29
18 1981–82 Hamburger SV 1. FC Köln Bayern Munich Horst Hrubesch 27
19 1982–83 Hamburger SV Werder Bremen VfB Stuttgart Rudi Völler 23
20 1983–84 VfB Stuttgart Hamburger SV Borussia Mönchengladbach Karl-Heinz Rummenigge 26
21 1984–85 Bayern Munich Werder Bremen 1. FC Köln Klaus Allofs 26
22 1985–86 Bayern Munich Werder Bremen Bayer Uerdingen Stefan Kuntz 22
23 1986–87 Bayern Munich Hamburger SV Borussia Mönchengladbach Uwe Rahn 24
24 1987–88 Werder Bremen Bayern Munich 1. FC Köln Jürgen Klinsmann 19
25 1988–89 Bayern Munich 1. FC Köln Werder Bremen Thomas Allofs, Roland Wohlfarth 17
26 1989–90 Bayern Munich 1. FC Köln Eintracht Frankfurt Jørn Andersen 18
27 1990–91 1. FC Kaiserslautern Bayern Munich Werder Bremen Roland Wohlfarth 21
28 1991–92 VfB Stuttgart Borussia Dortmund Eintracht Frankfurt Fritz Walter 22
29 1992–93 Werder Bremen Bayern Munich Eintracht Frankfurt Ulf Kirsten, Anthony Yeboah 20
30 1993–94 Bayern Munich 1. FC Kaiserslautern Bayer Leverkusen Stefan Kuntz, Anthony Yeboah 18
31 1994–95 Borussia Dortmund Werder Bremen Freiburg Heiko Herrlich, Mario Basler 20
32 1995–96 Borussia Dortmund Bayern Munich Schalke 04 Fredi Bobic 17
33 1996–97 Bayern Munich Bayer Leverkusen Borussia Dortmund Ulf Kirsten 22
34 1997–98 1. FC Kaiserslautern Bayern Munich Bayer Leverkusen Ulf Kirsten 22
35 1998–99 Bayern Munich Bayer Leverkusen Hertha BSC Michael Preetz 23
36 1999–2000 Bayern Munich Bayer Leverkusen Hamburger SV Martin Max 19
37 2000–01 Bayern Munich Schalke 04 Borussia Dortmund Sergej Barbarez, Ebbe Sand 22
38 2001–02 Borussia Dortmund Bayer Leverkusen Bayern Munich Márcio Amoroso, Martin Max 18
39 2002–03 Bayern Munich VfB Stuttgart Borussia Dortmund Giovane Élber, Thomas Christiansen 21
40 2003–04 Werder Bremen Bayern Munich Bayer Leverkusen Aílton 28
41 2004–05 Bayern Munich Schalke 04 Werder Bremen Marek Mintál 24
42 2005–06 Bayern Munich Werder Bremen Hamburger SV Miroslav Klose 25
43 2006–07 VfB Stuttgart Schalke 04 Werder Bremen Theofanis Gekas 20
44 2007–08 Bayern Munich Werder Bremen Schalke 04 Luca Toni 24
45 2008–09 VfL Wolfsburg Bayern Munich VfB Stuttgart Grafite 28
46 2009–10 Bayern Munich Schalke 04 Werder Bremen Edin Džeko 22
47 2010–11 Borussia Dortmund Bayer Leverkusen Bayern Munich Mario Gómez 28
48 2011–12 Borussia Dortmund Bayern Munich Schalke 04 Klaas-Jan Huntelaar 29
49 2012–13 Bayern Munich Borussia Dortmund Bayer Leverkusen Stefan Kießling 25
50 2013–14 Bayern Munich Borussia Dortmund Schalke 04 Robert Lewandowski 20
51 2014–15 Bayern Munich VfL Wolfsburg Borussia Mönchengladbach Alexander Meier 19
52 2015–16 Bayern Munich Borussia Dortmund Bayer Leverkusen Robert Lewandowski 30
53 2016–17 Bayern Munich RB Leipzig Borussia Dortmund Pierre-Emerick Aubameyang 31
54 2017–18 Bayern Munich Schalke 04 1899 Hoffenheim Robert Lewandowski 29
55 2018–19 Bayern Munich Borussia Dortmund RB Leipzig Robert Lewandowski 22
56 2019–20 Bayern Munich Borussia Dortmund RB Leipzig Robert Lewandowski 34
57 2020–21 Bayern Munich RB Leipzig Borussia Dortmund Robert Lewandowski 41
58 2021–22 Bayern Munich Borussia Dortmund Bayer Leverkusen Robert Lewandowski 35
CodePudding user response:
While still using pandas
"simplest" way to create your DataFrame is going with pandas.read_html()
:
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_German_football_champions')[7]
To simply rename the columns and get rid of the [7]
:
df.columns = ['Season', 'Champions', 'Runners-up', 'Third place',
'Top scorer(s)', 'Goals']
Output:
Season | Champions | Runners-up | Third place | Top scorer(s) | Goals | |
---|---|---|---|---|---|---|
0 | 1963–64 | 1. FC Köln (2) | Meidericher SV | Eintracht Frankfurt | Uwe Seeler | 30 |
1 | 1964–65 | Werder Bremen (1) | 1. FC Köln | Borussia Dortmund | Rudi Brunnenmeier | 24 |
2 | 1965–66 | TSV 1860 Munich (1) | Borussia Dortmund | Bayern Munich | Friedhelm Konietzka | 26 |
3 | 1966–67 | Eintracht Braunschweig (1) | TSV 1860 Munich | Borussia Dortmund | Lothar Emmerich, Gerd Müller | 28 |
4 | 1967–68 | 1. FC Nürnberg (9) | Werder Bremen | Borussia Mönchengladbach | Hannes Löhr | 27 |
...
An alternativ to avoid all these lists, get cleaner in process and using BeautifulSoup
directly is to create more structured data - A single list of dicts:
data = []
for tr in soup.select('table:nth-of-type(8) tr:not(:has(th))'):
data.append({
'season':tr.find_all("a")[0].text,
'first_place': tr.find_all("a")[1].text,
'runner_up': tr.find_all("a")[2].text,
'third_place': tr.find_all("a")[3].text,
'top_scorer': tr.find_all("a")[4].text,
})
pd.DataFrame(data)
Example
import pandas as pd
from bs4 import BeautifulSoup
import requests
url="https://en.wikipedia.org/wiki/List_of_German_football_champions"
page=requests.get(url).content
soup=BeautifulSoup(page,"html.parser")
data = []
for tr in soup.select('table:nth-of-type(8) tr:not(:has(th))'):
data.append({
'season':tr.find_all("a")[0].text,
'first_place': tr.find_all("a")[1].text,
'runner_up': tr.find_all("a")[2].text,
'third_place': tr.find_all("a")[3].text,
'top_scorer': tr.find_all("a")[4].text,
})
pd.DataFrame(data)