I want to learn Python and have chosen a small private football data project for it. My problem is the following: I pull the data of the past 4 seasons, which works with the code below. Now I want to filter out, for each league, the teams that were not in that league in all 4 seasons (promoted and relegated teams should disappear). I have no idea how to do this, because the check has to be done per league: it must iterate over the seasons of each league separately, not over all seasons of all leagues at once.
import pandas as pd
import numpy as np
# leagues for England. E0 is Premier League, E1 is Championship and so on...
leagues = ["E0", "E1", "E2", "E3", "EC"]
seasons = ["2223", "2122", "2021", "1920"]
baseUrl = "https://www.football-data.co.uk/mmz4281/"
urls = []
for league in leagues:
    for season in seasons:
        url = str(baseUrl) + str(season) + "/" + str(league) + ".csv"
        urls.append(url)
# load the data.
column_names = ["Div", "HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR"]
dfs = [pd.read_csv(url, encoding='cp1252', usecols=column_names)
       for url in urls]
df = pd.concat(dfs, ignore_index=True)
For example: if a team is relegated from E0 to E1 in season 2021, it will not show up in E0 in season 2122. In that case, all rows in all 4 seasons of E0 where this team appears should be deleted, because I want cleaned data without promotion/relegation.
How can I implement this?
CodePudding user response:
Your code is almost ready. You only need to add a small for-loop that filters out teams which played in more than one division:
print(df.shape)
# (8264, 6)
for team in df.HomeTeam.unique():
    # divisions in which this team played at least one home game
    played_divs = df[df.HomeTeam == team].Div.unique()
    if len(played_divs) > 1:
        # drop every match involving this team, home or away
        df = df[(df.HomeTeam != team) & (df.AwayTeam != team)]
print(df.shape)
# (2948, 6) (5316 rows were filtered for me)
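Note that the loop above only catches teams that show up in more than one of the loaded divisions. A team that entered or left the lowest loaded division (EC) from or to a league that was not downloaded would stay in the data. A possible alternative, closer to "iterate over each season per league", is sketched below. It assumes the DataFrame has an extra "Season" column, which the question's code does not create, and the function name keep_ever_present_teams and the small demo table are made up for illustration.

```python
import pandas as pd


def keep_ever_present_teams(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only matches between teams that played in that division in every season.

    Assumes columns "Div", "Season", "HomeTeam" and "AwayTeam";
    "Season" is an assumption and is not part of the original CSV columns.
    """
    kept = []
    for _, div_df in df.groupby("Div"):
        # One set of teams (home or away) per season of this division.
        teams_by_season = [
            set(season_df["HomeTeam"]) | set(season_df["AwayTeam"])
            for _, season_df in div_df.groupby("Season")
        ]
        # Teams that appear in every season of this division.
        ever_present = set.intersection(*teams_by_season)
        mask = (div_df["HomeTeam"].isin(ever_present)
                & div_df["AwayTeam"].isin(ever_present))
        kept.append(div_df[mask])
    return pd.concat(kept, ignore_index=True)


# Tiny made-up example: C is relegated from E0 and D is promoted to E0.
demo = pd.DataFrame({
    "Div":      ["E0", "E0", "E0", "E0", "E1", "E1", "E1", "E1"],
    "Season":   ["2122", "2122", "2223", "2223", "2122", "2122", "2223", "2223"],
    "HomeTeam": ["A", "B", "A", "D", "D", "E", "C", "F"],
    "AwayTeam": ["B", "C", "B", "A", "E", "F", "E", "E"],
})
cleaned = keep_ever_present_teams(demo)
print(cleaned)
```

With the question's data, the "Season" column could be added while loading, for example by looping over (league, season) pairs and calling pd.read_csv(url, encoding='cp1252', usecols=column_names).assign(Season=season) for each file before concatenating.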