So, my objective with this code is extract information about a rookie in my NFL team. I want to compare his performance with players that finished last season in top-10 statistics in his position when they were in first season where they played more than 10 games.
For that, I used as reference a post in Towards Data Science, where they explained how to scrape NFL data.
Here is my code:
#url page
url_mac = 'https://www.pro-football-reference.com/years/2021/passing.htm'
#opening URL with BS
html_mac = urlopen(url_mac)
stats_macpage = BeautifulSoup(html_mac)
#collecting table rows
column_headers = stats_macpage.findAll('tr')[0]
column_headers = [i.getText() for i in column_headers.findAll('th')]
#getting stats of each row
rows = stats_macpage.findAll('tr')[1:]
qb_stats = []
for i in range(len(rows)):
qb_stats.append([col.getText() for col in rows[i].findAll('td')])
#creating a data frame
data = pd.DataFrame(qb_stats, columns = column_headers[1:])
#rename column of sack yards from yards to y_sack
new_columns = data.columns.values
new_columns[-6] = 'y_sack'
data.columns = new_columns
#selecting specifics stats
categories = ['Cmp%', 'Yds', 'TD', 'Int', 'Y/A', 'Rate']
#first filter
data_radar = data[['Player', 'Tm'] categories]
#selecting specific player
data_mac = data_radar[data_radar['Player'] == 'Mac Jones']
I did it for all 11 players that I want the data and I had concatenated in the final, but you can imagine how bad my code is looking.
How can I improve it to create a loop ? I already tried some things, but for all of these ideas, either they didn't work well or were beyond my capabilities to execute successfully.
Between my ideas was take all data of the last 20 years and try find year by year, but that look a bit unnecessary, because I already know what years I want. My problem is specifically in this part, because in the final I could create a list and then an "if" and only take the first year of each player in my list played > 10 games.
url_mac = 'https://www.pro-football-reference.com/years/2021/passing.htm'
Thank you, all.
CodePudding user response:
To load all tables from years 2011-2021 to one dataframe, you can use this example:
import pandas as pd
# url page
url_mac = "https://www.pro-football-reference.com/years/{}/passing.htm"
all_dfs = []
for year in range(2011, 2022):
print("Getting table for year {}".format(year))
df = pd.read_html(url_mac.format(year))[0]
df = df[~df["Rk"].eq("Rk")]
df["Year"] = year
all_dfs.append(df)
df = pd.concat(all_dfs)
print(df)
df.to_csv("data.csv", index=False)
Prints:
Rk Player Tm Age Pos G GS QBrec Cmp Att Cmp% Yds TD TD% Int Int% 1D Lng Y/A AY/A Y/C Y/G Rate QBR Sk Yds.1 NY/A ANY/A Sk% 4QC GWD Year
0 1 Drew Brees* NOR 32 QB 16 16 13-3-0 468 657 71.2 5476 46 7.0 14 2.1 279 79 8.3 8.8 11.7 342.3 110.6 82.3 24 158 7.81 8.23 3.5 3 4 2011
1 2 Tom Brady* NWE 34 QB 16 16 13-3-0 401 611 65.6 5235 39 6.4 12 2.0 262 99 8.6 9.0 13.1 327.2 105.6 73.8 32 173 7.87 8.25 5.0 1 2 2011
2 3 Matthew Stafford DET 23 QB 16 16 10-6-0 421 663 63.5 5038 41 6.2 16 2.4 243 73 7.6 7.7 12.0 314.9 97.2 60.5 36 257 6.84 6.98 5.2 3 4 2011
3 4 Eli Manning* NYG 30 QB 16 16 9-7-0 359 589 61.0 4933 29 4.9 16 2.7 218 99 8.4 8.1 13.7 308.3 92.9 64.2 28 199 7.67 7.45 4.5 5 6 2011
...
and saves data.csv
(screenshot from LibreOffice):
CodePudding user response:
Go with pandas
to parse those tables. Just loop through the seasons in the url (don't forget to add that in the dataframes otherwise you won;t know which rows are for what seasons:
import pandas as pd
# To get last 10 seasons
years_look_back = 10
years_list = list(range(2021,2021-(years_look_back 1),-1))
years_list.reverse()
# Or just make a list of specific years:
years_list = [2001, 2005, 2008, 2010, 2011, 2015]
cols = ['Player', 'Tm','Cmp%', 'Yds', 'TD', 'Int', 'Y/A', 'Rate']
df_list = []
for year in years_list:
url_mac = f'https://www.pro-football-reference.com/years/{year}/passing.htm'
temp_df = pd.read_html(url_mac)[0][cols]
temp_df['Season'] = year
temp_df = temp_df[~temp_df["Player"].eq("Player")]
df_list.append(temp_df)
print(f'Collect: {year}')
data_radar = pd.concat(df_list)
#selecting specific player
data_mac = data_radar[data_radar['Player'] == 'Mac Jones']