The dataframe has columns for Rank and Pts (Points). The Rank column (dytpe object) contains numeric ranks of 1-25 plus “RV” for teams that received points but did not rank in the top 25. The Pts column is dtype int64. Since Pts for teams that did not make the top 25 are included in the data, I’m able to re-rank the teams based on Pts and thus extend rankings beyond the top 25. The resulting revrank column ranks teams from 1 to between 37 and 61, depending how many teams received points in that poll. Revrank is the first new column I create.
The revrank column should equal the Rank column for the first 25 teams, but before I can test it I need to create a new column that converts Rank to numeric. The result is rank_int, which is my second created column. Then I try to create a third column that calculates the difference between the two created columns, and this is where I get the KeyError. Here's the chain:
all_wks_clean = (all_wks_raw
#create new column that converts Rank to numeric-this works
.assign(rank_int = pd.to_numeric(all_wks_raw['Rank'], errors='coerce').fillna(0))
#create new column that re-ranks teams based on Points: extends rankings beyond original 25-this works
.assign(gprank = all_wks_raw.reset_index(drop=True).groupby(['Year','Type'])['Pts'].rank(ascending=0,method='min'))
#create new column that takes the difference between gprank and rank_int columns created above-this fails with KeyError: 'gprank'
.assign(ck_rank = all_wks_raw['gprank'] - all_wks_raw['rank_int'])
)
Are the results of the first two assignments not being passed to the third? Am I missing something in the syntax? Thanks for the help.
Edited 7/20/2022 to add complete code; note that this code scrapes data from the College Poll Archive web site:
dict = {1119: [2016, '2016 Final AP Football Poll', 'Final'], 1120: [2017, '2017 Preseason AP Football Poll', 'Preseason'],
1135: [2017, '2017 Final AP Football Poll', 'Final'], 1136: [2018, '2018 Preseason AP Football Poll', 'Preseason'],
1151: [2018, '2018 Final AP Football Poll', 'Final'], 1152: [2019, '2019 Preseason AP Football Poll', 'Preseason']}
#get one week of poll data from College Poll Archive ID parameter
def getdata(id):
coldefs = {'ID':key, 'Year': value[0], 'Title': value[1], 'Type':value[2]} #define dictionary of scalar columns to add to dataframe
urlseg = 'https://www.collegepollarchive.com/football/ap/seasons.cfm?appollid='
url = urlseg str(id)
dfs = pd.read_html(url)
df = dfs[0].assign(**coldefs)
return df
all_wks_raw = pd.DataFrame()
for key, value in dict.items():
print(key, value[0], value[2])
onewk = getdata(key)
all_wks_raw = all_wks_raw.append(onewk)
all_wks_clean = (all_wks_raw
#create new column that converts Rank to numeric-this works
.assign(rank_int = pd.to_numeric(all_wks_raw['Rank'], errors='coerce').fillna(0))
#create new column that re-ranks teams based on Points: extends rankings beyond original 25-this works
.assign(gprank = all_wks_raw.reset_index(drop=True).groupby(['Year','Type'])['Pts'].rank(ascending=0,method='min'))
#create new column that takes the difference between gprank and rank_int columns created above-this fails with KeyError: 'gprank'
.assign(ck_rank = all_wks_raw['gprank'] - all_wks_raw['rank_int'])
)
CodePudding user response:
If accessing a column that doesn't yet exist, that must be done through a lambda
:
dfs = pd.read_html('https://www.collegepollarchive.com/football/ap/seasons.cfm?seasonid=2019')
df = dfs[0][['Team (FPV)', 'Rank', 'Pts']].copy()
df['Year'] = 2016
df['Type'] = 'final'
df = df.assign(rank_int = pd.to_numeric(df['Rank'], errors='coerce').fillna(0).astype(int),
gprank = df.groupby(['Year','Type'])['Pts'].rank(ascending=0,method='min'),
ck_rank = lambda x: x['gprank'].sub(x['rank_int']))
print(df)
Output:
Team (FPV) Rank Pts Year Type rank_int gprank ck_rank
0 LSU (62) 1 1550 2016 final 1 1.0 0.0
1 Clemson 2 1487 2016 final 2 2.0 0.0
2 Ohio State 3 1426 2016 final 3 3.0 0.0
3 Georgia 4 1336 2016 final 4 4.0 0.0
4 Oregon 5 1249 2016 final 5 5.0 0.0
5 Florida 6 1211 2016 final 6 6.0 0.0
6 Oklahoma 7 1179 2016 final 7 7.0 0.0
7 Alabama 8 1159 2016 final 8 8.0 0.0
8 Penn State 9 1038 2016 final 9 9.0 0.0
9 Minnesota 10 952 2016 final 10 10.0 0.0
10 Wisconsin 11 883 2016 final 11 11.0 0.0
11 Notre Dame 12 879 2016 final 12 12.0 0.0
12 Baylor 13 827 2016 final 13 13.0 0.0
13 Auburn 14 726 2016 final 14 14.0 0.0
14 Iowa 15 699 2016 final 15 15.0 0.0
15 Utah 16 543 2016 final 16 16.0 0.0
16 Memphis 17 528 2016 final 17 17.0 0.0
17 Michigan 18 468 2016 final 18 18.0 0.0
18 Appalachian State 19 466 2016 final 19 19.0 0.0
19 Navy 20 415 2016 final 20 20.0 0.0
20 Cincinnati 21 343 2016 final 21 21.0 0.0
21 Air Force 22 209 2016 final 22 22.0 0.0
22 Boise State 23 188 2016 final 23 23.0 0.0
23 UCF 24 78 2016 final 24 24.0 0.0
24 Texas 25 69 2016 final 25 25.0 0.0
25 Texas A&M RV 54 2016 final 0 26.0 26.0
26 Florida Atlantic RV 46 2016 final 0 27.0 27.0
27 Washington RV 39 2016 final 0 28.0 28.0
28 Virginia RV 28 2016 final 0 29.0 29.0
29 USC RV 16 2016 final 0 30.0 30.0
30 San Diego State RV 13 2016 final 0 31.0 31.0
31 Arizona State RV 12 2016 final 0 32.0 32.0
32 SMU RV 10 2016 final 0 33.0 33.0
33 Tennessee RV 8 2016 final 0 34.0 34.0
34 California RV 6 2016 final 0 35.0 35.0
35 Kansas State RV 2 2016 final 0 36.0 36.0
36 Kentucky RV 2 2016 final 0 36.0 36.0
37 Louisiana RV 2 2016 final 0 36.0 36.0
38 Louisiana Tech RV 2 2016 final 0 36.0 36.0
39 North Dakota State RV 2 2016 final 0 36.0 36.0
40 Hawaii NR 0 2016 final 0 41.0 41.0
41 Louisville NR 0 2016 final 0 41.0 41.0
42 Oklahoma State NR 0 2016 final 0 41.0 41.0
CodePudding user response:
Adding to BeRT2me's answer, when chaining, lambda's are pretty much always the way to go. When you use the original dataframe name, pandas looks at the dataframe as it was before the statement was executed. To avoid confusion, go with:
df = df.assign(rank_int = lambda x: pd.to_numeric(x['Rank'], errors='coerce').fillna(0).astype(int),
gprank = lambda x: x.groupby(['Year','Type'])['Pts'].rank(ascending=0,method='min'),
ck_rank = lambda x: x['gprank'].sub(x['rank_int']))
The x
you define is the dataframe at that state in the pipe.
This helps especially when your pipes get longer. E.g, if you filter out some rows or aggregate you get different results (or maybe error) depending what you're trying to do.
For example, if you were just looking at the relative rank of 3 teams:
df = pd.DataFrame({
'Team (FPV)': list('abcde'),
'Rank': list(range(5)),
'Pts': list(range(5)),
})
df['Year'] = 2016
df['Type'] = 'final'
df = (df
.loc[lambda x: x['Team (FPV)'].isin(["b", "c", "d"])]
.assign(bcd_rank = lambda x: x.groupby(['Year','Type'])['Pts'].rank(ascending=0,method='min'))
)
print(df)
gives:
Team (FPV) Rank Pts Year Type bcd_rank
1 b 1 1 2016 final 3.0
2 c 2 2 2016 final 2.0
3 d 3 3 2016 final 1.0
Whereas:
df = pd.DataFrame({
'Team (FPV)': list('abcde'),
'Rank': list(range(5)),
'Pts': list(range(5)),
})
df['Year'] = 2016
df['Type'] = 'final'
df = (df
.loc[lambda x: x['Team (FPV)'].isin(["b", "c", "d"])]
.assign(bcd_rank = df.groupby(['Year','Type'])['Pts'].rank(ascending=0,method='min'))
)
print(df)
gives a different ranking:
Team (FPV) Rank Pts Year Type bcd_rank
1 b 1 1 2016 final 4.0
2 c 2 2 2016 final 3.0
3 d 3 3 2016 final 2.0
If you want to go deeper, I'd recommend https://tomaugspurger.github.io/method-chaining.html to go on your reading list.