Pandas Method Chaining: getting KeyError on calculated column-CodePudding

I’m scraping

The dataframe has columns for Rank and Pts (Points). The Rank column (dytpe object) contains numeric ranks of 1-25 plus “RV” for teams that received points but did not rank in the top 25. The Pts column is dtype int64. Since Pts for teams that did not make the top 25 are included in the data, I’m able to re-rank the teams based on Pts and thus extend rankings beyond the top 25. The resulting revrank column ranks teams from 1 to between 37 and 61, depending how many teams received points in that poll. Revrank is the first new column I create.

The revrank column should equal the Rank column for the first 25 teams, but before I can test it I need to create a new column that converts Rank to numeric. The result is rank_int, which is my second created column. Then I try to create a third column that calculates the difference between the two created columns, and this is where I get the KeyError. Here's the chain:

all_wks_clean = (all_wks_raw
    #create new column that converts Rank to numeric-this works
    .assign(rank_int = pd.to_numeric(all_wks_raw['Rank'], errors='coerce').fillna(0))
                 
    #create new column that re-ranks teams based on Points: extends rankings beyond original 25-this works
    .assign(gprank = all_wks_raw.reset_index(drop=True).groupby(['Year','Type'])['Pts'].rank(ascending=0,method='min'))
                 
    #create new column that takes the difference between gprank and rank_int columns created above-this fails with KeyError: 'gprank'
    .assign(ck_rank = all_wks_raw['gprank'] - all_wks_raw['rank_int'])
)

Are the results of the first two assignments not being passed to the third? Am I missing something in the syntax? Thanks for the help.

Edited 7/20/2022 to add complete code; note that this code scrapes data from the College Poll Archive web site:

dict = {1119: [2016, '2016 Final AP Football Poll', 'Final'], 1120: [2017, '2017 Preseason AP Football Poll', 'Preseason'],
        1135: [2017, '2017 Final AP Football Poll', 'Final'], 1136: [2018, '2018 Preseason AP Football Poll', 'Preseason'],
        1151: [2018, '2018 Final AP Football Poll', 'Final'], 1152: [2019, '2019 Preseason AP Football Poll', 'Preseason']}

#get one week of poll data from College Poll Archive ID parameter
def getdata(id):
    coldefs = {'ID':key, 'Year': value[0], 'Title': value[1], 'Type':value[2]} #define dictionary of scalar columns to add to dataframe
    urlseg = 'https://www.collegepollarchive.com/football/ap/seasons.cfm?appollid='
    url = urlseg   str(id)
    dfs = pd.read_html(url)
    df = dfs[0].assign(**coldefs)
    return df

all_wks_raw = pd.DataFrame()

for key, value in dict.items():
    print(key, value[0], value[2])
    onewk = getdata(key)
    all_wks_raw = all_wks_raw.append(onewk)
   
all_wks_clean = (all_wks_raw
    #create new column that converts Rank to numeric-this works
    .assign(rank_int = pd.to_numeric(all_wks_raw['Rank'], errors='coerce').fillna(0))
                 
    #create new column that re-ranks teams based on Points: extends rankings beyond original 25-this works
    .assign(gprank = all_wks_raw.reset_index(drop=True).groupby(['Year','Type'])['Pts'].rank(ascending=0,method='min'))
                 
    #create new column that takes the difference between gprank and rank_int columns created above-this fails with KeyError: 'gprank'
    .assign(ck_rank = all_wks_raw['gprank'] - all_wks_raw['rank_int'])
)

CodePudding user response：

If accessing a column that doesn't yet exist, that must be done through a lambda:

dfs = pd.read_html('https://www.collegepollarchive.com/football/ap/seasons.cfm?seasonid=2019')
df = dfs[0][['Team (FPV)', 'Rank', 'Pts']].copy()
df['Year'] = 2016
df['Type'] = 'final'
df = df.assign(rank_int = pd.to_numeric(df['Rank'], errors='coerce').fillna(0).astype(int),
               gprank = df.groupby(['Year','Type'])['Pts'].rank(ascending=0,method='min'),
               ck_rank = lambda x: x['gprank'].sub(x['rank_int']))
print(df)

Output:

            Team (FPV) Rank   Pts  Year   Type  rank_int  gprank  ck_rank
0             LSU (62)    1  1550  2016  final         1     1.0      0.0
1              Clemson    2  1487  2016  final         2     2.0      0.0
2           Ohio State    3  1426  2016  final         3     3.0      0.0
3              Georgia    4  1336  2016  final         4     4.0      0.0
4               Oregon    5  1249  2016  final         5     5.0      0.0
5              Florida    6  1211  2016  final         6     6.0      0.0
6             Oklahoma    7  1179  2016  final         7     7.0      0.0
7              Alabama    8  1159  2016  final         8     8.0      0.0
8           Penn State    9  1038  2016  final         9     9.0      0.0
9            Minnesota   10   952  2016  final        10    10.0      0.0
10           Wisconsin   11   883  2016  final        11    11.0      0.0
11          Notre Dame   12   879  2016  final        12    12.0      0.0
12              Baylor   13   827  2016  final        13    13.0      0.0
13              Auburn   14   726  2016  final        14    14.0      0.0
14                Iowa   15   699  2016  final        15    15.0      0.0
15                Utah   16   543  2016  final        16    16.0      0.0
16             Memphis   17   528  2016  final        17    17.0      0.0
17            Michigan   18   468  2016  final        18    18.0      0.0
18   Appalachian State   19   466  2016  final        19    19.0      0.0
19                Navy   20   415  2016  final        20    20.0      0.0
20          Cincinnati   21   343  2016  final        21    21.0      0.0
21           Air Force   22   209  2016  final        22    22.0      0.0
22         Boise State   23   188  2016  final        23    23.0      0.0
23                 UCF   24    78  2016  final        24    24.0      0.0
24               Texas   25    69  2016  final        25    25.0      0.0
25           Texas A&M   RV    54  2016  final         0    26.0     26.0
26    Florida Atlantic   RV    46  2016  final         0    27.0     27.0
27          Washington   RV    39  2016  final         0    28.0     28.0
28            Virginia   RV    28  2016  final         0    29.0     29.0
29                 USC   RV    16  2016  final         0    30.0     30.0
30     San Diego State   RV    13  2016  final         0    31.0     31.0
31       Arizona State   RV    12  2016  final         0    32.0     32.0
32                 SMU   RV    10  2016  final         0    33.0     33.0
33           Tennessee   RV     8  2016  final         0    34.0     34.0
34          California   RV     6  2016  final         0    35.0     35.0
35        Kansas State   RV     2  2016  final         0    36.0     36.0
36            Kentucky   RV     2  2016  final         0    36.0     36.0
37           Louisiana   RV     2  2016  final         0    36.0     36.0
38      Louisiana Tech   RV     2  2016  final         0    36.0     36.0
39  North Dakota State   RV     2  2016  final         0    36.0     36.0
40              Hawaii   NR     0  2016  final         0    41.0     41.0
41          Louisville   NR     0  2016  final         0    41.0     41.0
42      Oklahoma State   NR     0  2016  final         0    41.0     41.0

CodePudding user response：

Adding to BeRT2me's answer, when chaining, lambda's are pretty much always the way to go. When you use the original dataframe name, pandas looks at the dataframe as it was before the statement was executed. To avoid confusion, go with:

df = df.assign(rank_int = lambda x: pd.to_numeric(x['Rank'], errors='coerce').fillna(0).astype(int),
               gprank = lambda x: x.groupby(['Year','Type'])['Pts'].rank(ascending=0,method='min'),
               ck_rank = lambda x: x['gprank'].sub(x['rank_int']))

The x you define is the dataframe at that state in the pipe.

This helps especially when your pipes get longer. E.g, if you filter out some rows or aggregate you get different results (or maybe error) depending what you're trying to do.

For example, if you were just looking at the relative rank of 3 teams:

df = pd.DataFrame({
    'Team (FPV)': list('abcde'), 
    'Rank': list(range(5)),
    'Pts':  list(range(5)),
})

df['Year'] = 2016
df['Type'] = 'final'
df = (df
      .loc[lambda x: x['Team (FPV)'].isin(["b", "c", "d"])]
      .assign(bcd_rank = lambda x: x.groupby(['Year','Type'])['Pts'].rank(ascending=0,method='min'))
     )
print(df)

gives:

  Team (FPV)  Rank  Pts  Year   Type  bcd_rank
1          b     1    1  2016  final       3.0
2          c     2    2  2016  final       2.0
3          d     3    3  2016  final       1.0

Whereas:

df = pd.DataFrame({
    'Team (FPV)': list('abcde'), 
    'Rank': list(range(5)),
    'Pts':  list(range(5)),
})

df['Year'] = 2016
df['Type'] = 'final'
df = (df
      .loc[lambda x: x['Team (FPV)'].isin(["b", "c", "d"])]
      .assign(bcd_rank = df.groupby(['Year','Type'])['Pts'].rank(ascending=0,method='min'))
     )
print(df)

gives a different ranking:

  Team (FPV)  Rank  Pts  Year   Type  bcd_rank
1          b     1    1  2016  final       4.0
2          c     2    2  2016  final       3.0
3          d     3    3  2016  final       2.0

If you want to go deeper, I'd recommend https://tomaugspurger.github.io/method-chaining.html to go on your reading list.