I have the following DataFrame:
Date | Track | Surface | FGrating | HorseId | Last FGrating at Sha Tin, grass |
---|---|---|---|---|---|
2017-09-03 | Sha Tin | Grass | 110 | 1736 | |
2017-09-16 | Sha Tin | Grass | 124 | 1736 | 110 |
2017-10-14 | Sha Tin | Grass | 118 | 1736 | 124 |
2017-11-11 | Sha Tin | Grass | 107 | 1736 | 118 |
2018-03-28 | Happy Valley | Grass | 117 | 1736 | |
2018-04-11 | Happy Valley | Grass | 114 | 1736 | |
2018-09-22 | Sha Tin | Grass | 124 | 1736 | 107 |
As you can see, there are blank cells in the Last FGrating at Sha Tin, grass column. These blanks have two causes:
- There is no last FGrating (the first record);
- There is a last FGrating for this track, but it was not written when the race for the same horse was on another track (Happy Valley grass, in this table).
So far, I managed to do only the filling part, according to this code:
def return_mask_and_text_from_tracks(data, track_no, metric):
    # Build a boolean row mask and a column name for one track/surface pair.
    if track_no == 0:  # Sha Tin - grass
        mask = (data.Track == 'Sha Tin') & (data.Surface == 'Grass')
        text = str(metric) + ' at Sha-Tin Grass'
    elif track_no == 1:  # Sha Tin - dirt
        mask = (data.Track == 'Sha Tin') & (data.Surface == 'Dirt')
        text = str(metric) + ' at Sha-Tin Dirt'
    elif track_no == 2:  # Happy Valley - grass
        mask = (data.Track == 'Happy Valley') & (data.Surface == 'Grass')
        text = str(metric) + ' at Happy Valley Grass'
    return mask, text


def compute_last_fgrating(data, mask=None):
    # Previous FGrating per horse; with a mask, only the masked rows are considered.
    if mask is None:
        return data.groupby('HorseId')['FGrating'].shift(1)
    return data.loc[mask].groupby('HorseId')['FGrating'].shift(1)


# featured_data is the DataFrame shown above.
for i in range(3):
    mask, text = return_mask_and_text_from_tracks(featured_data, i, 'Last FGrating')
    featured_data[text] = compute_last_fgrating(featured_data, mask=mask)
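To make the second cause concrete, here is a minimal sketch that rebuilds the sample rows from the table and applies the same masked shift. The frame name `df` and the hard-coded rows are only for illustration. Because the shifted Series only has index labels for the Sha Tin grass rows, assigning it back leaves NaN on the Happy Valley rows.

```python
import pandas as pd

# Sample rows copied from the table above.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2017-09-03", "2017-09-16", "2017-10-14", "2017-11-11",
        "2018-03-28", "2018-04-11", "2018-09-22",
    ]),
    "Track": ["Sha Tin"] * 4 + ["Happy Valley"] * 2 + ["Sha Tin"],
    "Surface": ["Grass"] * 7,
    "FGrating": [110, 124, 118, 107, 117, 114, 124],
    "HorseId": [1736] * 7,
})

col = "Last FGrating at Sha Tin, grass"
mask = (df["Track"] == "Sha Tin") & (df["Surface"] == "Grass")

# Shift within each horse, looking only at the Sha Tin grass rows.
last_on_track = df.loc[mask].groupby("HorseId")["FGrating"].shift(1)

# Assignment aligns on the index, so rows outside the mask get NaN.
df[col] = last_on_track
print(df[["Date", "Track", col]])
```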
I am looking for ways to fill the rightmost column with the following values:
- 0, when no last FGrating exists (the first record, in this case)
- The last FGrating on Sha Tin grass on the records where the race was on other tracks, for the same horse.
The final table should look like this:
Date | Track | Surface | FGrating | HorseId | Last FGrating at Sha Tin, grass |
---|---|---|---|---|---|
2017-09-03 | Sha Tin | Grass | 110 | 1736 | 0 (no previous FGrating exists) |
2017-09-16 | Sha Tin | Grass | 124 | 1736 | 110 |
2017-10-14 | Sha Tin | Grass | 118 | 1736 | 124 |
2017-11-11 | Sha Tin | Grass | 107 | 1736 | 118 |
2018-03-28 | Happy Valley | Grass | 117 | 1736 | 118 (the last FGrating on Sha Tin grass at this point) |
2018-04-11 | Happy Valley | Grass | 114 | 1736 | 118 |
2018-09-22 | Sha Tin | Grass | 124 | 1736 | 107 |
How can I do it?
CodePudding user response:
Here you only need
df["Last FGrating at Sha Tin, grass"] = df["Last FGrating at Sha Tin, grass"].ffill().fillna(0)
Explanation
ffill() (equivalent to the older fillna(method="ffill"), which is deprecated in recent pandas versions) fills each NaN with the latest value available above it. After that, the only NaN left is on the first row, and fillna(0) fills it with 0.
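One caveat: if the frame holds more than one horse, a plain forward fill would carry one horse's last value into the next horse's first rows. A minimal sketch of a per-horse variant follows, assuming the rows are already sorted by date within each horse. HorseId 9999 is a made-up second horse added only to show the effect.

```python
import numpy as np
import pandas as pd

col = "Last FGrating at Sha Tin, grass"

# Column values as in the question's first table, plus a hypothetical
# second horse (9999) that has no previous Sha Tin grass rating yet.
df = pd.DataFrame({
    "HorseId": [1736] * 7 + [9999] * 2,
    col: [np.nan, 110, 124, 118, np.nan, np.nan, 107, np.nan, np.nan],
})

# Forward-fill inside each horse's rows only, then put 0 where
# no earlier rating exists for that horse.
df[col] = df.groupby("HorseId")[col].ffill().fillna(0)
print(df)
```

With a single horse this gives the same result as the ungrouped version, so the grouping only matters once several horses share the frame.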