My current problem goes as follows. Consider the dataframe:
name score date
0 Alice 22.0 2020-12-31
15 Alice 4.0 2005-12-31
5 Alice 1.0 2003-12-31
10 Alice NaN 2000-12-31
6 Bob 8.0 2001-11-02
16 Bob 1.0 2004-11-02
11 Bob 5.0 2003-11-02
1 Bob 45.0 1980-11-02
12 Chuck 9.0 2003-12-03
2 Chuck 4.0 2015-12-03
7 Chuck 9.0 2001-12-03
17 Chuck 2.0 2004-12-03
18 Daren 3.0 2004-03-13
21 Daren 89.0 2015-08-13
3 Daren NaN 2015-03-13
20 Daren 12.0 2015-05-13
8 Daren 5.0 2015-03-17
13 Daren 93.0 2003-03-13
14 Elisa 11.0 2003-01-24
9 Elisa 19.0 2001-01-24
19 Elisa 23.0 2004-01-24
4 Elisa 16.0 2010-01-24
Some score
values are nan
. I would like to replace these values with the person's closest score (w.r.t. time
), GIVEN that the time difference is within 1 year, i.e. 365 days. If there is no such time, then keep the value as nan
.
In the dataframe above, Daren's nan
score value would be replaced to 5, as there is a score which was recorded within a year of the nan
date and it is the smallest time difference from the other possible scores. However for Alice, there is no such score which was recorded within a year of the nan
score, thus her nan
score will be kept as nan
.
To reproduce the dataframe, use this code:
import numpy as np
import pandas as pd
import datetime
data = {"name": ['Alice', 'Bob', 'Chuck', 'Daren', 'Elisa', 'Alice', 'Bob', 'Chuck', 'Daren', 'Elisa','Alice', 'Bob', 'Chuck', 'Daren', 'Elisa','Alice', 'Bob', 'Chuck', 'Daren', 'Elisa','Daren','Daren'],
"score": [22,45,4,np.nan,16,1,8,9,5,19,np.nan,5,9,93,11,4,1,2,3,23,12,89],
"date": ['31/12/2020','11/02/1980','12/03/2015','13/03/2015','24/01/2010','31/12/2003','11/02/2001','12/03/2001','17/03/2015','24/01/2001','31/12/2000','11/02/2003','12/03/2003','13/03/2003','24/01/2003','31/12/2005','11/02/2004','12/03/2004','13/03/2004','24/01/2004','13/05/2015','13/08/2015']}
df = pd.DataFrame(data = data)
df['date'] = pd.to_datetime(df['date'])
df.sort_values(by='name')
CodePudding user response:
Assuming df is the same dataframe as given in the question above:
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y') # Create a datetime out of date column
df = df.sort_values('date').reset_index(drop=True) # Sorting the dates to get closest values from NaN
nan_index = df.index[df['score'].isna()].to_list() # Get all index values where value is NaN
for idx in nan_index: # Loop over all NaN values and check condition
if idx-1 < 0: # If no time available previously, we skip such cases as there is no score within a year.
continue
curr_date = df.at[idx, 'date']
prev_date = df.at[idx-1, 'date']
if ((curr_date - prev_date).days <= 365): # Check if time difference is within 365 days
df.at[idx, 'score'] = df.at[idx 1, 'score'] # Replace NaN with next closest value
print(df)
CodePudding user response:
First sort values based on names
and extract year from date column. Then group by year
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['name'])
df = df.sort_values(['date'])
df['year'] = pd.DatetimeIndex(df['date']).year
print(df.groupby('year').bfill())
df.sort_index(inplace=True)
Gives #
name score date
0 Alice 22.0 2020-12-31
1 Bob 45.0 1980-11-02
2 Chuck 4.0 2015-12-03
3 Daren 5.0 2015-03-13
4 Elisa 16.0 2010-01-24
5 Alice 1.0 2003-12-31
6 Bob 8.0 2001-11-02
7 Chuck 9.0 2001-12-03
8 Daren 5.0 2015-03-17
9 Elisa 19.0 2001-01-24
10 Alice NaN 2000-12-31
11 Bob 5.0 2003-11-02
12 Chuck 9.0 2003-12-03
13 Daren 93.0 2003-03-13
14 Elisa 11.0 2003-01-24
15 Alice 4.0 2005-12-31
16 Bob 1.0 2004-11-02
17 Chuck 2.0 2004-12-03
18 Daren 3.0 2004-03-13
19 Elisa 23.0 2004-01-24
20 Daren 12.0 2015-05-13
21 Daren 89.0 2015-08-13
>>>