Home > front end >  Replace nan values with the value of the closest date in a dataframe
Replace nan values with the value of the closest date in a dataframe

Time:12-09

My current problem goes as follows. Consider the dataframe:

       name     score   date
0      Alice    22.0    2020-12-31
15     Alice    4.0     2005-12-31
5      Alice    1.0     2003-12-31
10     Alice    NaN     2000-12-31
6      Bob      8.0     2001-11-02
16     Bob      1.0     2004-11-02
11     Bob      5.0     2003-11-02
1      Bob      45.0    1980-11-02
12     Chuck    9.0     2003-12-03
2      Chuck    4.0     2015-12-03
7      Chuck    9.0     2001-12-03
17     Chuck    2.0     2004-12-03
18     Daren    3.0     2004-03-13
21     Daren    89.0    2015-08-13
3      Daren    NaN     2015-03-13
20     Daren    12.0    2015-05-13
8      Daren    5.0     2015-03-17
13     Daren    93.0    2003-03-13
14     Elisa    11.0    2003-01-24
9      Elisa    19.0    2001-01-24
19     Elisa    23.0    2004-01-24
4      Elisa    16.0    2010-01-24

Some score values are nan. I would like to replace these values with the person's closest score (w.r.t. time), GIVEN that the time difference is within 1 year, i.e. 365 days. If there is no such time, then keep the value as nan.

In the dataframe above, Daren's nan score value would be replaced to 5, as there is a score which was recorded within a year of the nan date and it is the smallest time difference from the other possible scores. However for Alice, there is no such score which was recorded within a year of the nan score, thus her nan score will be kept as nan.

To reproduce the dataframe, use this code:

import numpy as np
import pandas as pd
import datetime

data = {"name": ['Alice', 'Bob', 'Chuck', 'Daren', 'Elisa', 'Alice', 'Bob', 'Chuck', 'Daren', 'Elisa','Alice', 'Bob', 'Chuck', 'Daren', 'Elisa','Alice', 'Bob', 'Chuck', 'Daren', 'Elisa','Daren','Daren'],
        "score": [22,45,4,np.nan,16,1,8,9,5,19,np.nan,5,9,93,11,4,1,2,3,23,12,89],
        "date": ['31/12/2020','11/02/1980','12/03/2015','13/03/2015','24/01/2010','31/12/2003','11/02/2001','12/03/2001','17/03/2015','24/01/2001','31/12/2000','11/02/2003','12/03/2003','13/03/2003','24/01/2003','31/12/2005','11/02/2004','12/03/2004','13/03/2004','24/01/2004','13/05/2015','13/08/2015']}
df = pd.DataFrame(data = data)
df['date'] = pd.to_datetime(df['date'])
df.sort_values(by='name')

CodePudding user response:

Assuming df is the same dataframe as given in the question above:

df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y') # Create a datetime out of date column
df = df.sort_values('date').reset_index(drop=True) # Sorting the dates to get closest values from NaN
nan_index = df.index[df['score'].isna()].to_list() # Get all index values where value is NaN
for idx in nan_index: # Loop over all NaN values and check condition
    if idx-1 < 0: # If no time available previously, we skip such cases as there is no score within a year.
        continue
    curr_date = df.at[idx, 'date']
    prev_date = df.at[idx-1, 'date']
    if ((curr_date - prev_date).days <= 365): # Check if time difference is within 365 days
        df.at[idx, 'score'] = df.at[idx 1, 'score'] # Replace NaN with next closest value
print(df)

CodePudding user response:

First sort values based on names and extract year from date column. Then group by year

df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['name'])
df = df.sort_values(['date'])
df['year'] = pd.DatetimeIndex(df['date']).year
print(df.groupby('year').bfill())
df.sort_index(inplace=True)

Gives #

     name  score       date
0   Alice   22.0 2020-12-31
1     Bob   45.0 1980-11-02
2   Chuck    4.0 2015-12-03
3   Daren    5.0 2015-03-13
4   Elisa   16.0 2010-01-24
5   Alice    1.0 2003-12-31
6     Bob    8.0 2001-11-02
7   Chuck    9.0 2001-12-03
8   Daren    5.0 2015-03-17
9   Elisa   19.0 2001-01-24
10  Alice    NaN 2000-12-31
11    Bob    5.0 2003-11-02
12  Chuck    9.0 2003-12-03
13  Daren   93.0 2003-03-13
14  Elisa   11.0 2003-01-24
15  Alice    4.0 2005-12-31
16    Bob    1.0 2004-11-02
17  Chuck    2.0 2004-12-03
18  Daren    3.0 2004-03-13
19  Elisa   23.0 2004-01-24
20  Daren   12.0 2015-05-13
21  Daren   89.0 2015-08-13
>>> 
  • Related