Home > Back-end >  Pandas : fill NaN with the closest value, according to a category column
Pandas : fill NaN with the closest value, according to a category column

Time:11-08

Let's take a sample dataframe :

df = pd.DataFrame({"Date": ["2022-10-01","2022-10-02","2022-10-03","2022-10-04","2022-10-05","2022-10-06","2022-10-01","2022-10-02","2022-10-03","2022-10-04","2022-10-05","2022-10-06"],
                   "Animal" :["Cat","Cat","Cat","Cat","Cat","Cat","Dog","Dog","Dog","Dog","Dog","Dog"],
                   "Quantity":[np.nan,4,3,5,1,np.nan,6,5,np.nan,np.nan,2,1]})

          Date Animal  Quantity
0   2022-10-01    Cat       NaN
1   2022-10-02    Cat       4.0
2   2022-10-03    Cat       3.0
3   2022-10-04    Cat       5.0
4   2022-10-05    Cat       1.0
5   2022-10-06    Cat       NaN
6   2022-10-01    Dog       6.0
7   2022-10-02    Dog       5.0
8   2022-10-03    Dog       NaN
9   2022-10-04    Dog       NaN
10  2022-10-05    Dog       2.0
11  2022-10-06    Dog       1.0

I would like to fill the NaN values in the column Quantity using the following method :

  • Replace the NaN values with the closest value that is before the NaN value and which share the same value in Animal column
  • If there is still some NaN values, replace the remaining NaN values with the closest value that is after the Nan value and which share the same value in Animal column

I thought to Series.interpolate but I don't know how to deal with the Animal column. Would you please know an efficient way to reach the expected output ?

Expected output :

          Date Animal  Quantity
0   2022-10-01    Cat         4
1   2022-10-02    Cat         4
2   2022-10-03    Cat         3
3   2022-10-04    Cat         5
4   2022-10-05    Cat         1
5   2022-10-06    Cat         1
6   2022-10-01    Dog         6
7   2022-10-02    Dog         5
8   2022-10-03    Dog         5
9   2022-10-04    Dog         5
10  2022-10-05    Dog         2
11  2022-10-06    Dog         1
``

CodePudding user response:

You can use ffill/bfill per group:

df['Quantity'] = (df.groupby('Animal', group_keys=False)['Quantity']
                    .apply(lambda s: s.bfill().ffill())
                 )

Output:

          Date Animal  Quantity
0   2022-10-01    Cat       4.0
1   2022-10-02    Cat       4.0
2   2022-10-03    Cat       3.0
3   2022-10-04    Cat       5.0
4   2022-10-05    Cat       1.0
5   2022-10-06    Cat       1.0
6   2022-10-01    Dog       6.0
7   2022-10-02    Dog       5.0
8   2022-10-03    Dog       5.0
9   2022-10-04    Dog       5.0
10  2022-10-05    Dog       2.0
11  2022-10-06    Dog       1.0

CodePudding user response:

Addition tp @mozway sol (as I cant edit it). I've absorved the column should be typecasted to int

df['Quantity'] = (df.groupby('Animal', group_keys=False)['Quantity']
                    .apply(lambda s: s.bfill().ffill())
                 )

df['Quantity'] = df['Quantity'].astype('int')
print(df)

output #

          Date Animal  Quantity
0   2022-10-01    Cat         4
1   2022-10-02    Cat         4
2   2022-10-03    Cat         3
3   2022-10-04    Cat         5
4   2022-10-05    Cat         1
5   2022-10-06    Cat         1
6   2022-10-01    Dog         6
7   2022-10-02    Dog         5
8   2022-10-03    Dog         2
9   2022-10-04    Dog         2
10  2022-10-05    Dog         2
11  2022-10-06    Dog         1
  • Related