Replacing NaNs with Mean Value using Pandas

Time:02-01

Say I have a Dataframe called Data with shape (71067, 4):

       StartTime          EndDateTime        TradeDate  Values
0   2018-12-31 23:00:00 2018-12-31 23:30:00 2019-01-01  -44.676
1   2018-12-31 23:30:00 2019-01-01 00:00:00 2019-01-01  -36.113
2   2019-01-01 00:00:00 2019-01-01 00:30:00 2019-01-01  -19.229
3   2019-01-01 00:30:00 2019-01-01 01:00:00 2019-01-01  -23.606
4   2019-01-01 01:00:00 2019-01-01 01:30:00 2019-01-01  -25.899
... ... ... ... ...
    2023-01-30 20:30:00 2023-01-30 21:00:00 2023-01-30  -27.198
    2023-01-30 21:00:00 2023-01-30 21:30:00 2023-01-30  -13.221
    2023-01-30 21:30:00 2023-01-30 22:00:00 2023-01-30  -12.034
    2023-01-30 22:00:00 2023-01-30 22:30:00 2023-01-30  -16.464
    2023-01-30 22:30:00 2023-01-30 23:00:00 2023-01-30  -25.441
71067 rows × 4 columns

When running Data.isna().sum().sum() I realise I have some NaN values in the dataset:

Data.isna().sum().sum()
> 1391

Shown here:

Data[Data['Values'].isna()].reset_index(drop = True).sort_values(by = 'StartTime')

0   2019-01-01 03:30:00 2019-01-01 04:00:00 2019-01-01  NaN
1   2019-01-04 02:30:00 2019-01-04 03:00:00 2019-01-04  NaN
2   2019-01-04 03:00:00 2019-01-04 03:30:00 2019-01-04  NaN
3   2019-01-04 03:30:00 2019-01-04 04:00:00 2019-01-04  NaN
4   2019-01-04 04:00:00 2019-01-04 04:30:00 2019-01-04  NaN
... ... ... ... ...
1386    2022-12-06 13:00:00 2022-12-06 13:30:00 2022-12-06  NaN
1387    2022-12-06 13:30:00 2022-12-06 14:00:00 2022-12-06  NaN
1388    2022-12-22 11:00:00 2022-12-22 11:30:00 2022-12-22  NaN
1389    2023-01-25 11:00:00 2023-01-25 11:30:00 2023-01-25  NaN
1390    2023-01-25 11:30:00 2023-01-25 12:00:00 2023-01-25  NaN

Is there any way of replacing each NaN value in the dataset with the mean value of the corresponding half hour across the 70,000-plus rows? See below:

Data['HH'] = pd.to_datetime(Data['StartTime']).dt.time
Data.groupby(['HH'], as_index=False)[['Values']].mean().head(10)
# Only showing first 10 means

    HH          Values
0   00:00:00    5.236811
1   00:30:00    2.056571
2   01:00:00    4.157455
3   01:30:00    2.339253
4   02:00:00    2.658238
5   02:30:00    0.230557
6   03:00:00    0.217599
7   03:30:00    -0.630243
8   04:00:00    -0.989919
9   04:30:00    -0.494372

For example, if a value is missing against 04:00, can it be replaced with the 04:00 mean value (-0.989919) as per the above table of means?

Any help greatly appreciated.

CodePudding user response:

Group the DataFrame by HH, then call transform('mean') on Values to broadcast each group's mean back to the shape of the original column. Finally, use fillna to replace the null values with those means:

avg = Data.groupby('HH')['Values'].transform('mean')
Data['Values'] = Data['Values'].fillna(avg)
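To check the behaviour on something small, here is a minimal sketch of the same groupby / transform / fillna pattern. The frame `toy` and its six rows are made up for illustration and are not taken from the question's data; only the column names StartTime, Values and HH follow the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three days, two half-hour slots per day.
toy = pd.DataFrame({
    "StartTime": pd.to_datetime([
        "2019-01-01 04:00", "2019-01-01 04:30",
        "2019-01-02 04:00", "2019-01-02 04:30",
        "2019-01-03 04:00", "2019-01-03 04:30",
    ]),
    "Values": [1.0, 10.0, np.nan, 20.0, 3.0, np.nan],
})

# Time-of-day key, built the same way as HH in the question.
toy["HH"] = toy["StartTime"].dt.time

# Mean per half-hour slot, returned with the same index as toy.
# NaN entries are skipped when each group's mean is computed.
avg = toy.groupby("HH")["Values"].transform("mean")

# Only the missing entries are replaced; existing values are untouched.
toy["Values"] = toy["Values"].fillna(avg)

print(toy[["StartTime", "Values"]])
```

Here the missing 04:00 value becomes 2.0 (the mean of 1.0 and 3.0) and the missing 04:30 value becomes 15.0 (the mean of 10.0 and 20.0). Note that if every value in a given half-hour slot were NaN, that slot's mean would also be NaN and those rows would stay unfilled.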