Pandas DataFrame - Fillna with Mean by Month

I have a dataframe with some "NaN" and outlier values which I want to replace with the mean value of the specific month.

df[["arrival_day", "ib_units", "month","year"]]

            arrival_day     ib_units    month   year
 37         2020-01-01      262           1     2020
235         2020-01-02      2301          1     2020
290         2020-01-02      145           1     2020
476         2020-01-02      6584          1     2020
551         2020-01-02      30458         1     2020
    ...           ...       ...     ...     ...
1479464     2022-07-19      56424         7     2022
1479490     2022-07-19      130090        7     2022
1479510     2022-07-19      3552          7     2022
1479556     2022-07-19      23779         7     2022
1479756     2022-07-20      2882          7     2022

I know I could use pandas.DataFrame.fillna, as in df.fillna(df.mean()), but that would fill with the overall mean of the whole dataset. I want to fill the "NaNs" with the mean value of the specific month in the specific year.

This is what I have tried, but this solution is not straightforward and only calculates the mean by year, not by month:

mask_2020 = (df['arrival_day'] >= '2020-01-01') & (df['arrival_day'] <= '2020-12-31')
df_2020 = df.loc[mask_2020]

mask_2021 = (df['arrival_day'] >= '2021-01-01') & (df['arrival_day'] <= '2021-12-31')
df_2021 = df.loc[mask_2021]

mask_2022 = (df['arrival_day'] >= '2022-01-01') & (df['arrival_day'] <= '2022-12-31')
df_2022 = df.loc[mask_2022]

mean_2020 = df_2020.ib_units.mean()
mean_2021 = df_2021.ib_units.mean()
mean_2022 = df_2022.ib_units.mean()

# this finds quartile outliers and replaces them with the mean value of the specific year
for x in ['ib_units']:
   q75,q25 = np.percentile(df_2020.loc[:,x],[75,25])
   intr_qr = q75-q25

   max = q75+(1.5*intr_qr)
   min = q25-(1.5*intr_qr)

   df_2020.loc[df_2020[x] < min,x] = mean_2020
   df_2020.loc[df_2020[x] > max,x] = mean_2020

for x in ['ib_units']:
   q75,q25 = np.percentile(df_2021.loc[:,x],[75,25])
   intr_qr = q75-q25

   max = q75+(1.5*intr_qr)
   min = q25-(1.5*intr_qr)

   df_2021.loc[df_2021[x] < min,x] = mean_2021
   df_2021.loc[df_2021[x] > max,x] = mean_2021

for x in ['ib_units']:
   q75,q25 = np.percentile(df_2022.loc[:,x],[75,25])
   intr_qr = q75-q25

   max = q75+(1.5*intr_qr)
   min = q25-(1.5*intr_qr)

   df_2022.loc[df_2022[x] < min,x] = mean_2022
   df_2022.loc[df_2022[x] > max,x] = mean_2022

So how can I do this more concisely, and by month instead of by year? Thanks!

CodePudding user response：

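I think you are overthinking this. Please see whether the code below works for you. For outliers, the approach should be the same as for filling the N/A values: compute the mean per year and month, then assign it to the rows you want to replace.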

import pandas as pd
import datetime as dt

# Sample Data:
df = pd.DataFrame({'date': ['2000-01-02', '2000-01-02', '2000-01-15', '2000-01-27', 
                            '2000-06-03', '2000-06-29', '2000-06-15', '2000-06-29',
                            '2001-01-02', '2001-01-02', '2001-01-15', '2001-01-27'],
                   'val':[5,7,None,4,
                          8,1,None,9,
                          2,3,None,7]})

# Convert to datetime and extract year and month:
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

# Compute the mean value per year and month:
tem = df.groupby(['year', 'month'])[['val']].mean().reset_index()
tem.rename(columns={'val': 'val_mean'}, inplace=True)
tem

# Merge and fill NA:
df = pd.merge(df, tem, how='left', on=['year', 'month'])
df.loc[df['val'].isna(),'val'] = df['val_mean']
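If you want to avoid the temporary table and the merge, and handle the outliers in the same pass, something along these lines might work. This is only a sketch under a few assumptions: the helper name fill_by_group_mean and the sample values are made up for illustration, outliers are defined by the usual 1.5 * IQR rule computed per year and month, and the replacement mean is taken from the values that remain after the outliers are masked (the code in the question instead uses a mean that still includes the outliers).

```python
import numpy as np
import pandas as pd


def fill_by_group_mean(df, value_col, group_cols, iqr_factor=1.5):
    """Replace NaNs and IQR outliers in value_col with the per-group mean.

    Hypothetical helper, not part of pandas.
    """
    out = df.copy()
    out[value_col] = out[value_col].astype(float)
    grouped = out.groupby(group_cols)[value_col]

    # Per-group quartiles, broadcast back onto the original rows.
    q25 = grouped.transform(lambda s: s.quantile(0.25))
    q75 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q75 - q25
    lower = q25 - iqr_factor * iqr
    upper = q75 + iqr_factor * iqr

    # Mask outliers as NaN so they are treated like missing values.
    is_outlier = (out[value_col] < lower) | (out[value_col] > upper)
    out.loc[is_outlier, value_col] = np.nan

    # Assumption: the mean is computed from the remaining (non-outlier) values.
    group_mean = out.groupby(group_cols)[value_col].transform("mean")
    out[value_col] = out[value_col].fillna(group_mean)
    return out


# Made-up sample data using the column names from the question.
sample = pd.DataFrame({
    "arrival_day": ["2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03",
                    "2020-01-04", "2020-01-05",
                    "2020-02-01", "2020-02-02", "2020-02-03"],
    "ib_units": [10, 12, 11, 13, 1000, None, 20, 22, None],
})
sample["arrival_day"] = pd.to_datetime(sample["arrival_day"])
sample["year"] = sample["arrival_day"].dt.year
sample["month"] = sample["arrival_day"].dt.month

result = fill_by_group_mean(sample, "ib_units", ["year", "month"])
print(result)
```

Grouping on both year and month keeps January 2020 separate from January 2021, which is what the question asks for. transform returns a result aligned with the original index, so no merge is needed.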