I have a dataframe with some NaN and outlier values that I want to replace with the mean value of the specific month.
df[["arrival_day", "ib_units", "month","year"]]
arrival_day ib_units month year
37 2020-01-01 262 1 2020
235 2020-01-02 2301 1 2020
290 2020-01-02 145 1 2020
476 2020-01-02 6584 1 2020
551 2020-01-02 30458 1 2020
... ... ... ... ...
1479464 2022-07-19 56424 7 2022
1479490 2022-07-19 130090 7 2022
1479510 2022-07-19 3552 7 2022
1479556 2022-07-19 23779 7 2022
1479756 2022-07-20 2882 7 2022
I know about pandas.DataFrame.fillna, e.g. df.fillna(df.mean()), but that fills with the overall mean of the whole dataset. I want to fill the NaNs with the mean value of the specific month in the specific year.
This is what I have tried, but it is not straightforward and it only calculates the mean by year, not by month:
import numpy as np

# Split the data by year (.copy() avoids SettingWithCopyWarning on later assignment)
mask_2020 = (df['arrival_day'] >= '2020-01-01') & (df['arrival_day'] <= '2020-12-31')
df_2020 = df.loc[mask_2020].copy()
mask_2021 = (df['arrival_day'] >= '2021-01-01') & (df['arrival_day'] <= '2021-12-31')
df_2021 = df.loc[mask_2021].copy()
mask_2022 = (df['arrival_day'] >= '2022-01-01') & (df['arrival_day'] <= '2022-12-31')
df_2022 = df.loc[mask_2022].copy()
# Yearly means (NaN values are skipped by default)
mean_2020 = df_2020.ib_units.mean()
mean_2021 = df_2021.ib_units.mean()
mean_2022 = df_2022.ib_units.mean()
# Find IQR outliers and replace them with the mean value of the specific year.
# np.nanpercentile ignores NaN values; np.percentile would return NaN here.
for x in ['ib_units']:
    q75, q25 = np.nanpercentile(df_2020.loc[:, x], [75, 25])
    intr_qr = q75 - q25
    upper = q75 + (1.5 * intr_qr)
    lower = q25 - (1.5 * intr_qr)
    df_2020.loc[df_2020[x] < lower, x] = mean_2020
    df_2020.loc[df_2020[x] > upper, x] = mean_2020
for x in ['ib_units']:
    q75, q25 = np.nanpercentile(df_2021.loc[:, x], [75, 25])
    intr_qr = q75 - q25
    upper = q75 + (1.5 * intr_qr)
    lower = q25 - (1.5 * intr_qr)
    df_2021.loc[df_2021[x] < lower, x] = mean_2021
    df_2021.loc[df_2021[x] > upper, x] = mean_2021
for x in ['ib_units']:
    q75, q25 = np.nanpercentile(df_2022.loc[:, x], [75, 25])
    intr_qr = q75 - q25
    upper = q75 + (1.5 * intr_qr)
    lower = q25 - (1.5 * intr_qr)
    df_2022.loc[df_2022[x] < lower, x] = mean_2022
    df_2022.loc[df_2022[x] > upper, x] = mean_2022
So how can I do this more concisely, and by month rather than by year? Thanks!
CodePudding user response:
I think you are overthinking this. See whether the code below works for you. For outliers, the approach should be the same as for filling the NaN values.
import pandas as pd
# Sample Data:
df = pd.DataFrame({'date': ['2000-01-02', '2000-01-02', '2000-01-15', '2000-01-27',
'2000-06-03', '2000-06-29', '2000-06-15', '2000-06-29',
'2001-01-02', '2001-01-02', '2001-01-15', '2001-01-27'],
'val':[5,7,None,4,
8,1,None,9,
2,3,None,7]})
# Convert to datetime and extract year and month:
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
# Create mean value:
tem = df.groupby(['year', 'month'])[['val']].mean().reset_index()
tem.rename(columns={'val': 'val_mean'}, inplace=True)
tem
# Merge and fill NA:
df = pd.merge(df, tem, how='left', on=['year', 'month'])
df.loc[df['val'].isna(),'val'] = df['val_mean']
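The answer says outliers can be handled the same way as NaNs but does not show it. Below is a minimal sketch of one way to do that per (year, month) group, using groupby(...).transform instead of a merge. The helper name replace_outliers_with_group_mean, the parameter k, and the sample values are all hypothetical. One deliberate difference from the question's code: the replacement mean here is computed after the outliers have been masked out, so an extreme value does not inflate the mean that replaces it.

```python
import numpy as np
import pandas as pd


def replace_outliers_with_group_mean(df, value_col, group_cols, k=1.5):
    """Return a copy of df where NaNs and IQR outliers in value_col
    are replaced by the mean of the remaining values in each group."""
    out = df.copy()
    grouped = out.groupby(group_cols)[value_col]

    # Per-group quartiles, broadcast back to the original row shape.
    q25 = grouped.transform(lambda s: s.quantile(0.25))
    q75 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q75 - q25
    lower = q25 - k * iqr
    upper = q75 + k * iqr

    # NaN compares as False, so missing values are not flagged here.
    is_outlier = (out[value_col] < lower) | (out[value_col] > upper)

    # Assumption: outliers are turned into NaN first, so the group mean
    # is based only on the values that are kept.
    clean = out[value_col].mask(is_outlier)
    group_mean = clean.groupby([out[c] for c in group_cols]).transform("mean")

    out[value_col] = clean.fillna(group_mean)
    return out


# Made-up sample data, using the column names from the question.
sample = pd.DataFrame({
    "year": [2020] * 10,
    "month": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "ib_units": [10, 12, 11, 13, 1000, np.nan, 5, 7, np.nan, 6],
})

result = replace_outliers_with_group_mean(sample, "ib_units", ["year", "month"])
```

If only the NaN filling is needed, the same transform idea reduces to df['val'] = df['val'].fillna(df.groupby(['year', 'month'])['val'].transform('mean')), which avoids the temporary val_mean column.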