I have a dataframe with two columns, date and values. I want to replace the NaN values in the dataframe with mean values, but under a specific condition.
import pandas as pd
dates = ["2022-02-01", "2022-02-02", "2022-02-03", "2022-02-04", "2022-02-05", "2022-02-06",
         "2021-02-01", "2021-02-02", "2021-02-03", "2021-02-04", "2021-02-05", "2021-02-06"]
values = [3, 1, 6, 2, 5, 7, 3, None, 3, None, None, None]
df = pd.DataFrame({"date": dates,
                   "values": values})
df = df.sort_values(by = 'date', ascending = False)
print(df)
date values
0 2022-02-06 7.0
1 2022-02-05 5.0
2 2022-02-04 2.0
3 2022-02-03 6.0
4 2022-02-02 1.0
5 2022-02-01 3.0
6 2021-02-06 NaN
7 2021-02-05 NaN
8 2021-02-04 NaN
9 2021-02-03 3.0
10 2021-02-02 NaN
11 2021-02-01 3.0
Each NaN value should be replaced with the mean of the values from the same period (+/- 1 day) in the year that has values for it.
The value for 2021-02-04 should be (6 + 2 + 5) / 3 = 4.33, because the dates "2022-02-03", "2022-02-04" and "2022-02-05" have values of 6, 2 and 5.
I know how to fill NaN values with a plain mean, but I do not know how to handle the +/- one day window.
CodePudding user response:
Really hard to say what exactly you want to do, but given your data:
import pandas as pd
import numpy as np
dates = ["2022-02-01", "2022-02-02", "2022-02-03", "2022-02-04", "2022-02-05", "2022-02-06",
         "2021-02-01", "2021-02-02", "2021-02-03", "2021-02-04", "2021-02-05", "2021-02-06"]
values = [3, 1, 6, 2, 5, 7, 3, None, 3, None, None, None]
df = pd.DataFrame({"date": dates,
                   "values": values})
df = df.sort_values(by = 'date', ascending = False).reset_index(drop=True)
You can try something like this:
def process_data(dates, values):
    new_values = np.copy(values)
    # positions of the NaN entries that need to be filled
    indices = np.argwhere(np.isnan(values))
    # "MM-DD" strings, so that dates can be compared regardless of the year
    dates_without_year = ['-'.join(d.split('-')[1:]) for d in dates.astype(str)]
    for i, d in enumerate(dates):
        if i in indices:
            # the date itself plus one day before and one day after
            possible_dates = np.array(pd.Series(pd.date_range(d - pd.Timedelta(days=1), d + pd.Timedelta(days=1))), dtype='datetime64[D]')
            possible_dates = ['-'.join(d.split('-')[1:]) for d in possible_dates.astype(str)]
            # all values (from any year) whose month and day fall into that window
            mean_values = values[np.argwhere(np.isin(dates_without_year, possible_dates))]
            new_values[i] = np.mean(mean_values[~np.isnan(mean_values)])
    return new_values

df['values'] = process_data(np.array(df['date'].values, dtype='datetime64[D]'), df['values'].to_numpy())
date values
0 2022-02-06 7.000000
1 2022-02-05 5.000000
2 2022-02-04 2.000000
3 2022-02-03 6.000000
4 2022-02-02 1.000000
5 2022-02-01 3.000000
6 2021-02-06 6.000000
7 2021-02-05 4.666667
8 2021-02-04 4.000000
9 2021-02-03 3.000000
10 2021-02-02 3.200000
11 2021-02-01 3.000000
Take a close look at, for example, 2021-02-04, which had a NaN value. I disregard the years (as mentioned in the comments) and just look at the months and days, resulting in (6 + 2 + 5 + 3) / 4 = 4.0, since "2022-02-03", "2022-02-04", "2022-02-05", and "2021-02-03" have values of 6, 2, 5, and 3.
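If you would rather get the 4.33 from the question, the same-year value has to be left out of the mean. Here is a rough sketch of how that could look. The function name fill_from_neighbours and its parameters window_days and other_years_only are made up for this example, and it assumes the column names "date" and "values" from your data. It also does not handle windows that wrap around New Year.

```python
import numpy as np
import pandas as pd


def fill_from_neighbours(df, window_days=1, other_years_only=False):
    """Hypothetical helper: fill NaN in 'values' with the mean of the
    non-NaN values whose month/day lies within +/- window_days."""
    out = df.copy()
    dates = pd.to_datetime(out["date"])
    # Map every date onto one fixed leap year so that only month and day
    # matter. Assumption: no wrap-around between 31 Dec and 1 Jan.
    month_day = pd.to_datetime("2000-" + dates.dt.strftime("%m-%d"))
    vals = out["values"].to_numpy(dtype=float)
    filled = vals.copy()
    for i in np.flatnonzero(np.isnan(vals)):
        near = (month_day - month_day.iloc[i]).abs() <= pd.Timedelta(days=window_days)
        if other_years_only:
            # ignore rows from the same year as the missing value
            near &= dates.dt.year != dates.dt.year.iloc[i]
        candidates = vals[near.to_numpy()]
        candidates = candidates[~np.isnan(candidates)]
        if candidates.size:  # leave NaN in place if nothing is available
            filled[i] = candidates.mean()
    out["values"] = filled
    return out


dates = ["2022-02-01", "2022-02-02", "2022-02-03", "2022-02-04", "2022-02-05", "2022-02-06",
         "2021-02-01", "2021-02-02", "2021-02-03", "2021-02-04", "2021-02-05", "2021-02-06"]
values = [3, 1, 6, 2, 5, 7, 3, None, 3, None, None, None]
df = pd.DataFrame({"date": dates, "values": values})
df = df.sort_values(by="date", ascending=False).reset_index(drop=True)

any_year = fill_from_neighbours(df)                            # same logic as process_data
other_year = fill_from_neighbours(df, other_years_only=True)   # leaves out same-year values
print(other_year)
```

With other_years_only=True, 2021-02-04 becomes (6 + 2 + 5) / 3 = 4.33 as in the question. With the default settings it gives 4.0, the same as process_data above.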