Replacing NaN values in timeseries Pandas dataframe with mean values

Time:02-14

I have a dataframe with two columns, date and values. I want to replace the NaN values in the dataframe with mean values, but under a specific condition.

import pandas as pd

dates = ["2022-02-01", "2022-02-02", "2022-02-03", "2022-02-04", "2022-02-05", "2022-02-06",
         "2021-02-01", "2021-02-02", "2021-02-03", "2021-02-04", "2021-02-05", "2021-02-06"]
         
values = [3,1,6,2,5,7,3, None, 3, None, None, None]         

df = pd.DataFrame({"date": dates,
                   "values": values
                    })

df = df.sort_values(by = 'date', ascending = False).reset_index(drop=True)
print(df)

          date  values
0   2022-02-06     7.0
1   2022-02-05     5.0
2   2022-02-04     2.0
3   2022-02-03     6.0
4   2022-02-02     1.0
5   2022-02-01     3.0
6   2021-02-06     NaN
7   2021-02-05     NaN
8   2021-02-04     NaN
9   2021-02-03     3.0
10  2021-02-02     NaN
11  2021-02-01     3.0

Each NaN value should be replaced with the mean of the values from the same period (+/- 1 day) in the year that has values.

Value for 2021-02-04 should be:

(6 + 2 + 5) / 3 = 4.33

Because dates "2022-02-03", "2022-02-04", "2022-02-05" have values of 6,2 and 5.
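As a quick check of that arithmetic, here is a minimal sketch that shifts the target date into 2022 and averages the values within one day of it. The names `target`, `center`, `window`, and `expected_fill` are only illustrative, and the sketch assumes the year with data is exactly one year after the missing date.

```python
import pandas as pd

# Only the three 2022 rows that matter for this one example
df_2022 = pd.DataFrame({
    "date": pd.to_datetime(["2022-02-03", "2022-02-04", "2022-02-05"]),
    "values": [6.0, 2.0, 5.0],
})

target = pd.Timestamp("2021-02-04")
# Assumption: the year that has data is the following year (2022)
center = target + pd.DateOffset(years=1)

# Keep rows within +/- 1 day of the shifted date
in_window = (df_2022["date"] >= center - pd.Timedelta(days=1)) & \
            (df_2022["date"] <= center + pd.Timedelta(days=1))
window = df_2022[in_window]

expected_fill = window["values"].mean()
print(round(expected_fill, 2))  # 4.33
```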

I know how to fill NaN values with a plain mean, but I do not know how to handle the +/- one day window.

CodePudding user response:

Really hard to say what exactly you want to do, but given your data:

import pandas as pd
import numpy as np

dates = ["2022-02-01", "2022-02-02", "2022-02-03", "2022-02-04", "2022-02-05", "2022-02-06",
         "2021-02-01", "2021-02-02", "2021-02-03", "2021-02-04", "2021-02-05", "2021-02-06"]
         
values = [3,1,6,2,5,7,3, None, 3, None, None, None]         

df = pd.DataFrame({"date": dates,
                   "values": values
                    })
df = df.sort_values(by = 'date', ascending = False).reset_index(drop=True)

You can try something like this:

def process_data(dates, values):
  new_values = np.copy(values)
  # Positions of the NaN entries that need filling
  indices = np.argwhere(np.isnan(values))
  # Month-day part of every date, so the year is ignored when matching
  dates_without_year = ['-'.join(d.split('-')[1:]) for d in dates.astype(str)]
  for i, d in enumerate(dates):
    if i in indices:
      # The day before, the day itself, and the day after
      possible_dates = np.array(pd.Series(pd.date_range(d - pd.Timedelta(days=1), d + pd.Timedelta(days=1))), dtype='datetime64[D]')
      possible_dates = ['-'.join(d.split('-')[1:]) for d in possible_dates.astype(str)]
      # All values (from any year) whose month-day falls in that window
      mean_values = values[np.argwhere(np.isin(dates_without_year, possible_dates))]
      new_values[i] = np.mean(mean_values[~np.isnan(mean_values)])
  return new_values

df['values'] = process_data(np.array(df['date'].values, dtype='datetime64[D]'), df['values'].to_numpy())
print(df)

          date    values
0   2022-02-06  7.000000
1   2022-02-05  5.000000
2   2022-02-04  2.000000
3   2022-02-03  6.000000
4   2022-02-02  1.000000
5   2022-02-01  3.000000
6   2021-02-06  6.000000
7   2021-02-05  4.666667
8   2021-02-04  4.000000
9   2021-02-03  3.000000
10  2021-02-02  3.200000
11  2021-02-01  3.000000

Take a close look at, for example, 2021-02-04, which had a NaN value. I disregard the years (as mentioned in the comments) and look only at the months and days, which gives (6 + 2 + 5 + 3) / 4 = 4.0, since "2022-02-03", "2022-02-04", "2022-02-05", and "2021-02-03" have values of 6, 2, 5, and 3. Note that this differs from the 4.33 in the question because the known value from 2021-02-03 is included in the mean.
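If only values from a different year should count, as in the question's 4.33 example, one possible variation is sketched below. The helper `fill_from_other_years` and its `days` parameter are hypothetical names for illustration, not part of pandas. The sketch assumes the date column is converted with `pd.to_datetime` and that a NaN with no neighbours in any other year should simply stay NaN.

```python
import pandas as pd

dates = ["2022-02-01", "2022-02-02", "2022-02-03", "2022-02-04", "2022-02-05", "2022-02-06",
         "2021-02-01", "2021-02-02", "2021-02-03", "2021-02-04", "2021-02-05", "2021-02-06"]
values = [3, 1, 6, 2, 5, 7, 3, None, 3, None, None, None]

df = pd.DataFrame({"date": pd.to_datetime(dates), "values": values})
df = df.sort_values(by="date", ascending=False).reset_index(drop=True)


def fill_from_other_years(frame, days=1):
    """Hypothetical helper: fill each NaN with the mean of the known values
    that lie within +/- `days` of the same calendar date in other years."""
    known = frame.dropna(subset=["values"])
    filled = frame["values"].copy()
    for idx in frame.index[frame["values"].isna()]:
        target = frame.at[idx, "date"]
        # Only rows from a different year than the missing date
        others = known[known["date"].dt.year != target.year]
        pool = []
        for year, group in others.groupby(others["date"].dt.year):
            # Shift the missing date into this group's year
            center = target + pd.DateOffset(years=int(year) - target.year)
            near = (group["date"] - center).abs() <= pd.Timedelta(days=days)
            pool.extend(group.loc[near, "values"].tolist())
        if pool:  # leave NaN in place when nothing qualifies
            filled.at[idx] = sum(pool) / len(pool)
    return filled


df["values"] = fill_from_other_years(df)
print(df)
```

With the sample data this fills 2021-02-04 with (6 + 2 + 5) / 3 = 4.33 and 2021-02-02 with (3 + 1 + 6) / 3 = 3.33, because the known 2021 values are no longer part of the mean.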
