Home > Net >  Pandas DataFrame : Using Pandas Replace NaN Values with Average of above 3 rows
Pandas DataFrame : Using Pandas Replace NaN Values with Average of above 3 rows

Time:08-09

I am having Pandas DataFrame in that I have to replace NaN values from the Order_Values column by applying mean for the above 3 rows where NaN values are present by groupby Id column

Input Dataframe   
        Id     order_values
       1002       45
       1002       36
       1002       18
       1002       NaN    
       1002       NaN   
       1002       72
       1003       68
       1003       54
       1003       45
       1003       NaN
       1003       NaN
       1004       14
       1004       50
       1004       27
       1004       NaN


Output Dataframe   
        Id     order_values
       1002       45
       1002       36
       1002       18
       1002       33    [Calculation 18 36 45=99/3= 33 ]
       1002       29   [Calculation 33 18 36=99/3= 29 ]
       1002       72
       1003       68
       1003       54
       1003       45
       1003       55.6  [Calculation 45 54 68=167/3= 55.6  ]
       1003       51.53  [Calculation 45 54 68=154.6/3= 51.53 ]
       1004       14
       1004       50
       1004       27
       1004       30.33   [Calculation 27 50 14=91/3= 30.33 ]

CodePudding user response:

Something like this?:

df = pd.DataFrame({
    'id':[ 1002 ,1002 ,1002 ,1002, 1002,1002 ,1003 ,1003 ,1003 ,1003, 1003, 1004, 1004, 1004, 1004],
    'order_values': [45, 36, 18, None ,None, 72, 68, 54, 45, None, None, 14, 50, 27, None]
})


seq = list(range(len(df)))
window_size = 3

df['result'] = df['order_values']
df['seq'] = None
for i, s in enumerate(range(len(seq) - window_size)):
    df.iloc[i   window_size, 2] = df.iloc[seq[s: s   window_size], 1].mean()
    df.iloc[i   window_size, 3] = str(seq[s: s   window_size])

df:

    id  order_values  result  seq
0   1002    45.0    45.00   None
1   1002    36.0    36.00   None
2   1002    18.0    18.00   None
3   1002    NaN     33.00   [0, 1, 2]
4   1002    NaN     27.00   [1, 2, 3]
5   1002    72.0    18.00   [2, 3, 4]
6   1003    68.0    72.00   [3, 4, 5]
7   1003    54.0    70.00   [4, 5, 6]
8   1003    45.0    64.67   [5, 6, 7]
9   1003    NaN     55.67   [6, 7, 8]
10  1003    NaN     49.50   [7, 8, 9]
11  1004    14.0    45.00   [8, 9, 10]
12  1004    50.0    14.00   [9, 10, 11]
13  1004    27.0    32.00   [10, 11, 12]
14  1004    NaN     30.33   [11, 12, 13]

CodePudding user response:

You can do like this:

s = df['order_values'].copy()
for i in range(3, len(s)):
    s.iloc[i] = s.iloc[i-3:i].mean() if pd.isna(s.iloc[i]) else s.iloc[i]

df['order_values'] = s

print(df):

      Id  order_values
0   1002     45.000000
1   1002     36.000000
2   1002     18.000000
3   1002     33.000000
4   1002     29.000000
5   1002     72.000000
6   1003     68.000000
7   1003     54.000000
8   1003     45.000000
9   1003     55.666667
10  1003     51.555556
11  1004     14.000000
12  1004     50.000000
13  1004     27.000000
14  1004     30.333333

CodePudding user response:

you could try using shift

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "id": [1002, 1002, 1002, 1002, 1002, 1002, 1003, 1003, 1003, 1003, 1003, 1004, 1004, 1004, 1004],
        "order_values": [45, 36, 18, None, None, 72, 68, 54, 45, None, None, 14, 50, 27, None],
    }
)
order = df["order_values"]
mask = order.isnull()

a = np.array([order[~mask].shift(periods=i) for i in range(1, 4)]).T[3:-2]

df.loc[mask, "order_values"] = np.sum(a, axis=1)
print(df)
      id  order_values
0   1002          45.0
1   1002          36.0
2   1002          18.0
3   1002          99.0
4   1002         126.0
5   1002          72.0
6   1003          68.0
7   1003          54.0
8   1003          45.0
9   1003         158.0
10  1003         194.0
11  1004          14.0
12  1004          50.0
13  1004          27.0
14  1004         167.0
  • Related