I am having Pandas DataFrame in that I have to replace NaN values from the Order_Values column by applying mean for the above 3 rows where NaN values are present by groupby Id column
Input Dataframe
Id order_values
1002 45
1002 36
1002 18
1002 NaN
1002 NaN
1002 72
1003 68
1003 54
1003 45
1003 NaN
1003 NaN
1004 14
1004 50
1004 27
1004 NaN
Output Dataframe
Id order_values
1002 45
1002 36
1002 18
1002 33 [Calculation 18 36 45=99/3= 33 ]
1002 29 [Calculation 33 18 36=99/3= 29 ]
1002 72
1003 68
1003 54
1003 45
1003 55.6 [Calculation 45 54 68=167/3= 55.6 ]
1003 51.53 [Calculation 45 54 68=154.6/3= 51.53 ]
1004 14
1004 50
1004 27
1004 30.33 [Calculation 27 50 14=91/3= 30.33 ]
CodePudding user response:
Something like this?:
df = pd.DataFrame({
'id':[ 1002 ,1002 ,1002 ,1002, 1002,1002 ,1003 ,1003 ,1003 ,1003, 1003, 1004, 1004, 1004, 1004],
'order_values': [45, 36, 18, None ,None, 72, 68, 54, 45, None, None, 14, 50, 27, None]
})
seq = list(range(len(df)))
window_size = 3
df['result'] = df['order_values']
df['seq'] = None
for i, s in enumerate(range(len(seq) - window_size)):
df.iloc[i window_size, 2] = df.iloc[seq[s: s window_size], 1].mean()
df.iloc[i window_size, 3] = str(seq[s: s window_size])
df:
id order_values result seq
0 1002 45.0 45.00 None
1 1002 36.0 36.00 None
2 1002 18.0 18.00 None
3 1002 NaN 33.00 [0, 1, 2]
4 1002 NaN 27.00 [1, 2, 3]
5 1002 72.0 18.00 [2, 3, 4]
6 1003 68.0 72.00 [3, 4, 5]
7 1003 54.0 70.00 [4, 5, 6]
8 1003 45.0 64.67 [5, 6, 7]
9 1003 NaN 55.67 [6, 7, 8]
10 1003 NaN 49.50 [7, 8, 9]
11 1004 14.0 45.00 [8, 9, 10]
12 1004 50.0 14.00 [9, 10, 11]
13 1004 27.0 32.00 [10, 11, 12]
14 1004 NaN 30.33 [11, 12, 13]
CodePudding user response:
You can do like this:
s = df['order_values'].copy()
for i in range(3, len(s)):
s.iloc[i] = s.iloc[i-3:i].mean() if pd.isna(s.iloc[i]) else s.iloc[i]
df['order_values'] = s
print(df):
Id order_values
0 1002 45.000000
1 1002 36.000000
2 1002 18.000000
3 1002 33.000000
4 1002 29.000000
5 1002 72.000000
6 1003 68.000000
7 1003 54.000000
8 1003 45.000000
9 1003 55.666667
10 1003 51.555556
11 1004 14.000000
12 1004 50.000000
13 1004 27.000000
14 1004 30.333333
CodePudding user response:
you could try using shift
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"id": [1002, 1002, 1002, 1002, 1002, 1002, 1003, 1003, 1003, 1003, 1003, 1004, 1004, 1004, 1004],
"order_values": [45, 36, 18, None, None, 72, 68, 54, 45, None, None, 14, 50, 27, None],
}
)
order = df["order_values"]
mask = order.isnull()
a = np.array([order[~mask].shift(periods=i) for i in range(1, 4)]).T[3:-2]
df.loc[mask, "order_values"] = np.sum(a, axis=1)
print(df)
id order_values
0 1002 45.0
1 1002 36.0
2 1002 18.0
3 1002 99.0
4 1002 126.0
5 1002 72.0
6 1003 68.0
7 1003 54.0
8 1003 45.0
9 1003 158.0
10 1003 194.0
11 1004 14.0
12 1004 50.0
13 1004 27.0
14 1004 167.0