I have a dataframe with a series of numbers. For example:
Index Column 1
1 10
2 12
3 24
4 NaN
5 15
6 NaN
7 NaN
I can't use bfill or ffill because the rule is dynamic: take the value from the row just before a run of NaNs and divide it by the number of consecutive NaNs + 1. For example, rows 3 and 4 should be replaced with 12 (24/2), and rows 5, 6 and 7 should be replaced with 5 (15/3).
How should I do that?
CodePudding user response:
You can do:
m = df['Column 1'].notna() & df['Column 1'].shift(-1).isna()
out = df.groupby(m.cumsum()).transform(
    lambda x: x.fillna(0).mean() if x.isna().any() else x)
print(out)
Index Column 1
0 1 10.0
1 2 12.0
2 3 12.0
3 4 12.0
4 5 5.0
5 6 5.0
6 7 5.0
Explanation and intermediate values:
Basically, look for the rows whose own value is not NaN but whose next value is NaN. Each such row is the first row of a group.
So m in the above code looks like:
0 False
1 False
2 True
3 False
4 True
5 False
6 False
Now I want to form groups of rows that look like [True, <all Falses>], because those are the groups I want to average. For that, use cumsum.
If you want to take a look at those groups, you can use ngroup() after groupby on m.cumsum():
0 0
1 0
2 1
3 1
4 2
5 2
6 2
The above only shows what the groups are.
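The intermediate values described so far can be reproduced with a short sketch. It assumes the sample data from the question is held in a DataFrame named df with a single 'Column 1'; the names s, group_ids and labels are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question (assumed layout: one column, default index).
df = pd.DataFrame({"Column 1": [10, 12, 24, np.nan, 15, np.nan, np.nan]})
s = df["Column 1"]

# True where the row has a value and the next row is NaN (start of a group).
m = s.notna() & s.shift(-1).isna()

# Running count of group starts: rows of one group share the same id.
group_ids = m.cumsum()

# ngroup() numbers the groups in key order; here it matches group_ids.
labels = df.groupby(group_ids).ngroup()

print(pd.DataFrame({"m": m, "cumsum": group_ids, "ngroup": labels}))
```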
Now, for each group, take the mean of the group if it contains any NaN value. The check for NaNs is x.isna().any().
If the group has any NaN value, assign the mean computed after filling NaN with 0; otherwise just keep the group as is. This is accomplished by the lambda:
lambda x: x.fillna(0).mean() if x.isna().any() else x
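As a cross-check of the rule from the question, here is a loop-free sketch that takes a slightly different route from the lambda above: it starts a new run at every non-NaN value, forward-fills, and divides by the run length. This is an alternative under that assumption, not the code from this answer, and the function name split_over_nan_runs is hypothetical.

```python
import numpy as np
import pandas as pd

def split_over_nan_runs(s: pd.Series) -> pd.Series:
    """Spread each value evenly over itself and the NaNs directly after it."""
    # Every non-NaN value starts a new run; following NaNs share its run id.
    run_id = s.notna().cumsum()
    # Run length = 1 + number of consecutive NaNs after the value.
    run_len = s.groupby(run_id).transform("size")
    # Forward-fill carries the starting value across its run, then divide.
    return s.ffill() / run_len

s = pd.Series([10, 12, 24, np.nan, 15, np.nan, np.nan], name="Column 1")
result = split_over_nan_runs(s)
print(result)
```

With this approach, NaNs at the very start of the column have no preceding value and stay NaN.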
CodePudding user response:
Why not use interpolate? It takes a method= argument, and one of its options would probably fit your needs.
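For comparison, a minimal sketch of what interpolate does on the sample data with its default method (linear). The values it produces differ from the divide-by-run-length rule in the question, so it is an alternative way of filling rather than an equivalent one.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 12, 24, np.nan, 15, np.nan, np.nan], name="Column 1")

# Default method is "linear": an interior NaN is placed on the straight line
# between its neighbours, and trailing NaNs repeat the last valid value.
interpolated = s.interpolate()
print(interpolated)
```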
However, if you really want to do as you described above, you can do something like this. (Note that iterating over rows in pandas is considered bad practice, but it does the job)
import pandas as pd
import numpy as np

df = pd.DataFrame([10, 12, 24, np.nan, 15, np.nan, np.nan])

for col in df:
    for idx in df.index:  # (iterating over rows is considered bad practice)
        local_idx = idx
        # move local_idx forward to the last NaN of the run that follows row idx
        while local_idx + 1 < len(df) and np.isnan(df.at[local_idx + 1, col]):
            local_idx += 1
        if (local_idx - idx) > 0:
            # split the value evenly over the row itself and the NaNs after it
            fillvalue = df.at[idx, col] / (local_idx - idx + 1)
            for fillidx in range(idx, local_idx + 1):
                df.at[fillidx, col] = fillvalue
df
Output:
0
0 10.0
1 12.0
2 12.0
3 12.0
4 5.0
5 5.0
6 5.0