I want to replace negative numbers, NaNs and 0s with the mean of the previous and the next positive number in the same column.
Original dataframe:
    a    c
0   1    1
1   2    2
2   0    5
3  -3  NaN
4  -1    5
5   3    3
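For reproducibility, a frame like the one above could be built as follows. This is only a sketch: it assumes plain integer values in column a and np.nan for the missing entry in column c.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame; the NaN makes column "c" float, "a" stays integer
df = pd.DataFrame({
    "a": [1, 2, 0, -3, -1, 3],
    "c": [1, 2, 5, np.nan, 5, 3],
})
print(df)
```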
Expected output dataframe:
       a    c
0      1    1
1      2    2
2    2.5    5  # col a: mean of 2 and 3 is 2.5, so 0 is replaced by 2.5
3   2.75  2.5  # col a: mean of 2.5 and 3 is 2.75, so -3 is replaced by 2.75
4  2.875    5  # col a: mean of 2.75 and 3 is 2.875, so -1 is replaced by 2.875
5      3    3
I tried another strategy for dealing with negative numbers, NaN and 0: replacing them with the mean of the previous 3 values.
m = df['a'] < 1
new = (df.loc[~m, 'a'].astype(float)
         .rolling(2, min_periods=1).mean()
         .reindex(df.index, method='ffill'))
df['a'].mask(m, new)
which results in
0 1.0
1 2.0
2 1.5
3 1.5
4 1.5
5 2.0
Name: a, dtype: float64
However, I'm struggling to implement the strategy described at the top of this question.
CodePudding user response:
I edited my answer to better address your question. Note, however, that the mean of 5 and 5 is 5, not 2.5 as you wrote in your expected result.
This new answer is based on hpchavaz's answer below.
# Replace 0 and negative values with NaN
df = df.mask(df<=0)
# Compute rank of consecutive NaN values
rank = df.isnull().astype('int')
rank = rank.cumsum() - rank.cumsum().where(rank==0).ffill().fillna(0)
print(rank)
a b
0 0.0 0.0
1 0.0 0.0
2 1.0 0.0
3 2.0 1.0
4 3.0 0.0
5 0.0 0.0
# Compute first and last non null value before NaN range
first = df.ffill()
last = df.bfill()
# Finally, compute final df
df = last - (last-first)/2**(rank)
print(df)
a b
0 1.000 1.0
1 2.000 2.0
2 2.500 5.0
3 2.750 5.0
4 2.875 5.0
5 3.000 3.0
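Why the closed form works: each replaced value is the mean of the previous (already replaced) value and last, so the gap to last halves at every step, which gives last - (last - first) / 2**rank. Below is a minimal sketch that checks this against a naive loop. The helper names fill_closed_form and fill_iterative are made up for illustration, and the loop assumes every run of invalid values has a positive value on both sides.

```python
import numpy as np
import pandas as pd

def fill_closed_form(df):
    """Vectorised version of the steps above (helper name is illustrative)."""
    df = df.astype(float).mask(df <= 0)          # zeros/negatives -> NaN
    rank = df.isnull().astype(int)
    # Position of each NaN inside its consecutive run: 1, 2, 3, ...
    rank = rank.cumsum() - rank.cumsum().where(rank == 0).ffill().fillna(0)
    first, last = df.ffill(), df.bfill()
    return last - (last - first) / 2 ** rank

def fill_iterative(ser):
    """Naive reference loop: mean of the previous value and the next positive one."""
    out = ser.astype(float).tolist()
    valid = [v > 0 for v in out]                 # NaN > 0 is False
    for i, ok in enumerate(valid):
        if ok:
            continue
        # Assumes a positive value exists both before and after position i
        nxt = next(out[j] for j in range(i + 1, len(out)) if valid[j])
        out[i] = (out[i - 1] + nxt) / 2
    return pd.Series(out, index=ser.index)

df = pd.DataFrame({"a": [1, 2, 0, -3, -1, 3], "c": [1, 2, 5, np.nan, 5, 3]})
closed = fill_closed_form(df)
looped = df.apply(fill_iterative)                # apply column by column
print(np.allclose(closed, looped))
```

The loop is only there as a readable reference; the vectorised version is the one to use on real data.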
Previous answer
You can call mask to replace zero and negative values with NaN, then interpolate.
I'm not really sure why you expect the NaN in your second column to be replaced with 2.5 instead of 5, though...
>>> df.mask(df<=0).interpolate()
a b
0 1.00 1.0
1 2.00 2.0
2 2.25 5.0
3 2.50 5.0
4 2.75 5.0
5 3.00 3.0
CodePudding user response:
Heavily edited version
Define a function that returns a corrected version of a series:
- build a mask of the values to be replaced: m
- set those values to NaN in the series
- build a 0/1 series from the mask: tocorrect
- build a series giving the rank of each value within its run of values to be corrected: sequence_ranks
- build the lower and upper bound series using ffill and bfill: lower_bound_values and upper_bound_values
- compute the result from lower_bound_values, upper_bound_values and sequence_ranks
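import numpy as np

def correct_series_values(ser):
    # Flag the values to replace: zeros, negatives and NaNs
    m = (ser <= 0) | ser.isna()
    # Work on a float copy so NaN can be stored and the input is not modified
    ser = ser.astype(float)
    ser[m] = np.nan
    tocorrect = m.astype('int')
    # Rank of each flagged value within its consecutive run (1, 2, 3, ...)
    sequence_ranks = tocorrect.cumsum() - tocorrect.cumsum().where(tocorrect == 0).ffill().fillna(0)
    lower_bound_values, upper_bound_values = ser.ffill(), ser.bfill()
    # The gap to the upper bound halves at each step of the run
    ser[m] = (lower_bound_values[m] - upper_bound_values[m]) / 2**(sequence_ranks[m]) + upper_bound_values[m]
    return ser

df['a'] = correct_series_values(df['a'])
df['c'] = correct_series_values(df['c'])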
DF:
a c
0 1.000 1.0
1 2.000 2.0
2 2.500 5.0
3 2.750 5.0
4 2.875 5.0
5 3.000 3.0