Replace negative numbers, NaN and 0s with mean of next and previous positive number

I want to replace negative numbers, NaNs and 0s with the mean of the previous and next positive numbers in the same column.

Original dataframe

    a   c
0   1   1
1   2   2
2   0   5
3   -3  NaN
4   -1  5
5   3   3

Expected output dataframe is

    a    c
0   1     1
1   2     2
2   2.5   5    #In Col a --> Mean of 2 and 3 is 2.5 hence 0 replaced by 2.5
3   2.75  2.5  #In Col a --> Mean of 2.5 and 3 is 2.75 hence negative no. replaced by 2.75
4   2.875 5    #In Col a --> Mean of 2.75 and 3 is 2.875 hence negative no. replaced by 2.875
5   3     3
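
The rule in the comments above is recursive: each replaced value is the mean of the previous value (which may itself have just been filled in) and the next positive value. A minimal loop-based sketch of that rule follows. The function name `fill_with_running_mean` is made up for illustration, and it assumes that a run of invalid values with no valid neighbour on one side should simply come out as NaN.

```python
import numpy as np
import pandas as pd


def fill_with_running_mean(ser):
    """Loop-based reference (hypothetical helper, not a pandas API).

    Each invalid value (NaN, 0 or negative) becomes the mean of the previous,
    already filled value and the next valid value.
    """
    values = ser.astype(float).to_numpy(copy=True)
    invalid = np.isnan(values) | (values <= 0)
    # Next valid value for every position (NaN if there is none).
    next_valid = pd.Series(values).mask(invalid).bfill().to_numpy()
    for i in range(len(values)):
        if invalid[i]:
            # Assumption: without a previous value there is nothing to average.
            values[i] = (values[i - 1] + next_valid[i]) / 2 if i > 0 else np.nan
    return pd.Series(values, index=ser.index, name=ser.name)


df = pd.DataFrame({"a": [1, 2, 0, -3, -1, 3], "c": [1, 2, 5, np.nan, 5, 3]})
result = df.apply(fill_with_running_mean)
print(result)
```

For column a this gives 2.5, 2.75 and 2.875 in rows 2 to 4. For column c the NaN sits between 5 and 5, so the same rule yields 5.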

I tried another strategy for dealing with negative numbers, NaN and 0: replacing them with a rolling mean of the previous valid values (a window of 2 in the code below).

m = df['a'] < 1
new = (df.loc[~m, 'a'].astype(float)
         .rolling(2, min_periods=1).mean()
         .reindex(df.index, method='ffill'))

df['a'].mask(m, new)

which results in

0    1.0
1    2.0
2    1.5
3    1.5
4    1.5
5    2.0
Name: a, dtype: float64

However, I'm struggling to implement the strategy described at the top of the question.

CodePudding user response：

I edited my answer to better address your question. Note, however, that the mean of 5 and 5 is 5, not 2.5 as you wrote in your expected result.

This new answer is based on hpchavaz's answer below.

# Replace 0 and negative values with NaN
df = df.mask(df<=0)

# Compute rank of consecutive NaN values
rank = df.isnull().astype('int')
rank = rank.cumsum() - rank.cumsum().where(rank==0).ffill().fillna(0)
print(rank)

     a    b
0  0.0  0.0
1  0.0  0.0
2  1.0  0.0
3  2.0  1.0
4  3.0  0.0
5  0.0  0.0
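
The rank line is dense, so here is a step-by-step sketch of the same cumsum trick on a single made-up 0/1 series. The `flag` values and the intermediate names `running_total` and `baseline` are chosen here for illustration only.

```python
import pandas as pd

# 1 marks a value to replace, 0 marks a valid value (hypothetical data with two runs).
flag = pd.Series([0, 1, 1, 0, 1, 0])

# Running count of all flagged values seen so far: 0 1 2 2 3 3
running_total = flag.cumsum()

# Keep the running count only at valid positions, then carry it forward,
# so every run "remembers" the count just before it started: 0 0 0 2 2 3
baseline = running_total.where(flag == 0).ffill().fillna(0)

# Position inside the current run (0 for valid values): 0 1 2 0 1 0
rank = running_total - baseline
print(rank.tolist())
```

Subtracting the carried-forward baseline is what resets the counter at the start of every new run.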

# Nearest non-null value before (first) and after (last) each NaN run
first = df.ffill()
last = df.bfill()

# Finally, compute final df
df = last - (last-first)/2**(rank)
print(df)

       a    b
0  1.000  1.0
1  2.000  2.0
2  2.500  5.0
3  2.750  5.0
4  2.875  5.0
5  3.000  3.0
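
The closed form in the final step can be checked against the recursive rule from the question. The notation here is introduced only for this derivation: f is the previous valid value (`first`), l is the next valid value (`last`), and x_k is the k-th replaced value in a run, where k is the `rank`.

```latex
\begin{align*}
x_0 &= f, \qquad x_k = \frac{x_{k-1} + l}{2} \\
l - x_k &= \frac{l - x_{k-1}}{2} = \frac{l - f}{2^{k}} \\
x_k &= l - \frac{l - f}{2^{k}}
\end{align*}
```

Each step halves the remaining distance to l, which is where the power of two comes from. With f = 2 and l = 3 this gives 2.5, 2.75 and 2.875 for k = 1, 2, 3, matching column a. For k = 0 it returns f, so valid values are left unchanged.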

Previous answer

You can call mask to replace zero and negative values with NaN, then interpolate.

I'm not really sure why you expect the NaN in your second column to be replaced with 2.5 instead of 5, though...

>>> df.mask(df<=0).interpolate()
      a    b
0  1.00  1.0
1  2.00  2.0
2  2.25  5.0
3  2.50  5.0
4  2.75  5.0
5  3.00  3.0

CodePudding user response：

Heavily edited version

Define a function that returns a corrected version of a series. It works as follows:

build a mask of the values to be replaced (zero, negative or NaN): m
set the values to be replaced to NaN in the series
build a 0/1 series from the mask: tocorrect
build a series giving each value its rank within its run of values to be corrected: sequence_ranks
build the lower and upper bound series using ffill and bfill: lower_bound_values and upper_bound_values
compute the result from lower_bound_values, upper_bound_values and sequence_ranks

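import numpy as np

def correct_series_values(ser):
    # Work on a float copy so NaN can be stored and the caller's data is untouched
    ser = ser.astype(float)
    m = (ser<=0) | ser.isna()
    ser[m] = np.nan
    tocorrect = m.astype('int')
    # Rank of each value within its run of consecutive values to correct (0 for valid values)
    sequence_ranks = tocorrect.cumsum() - tocorrect.cumsum().where(tocorrect==0).ffill().fillna(0)
    lower_bound_values, upper_bound_values = ser.ffill(), ser.bfill()
    # Each step halves the remaining distance to the upper bound
    ser[m] = (lower_bound_values[m] - upper_bound_values[m]) / 2**(sequence_ranks[m]) + upper_bound_values[m]
    return ser

df['a'] = correct_series_values(df['a'])
df['c'] = correct_series_values(df['c'])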

DF:

        a     c
0   1.000   1.0
1   2.000   2.0
2   2.500   5.0
3   2.750   5.0
4   2.875   5.0
5   3.000   3.0