Say I have the following dataframe:
    values
0        4
1        0
2        2
3        3
4        0
5        8
6        5
7        1
8        0
9        4
10       7
I want to find a vectorized pandas approach (preferably using groupby) that replaces every nonzero value with the first value of the run of consecutive nonzero values it belongs to, i.e. something that would give me
    values  new
0        4    4
1        0    0
2        2    2
3        3    2
4        0    0
5        8    8
6        5    8
7        1    8
8        0    0
9        4    4
10       7    4
Is there a good way of achieving this?
CodePudding user response:
Make a boolean mask that selects the rows containing zero and the rows immediately following them, then use this mask with where to replace the remaining values with NaN, then use forward fill to propagate the kept values downward.
m = df['values'].eq(0)
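# fill_value=False keeps the shifted mask boolean instead of object dtype with a leading NaN
df['new'] = df['values'].where(m | m.shift(fill_value=False)).ffill().fillna(df['values'])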
Result
print(df)
    values  new
0        4  4.0
1        0  0.0
2        2  2.0
3        3  2.0
4        0  0.0
5        8  8.0
6        5  8.0
7        1  8.0
8        0  0.0
9        4  4.0
10       7  4.0
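Since the question mentions groupby, here is a minimal sketch of that route. It is not part of the answer above, and the names is_zero and run_id are picked purely for illustration. The idea is to give every run of consecutive zero or nonzero rows its own label, then broadcast the first value of each run with transform('first').

```python
import pandas as pd

df = pd.DataFrame({'values': [4, 0, 2, 3, 0, 8, 5, 1, 0, 4, 7]})

# True where the value is zero
is_zero = df['values'].eq(0)

# A new run starts whenever the zero/nonzero state changes.
# diff() is NaN on the first row, and NaN != 0, so the first row also starts a run.
run_id = is_zero.astype(int).diff().ne(0).cumsum()

# Broadcast the first value of each run to every row in that run
df['new'] = df.groupby(run_id)['values'].transform('first')
print(df)
```

Because no NaN is introduced along the way, the new column should keep the integer dtype of the original. Runs of zeros are their own groups, so they simply stay zero.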
CodePudding user response:
The following function should do the job for you. Check the comments in the function to understand the workflow of the solution.
import pandas as pd

def ffill_nonZeros(values):
    # get the values that are not equal to 0
    # (use the argument itself rather than the global df)
    non_zero = values[values != 0]
    # get their indexes
    non_zero_idx = non_zero.index.to_series()
    # find where indexes are consecutive
    diff = non_zero_idx.diff()
    mask = diff == 1
    # using the mask, set every value that directly follows another nonzero value to NaN
    non_zero = non_zero.mask(mask)
    # fill forward (replace all NaN values with the previous valid value)
    new_non_zero = non_zero.ffill()
    # put the new values back at their indexes, restoring the original dtype
    new = values.copy()
    new.loc[new_non_zero.index] = new_non_zero.astype(values.dtype)
    return new
Now applying this function to your data:
df = pd.DataFrame([4, 0, 2, 3, 0, 8, 5, 1, 0, 4, 7], columns=['values'])
df['new'] = ffill_nonZeros(df['values'])
print(df)
Output:
    values  new
0        4    4
1        0    0
2        2    2
3        3    2
4        0    0
5        8    8
6        5    8
7        1    8
8        0    0
9        4    4
10       7    4
CodePudding user response:
Get the rows for zeros, and the rows immediately after:
zeros = df.index[df['values'].eq(0)]
after_zeros = zeros.union(zeros + 1)
Get the rows that need to be forward filled:
replace = df.index.difference(after_zeros)
replace = replace[replace > zeros[0]]
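Set the values at replace to NaN and forward fill:
import numpy as np
# cast to float up front so that NaN can be stored in the column
df['new'] = df['values'].astype(float)
df.loc[replace, 'new'] = np.nan
df['new'] = df['new'].ffill()
print(df)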
    values  new
0        4  4.0
1        0  0.0
2        2  2.0
3        3  2.0
4        0  0.0
5        8  8.0
6        5  8.0
7        1  8.0
8        0  0.0
9        4  4.0
10       7  4.0
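To make the index steps in this answer easier to follow, here is a small sketch that prints the intermediate index sets for the sample data. It assumes the default integer index (0, 1, 2, ...), which is what makes zeros + 1 mean "the next row"; with a different index that step would need adapting.

```python
import pandas as pd

df = pd.DataFrame({'values': [4, 0, 2, 3, 0, 8, 5, 1, 0, 4, 7]})

# Labels of the rows that hold a zero
zeros = df.index[df['values'].eq(0)]

# Zero rows plus the row right after each zero (the start of each later run)
after_zeros = zeros.union(zeros + 1)

# Everything else is a candidate for forward filling,
# except rows that come before the first zero
replace = df.index.difference(after_zeros)
replace = replace[replace > zeros[0]]

print(list(zeros))        # [1, 4, 8]
print(list(after_zeros))  # [1, 2, 4, 5, 8, 9]
print(list(replace))      # [3, 6, 7, 10]
```

Rows 3, 6, 7 and 10 are the ones that get blanked out and then filled from the row above them.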