I am trying to add a new column to a dataframe that contains the index of the last occurrence of a value in a certain column.
Say we have the dataframe df
   0
0  1
1  5
2  4
3  1
4  7
5  9
I want to add a second column that contains, for each row, the index of the most recent occurrence of 1 in column 0 at or before that row.
So the desired output becomes:
   0  1
0  1  0
1  5  0
2  4  0
3  1  3
4  7  3
5  9  3
I have a working solution with a loop:
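for i in df.index:
    # Slice of all rows up to and including row i
    sub_df = df[0: i + 1]
    # Index of the last row in the slice where column 0 equals 1
    recent_1_index = sub_df[(sub_df[0] == 1)].index[-1]
    df.at[i, 1] = recent_1_index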
However, this solution is slow for a large dataframe, as I believe it runs in O(n*m), with n being the length of df and m the length of sub_df (which itself grows up to n).
Does anyone have a solution that does not use the loop and is significantly faster?
Thanks!
CodePudding user response:
You can take the index values of the rows where column 0 equals 1, convert the other values to NaN, and finally forward fill the missing values (if the first rows have no match, they are left as NaN):
df[4] = df.index.to_series().where(df[0].eq(1)).ffill()
print (df)
   0  1  2  4
0  1  2  1  0
1  4  7  7  0
2  4  3  9  0
3  1  9  4  3
4  4  2  6  3
5  4  5  9  3
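One thing to be aware of: because the unmatched rows pass through NaN, the forward-filled result is float64 (0.0, 3.0, ...). Below is a minimal sketch of one way to get integers back, assuming the single-column data from the question. The column label 1 and the variable name last_idx are only choices for illustration. The nullable Int64 dtype should keep any leading rows without a match as <NA> rather than raising an error:

```python
import pandas as pd

# Sample data from the question: one column labelled 0.
df = pd.DataFrame({0: [1, 5, 4, 1, 7, 9]})

# Index where column 0 equals 1, NaN elsewhere, then carried forward.
last_idx = df.index.to_series().where(df[0].eq(1)).ffill()

# Nullable integer dtype: integers where a match exists, <NA> otherwise.
df[1] = last_idx.astype("Int64")

print(df)
```

If every row is guaranteed to have a match at or before it, a plain astype(int) would also work.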
Explanation:
print (df[0].eq(1))
0 True
1 False
2 False
3 True
4 False
5 False
dtype: bool
print (df.index.to_series().where(df[0].eq(1)))
0 0.0
1 NaN
2 NaN
3 3.0
4 NaN
5 NaN
dtype: float64
print (df.index.to_series().where(df[0].eq(1)).ffill())
0 0.0
1 0.0
2 0.0
3 3.0
4 3.0
5 3.0
dtype: float64