Is there a way to add a column that indicates the next index that meets some condition (e.g. the first index where a future row's val is greater than the current row's val) using a vectorized approach?
I found a number of examples that show how to do this with a fixed value, such as getting the next index where a column is greater than 0, but I want to do this for every row, based on that row's own value.
Here's an example of doing this with a simple loop. I'm curious whether there's a pandas/vectorized approach that does the same:
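import pandas as pd

df = pd.DataFrame([0, 2, 3, 2, 3, 4, 5, 6, 5, 4, 7, 8, 7, 2, 3], columns=['val'], index=pd.date_range('20220101', periods=15))

def add_new_highs(df):
    df['new_high'] = pd.NaT
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for i, v in df['val'].items():
        later = df.loc[i:, 'val']          # this row and every row after it
        row = later[later > v].head(1)     # first later row with a larger val
        if len(row) > 0:
            df.loc[i, 'new_high'] = row.index[0]   # single .loc call avoids chained assignment

add_new_highs(df)
print(df)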
Output:
            val   new_high
2022-01-01    0 2022-01-02
2022-01-02    2 2022-01-03
2022-01-03    3 2022-01-06
2022-01-04    2 2022-01-05
2022-01-05    3 2022-01-06
2022-01-06    4 2022-01-07
2022-01-07    5 2022-01-08
2022-01-08    6 2022-01-11
2022-01-09    5 2022-01-11
2022-01-10    4 2022-01-11
2022-01-11    7 2022-01-12
2022-01-12    8        NaT
2022-01-13    7        NaT
2022-01-14    2 2022-01-15
2022-01-15    3        NaT
CodePudding user response:
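One option is to use NumPy broadcasting. Comparing the values as a column vector against the same values as a row vector gives an n x n boolean array whose entry (i, j) says whether row j's value is greater than row i's. Since we only want rows that come after the current one, we only need the upper triangle of that array, so we use np.triu. Since we need the first such row, we use argmax, which returns the position of the first True in each row. Finally, for some rows there may never be a greater value; argmax returns 0 for those, so we replace them with NaT using where: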
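import numpy as np
# (n, 1) < (n,) broadcasts to (n, n); triu keeps the current and later rows; argmax finds the first True per row
df['new_high'] = df.index[np.triu(df[['val']].to_numpy() < df['val'].to_numpy()).argmax(axis=1)]
# rows with no larger value got position 0 (the first date), so keep only dates later than the row's own date
df['new_high'] = df['new_high'].where(lambda x: x.index < x)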
Output:
            val   new_high
2022-01-01    0 2022-01-02
2022-01-02    2 2022-01-03
2022-01-03    3 2022-01-06
2022-01-04    2 2022-01-05
2022-01-05    3 2022-01-06
2022-01-06    4 2022-01-07
2022-01-07    5 2022-01-08
2022-01-08    6 2022-01-11
2022-01-09    5 2022-01-11
2022-01-10    4 2022-01-11
2022-01-11    7 2022-01-12
2022-01-12    8        NaT
2022-01-13    7        NaT
2022-01-14    2 2022-01-15
2022-01-15    3        NaT
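To see what each step contributes, here is a small sketch of the same idea on a four-element array. The toy values and the variable names (vals, greater, first, found) are mine and are not part of the answer above; the any() check is just one possible way to flag rows that have no later, larger value.

```python
import numpy as np

vals = np.array([3, 1, 4, 2])  # toy data, chosen only for illustration

# greater[i, j] is True when vals[j] > vals[i] (column vector vs. row vector)
greater = vals[:, None] < vals

# keep only positions j >= i, i.e. the current row and everything after it
greater = np.triu(greater)

# argmax returns the position of the first True in each row,
# and 0 when the row contains no True at all
first = greater.argmax(axis=1)

# True where some later row really does hold a larger value
found = greater.any(axis=1)

print(first)
print(found)
```

Here the first two rows find their next larger value at position 2, while the last two rows have none, which is why the 0 returned by argmax for them has to be masked out afterwards.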
CodePudding user response:
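Similar to @enke's response, but building the comparison matrix step by step:
import numpy as np
vals = df[['val']].to_numpy() # column vector of shape (n, 1); selecting 'val' keeps other columns out
arr = np.repeat(vals, len(df), axis=1) # make a matrix where arr[i, j] is row i's value
arr = arr > vals.T # make bool array: is row i's value larger than row j's?
arr = np.tril(arr) # remove rows before you (comparing first keeps this correct for negative values)
ind = np.argmax(arr, axis=0) # get first larger value position per column, 0 if there is none
df['new_high'] = df.iloc[ind].index # use index as new column
df['new_high'] = df['new_high'].replace({df.index[0]: pd.NaT}) # replace rows with no larger value with NaT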
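As an aside that neither answer raises: both broadcasting approaches build an n x n array, so memory grows quadratically with the number of rows. If that becomes a problem, one possible alternative is a monotonic stack, which is a different technique and not vectorized, but visits each row only a constant number of times. The sketch below is my own; next_greater_index is a hypothetical helper name, and filling gaps with NaT assumes a datetime index like the one in the question.

```python
import pandas as pd

def next_greater_index(s):
    """For each row of s, return the index label of the first later row with a larger value."""
    values = s.to_numpy()
    result = [pd.NaT] * len(values)   # NaT assumes a datetime index, as in the question
    stack = []                        # positions still waiting for a larger value
    for pos, v in enumerate(values):
        # every waiting position with a smaller value has now found its answer
        while stack and values[stack[-1]] < v:
            result[stack.pop()] = s.index[pos]
        stack.append(pos)
    return pd.Series(result, index=s.index)

df = pd.DataFrame([0, 2, 3, 2, 3, 4, 5, 6, 5, 4, 7, 8, 7, 2, 3], columns=['val'],
                  index=pd.date_range('20220101', periods=15))
df['new_high'] = next_greater_index(df['val'])
print(df)
```

On the question's data this should give the same new_high column as the loop; for small frames the broadcasting versions are likely to be simpler and fast enough.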