Python Pandas, add column containing first index where future column value is greater than current r

Time:04-29

Is there a way to add a column that indicates the next index that meets some condition (e.g. the first index where a future row's val is greater than the current row's val) using a vectorized approach?

I found a number of examples that show how to do this using a fixed value, such as getting the next index where a column is greater than 0, but I want to do this for every row based on that row's own value.

Here's an example of doing this with a simple loop. I'm curious whether there's a Pandas/vectorized approach that does the same:

import pandas as pd

df = pd.DataFrame( [0,2,3,2,3,4,5,6,5,4,7,8,7,2,3], columns=['val'], index=pd.date_range('20220101', periods=15))

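def add_new_highs(df):
    # Start with NaT everywhere; rows with no later, larger value keep it.
    df['new_high'] = pd.NaT
    for i, v in df['val'].items():  # Series.iteritems() was removed in pandas 2.0
        later = df.loc[i:, 'val']  # current row and everything after it
        higher = later[later > v]  # later rows with a strictly larger value
        if len(higher) > 0:
            df.loc[i, 'new_high'] = higher.index[0]  # single .loc call avoids chained assignment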

add_new_highs(df)
print(df)

Output:

            val   new_high
2022-01-01    0 2022-01-02
2022-01-02    2 2022-01-03
2022-01-03    3 2022-01-06
2022-01-04    2 2022-01-05
2022-01-05    3 2022-01-06
2022-01-06    4 2022-01-07
2022-01-07    5 2022-01-08
2022-01-08    6 2022-01-11
2022-01-09    5 2022-01-11
2022-01-10    4 2022-01-11
2022-01-11    7 2022-01-12
2022-01-12    8        NaT
2022-01-13    7        NaT
2022-01-14    2 2022-01-15
2022-01-15    3        NaT

CodePudding user response:

One option is to use NumPy broadcasting. Comparing val as a column vector against val as a row vector gives a square boolean matrix whose entry (i, j) is True when row j's value is greater than row i's. Since we only want indices that come after the current one, we only need the upper triangle of that matrix, so we use np.triu. Since we need the first such index, we use argmax along each row. Finally, some rows never see a greater value; argmax returns 0 for those, so we replace them with NaT using where:

import numpy as np
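# (n, 1) < (n,) broadcasts to an n x n matrix; triu keeps only columns at or after each row
df['new_high'] = df.index[np.triu(df[['val']].to_numpy() < df['val'].to_numpy()).argmax(axis=1)]
# argmax returns 0 when a row has no match, so keep only results that lie after the row's own index
df['new_high'] = df['new_high'].where(lambda x: x.index < x)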

Output:

            val   new_high
2022-01-01    0 2022-01-02
2022-01-02    2 2022-01-03
2022-01-03    3 2022-01-06
2022-01-04    2 2022-01-05
2022-01-05    3 2022-01-06
2022-01-06    4 2022-01-07
2022-01-07    5 2022-01-08
2022-01-08    6 2022-01-11
2022-01-09    5 2022-01-11
2022-01-10    4 2022-01-11
2022-01-11    7 2022-01-12
2022-01-12    8        NaT
2022-01-13    7        NaT
2022-01-14    2 2022-01-15
2022-01-15    3        NaT
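
To see what the broadcasting step builds, here is a minimal sketch on a four-element array. The values and the names vals, mask, first and has_higher are made up for illustration and are not part of the answer above.

```python
import numpy as np

vals = np.array([2, 3, 1, 3])  # illustrative values only

# A (4, 1) column against a (4,) row broadcasts to a (4, 4) matrix:
# mask[i, j] is True when vals[i] < vals[j].
mask = vals[:, None] < vals

# Keep only columns j >= i; the diagonal is already False because the
# comparison is strict.
mask = np.triu(mask)

# argmax gives the position of the first True in each row, or 0 if the
# row contains no True at all.
first = mask.argmax(axis=1)

# Rows with no True are exactly the rows with no later, larger value.
has_higher = mask.any(axis=1)

print(first)       # [1 0 3 0]
print(has_higher)  # [ True False  True False]
```

The two zeros in first come from rows with no match, which is why the answer needs a final masking step before the result can be trusted.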

CodePudding user response:

Similar to @enke's response, but building the comparison matrix explicitly:

import numpy as np
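vals = df[['val']].to_numpy()  # shape (n, 1); select 'val' explicitly so an existing 'new_high' column is ignored
arr = np.repeat(vals, len(df), axis=1)  # make a matrix: arr[i, j] = val[i]
arr = arr > vals.T  # make bool array of larger values: val[i] > val[j]
arr = np.tril(arr)  # remove rows before j; masking after comparing keeps negative values correct
ind = np.argmax(arr, axis=0)  # position of the first larger later value (0 if there is none)

df['new_high'] = df.iloc[ind].index  # look up the index labels at those positions
df['new_high'] = df['new_high'].replace({df.index[0]: pd.NaT})  # position 0 is never a valid match, so it marks "no new high"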
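
Both answers can be packaged into one reusable helper. The following is only a sketch under a few assumptions: the function name first_higher_index is hypothetical, the index is assumed to be a DatetimeIndex (so that missing results become NaT), and an explicit any() mask is used in place of the index comparison or replace step shown above.

```python
import numpy as np
import pandas as pd

def first_higher_index(s):
    """Index label of the first later row with a larger value (NaT if none)."""
    vals = s.to_numpy()
    # later_and_higher[i, j]: row j comes strictly after row i and is larger.
    later_and_higher = np.triu(vals[:, None] < vals, k=1)
    pos = later_and_higher.argmax(axis=1)    # first match per row, 0 if none
    found = later_and_higher.any(axis=1)     # rows that have a match at all
    out = pd.Series(s.index[pos], index=s.index)
    return out.where(found)                  # no match -> NaT

# Usage with the data from the question.
s = pd.Series([0, 2, 3, 2, 3, 4, 5, 6, 5, 4, 7, 8, 7, 2, 3],
              index=pd.date_range('20220101', periods=15))
res = first_higher_index(s)
print(res)
```

Note that every broadcasting variant builds an n x n boolean matrix, so memory grows quadratically with the number of rows; for very long frames the original loop, or processing in chunks, may be the more practical choice.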