How do I find last index of true value in a dataframe when applying condition to each row in an effi

Time:02-23

Say I have a pandas DataFrame with two columns, previous and current. Assume the values in each column are non-decreasing and that each current value is greater than the previous value in the same row.

Now, for each element in the previous column, I want to find the index of the last value in the current column that is less than it. I then want to subtract that index from the element's own index and store the result in a new column, say numIndexes.

Working but inefficient code is as follows:

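import pandas as pd

df = pd.DataFrame({'previous': [1, 3, 5, 7, 9, 11, 13, 17],
                   'current': [2, 6, 9, 10, 15, 19, 20, 21]})
df['numIndexes'] = 1
for i in range(len(df)):
    # Boolean mask: which current values are below this row's previous value
    x = df['previous'][i] > df['current']
    # idxmax on the reversed mask returns the last index where the mask is True
    df.loc[i, 'numIndexes'] = i - x[::-1].idxmax()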

OUTPUT
   previous  current  numIndexes
0         1        2          -7
1         3        6           1
2         5        9           2
3         7       10           2
4         9       15           3
5        11       19           2
6        13       20           3
7        17       21           3

Ignore the first negative value.

To explain my objective using the example above: at index 5, the previous value is 11. In the current column, the last index where the value is less than 11 is index 3. This gives numIndexes for row 5 as 2 (5 - 3).

For a large dataset, this method is extremely slow. Any help appreciated to speed up this logic.

EDIT: The assumption of strictly increasing values is not correct. Values are non-decreasing. However, each previous value is strictly less than its corresponding current value.

CodePudding user response:

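Since the values are non-decreasing, you can use NumPy broadcasting, via [:, None], to compare every current value with every previous value. Summing the comparison along axis 0 counts how many current values are less than each previous value; subtracting 1 (positions start at 0) gives the position of the last row whose current value is less than that previous value. For the first row the count is 0, so the position is -1, which wraps around to the last row and produces the -7.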

Then create your column, which is the index minus the index of this calculated row.

ilocs = (df['current'].to_numpy()[:, None] < df['previous'].to_numpy()).sum(0) - 1
df['numIndexes'] = df.index  - df.index[ilocs]

   previous  current  numIndexes
0         1        2          -7
1         3        6           1
2         5        9           2
3         7       10           2
4         9       15           3
5        11       19           2
6        13       20           3
7        17       21           3
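
Because current is sorted, the same count can also be obtained by binary search instead of an n × n comparison. This is a different technique from the broadcasting above, shown only as a minimal sketch; it assumes np.searchsorted with side='left', and the names last_pos and num are illustrative. Rows with no match are set to NaN here rather than wrapping around to -7.

import numpy as np
import pandas as pd

df = pd.DataFrame({'previous': [1, 3, 5, 7, 9, 11, 13, 17],
                   'current': [2, 6, 9, 10, 15, 19, 20, 21]})

cur = df['current'].to_numpy()
prev = df['previous'].to_numpy()

# side='left' returns, for each previous value, how many current values
# are strictly less than it (requires current to be sorted).
counts = np.searchsorted(cur, prev, side='left')

# Position of the last current value below each previous value; -1 means none.
last_pos = counts - 1

# Distance from each row to that position; NaN where no such row exists.
num = np.where(last_pos >= 0, np.arange(len(df)) - last_pos, np.nan)
df['numIndexes'] = num

This needs only O(n) extra memory, and the counts agree with the column sums of the broadcast comparison on the sample data.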

The above is memory intensive, because it builds an n × n boolean array. If that doesn't work, you can use an asof merge to match each row to the last row whose current value is less than its previous value. We bring along the index so you can perform the subtraction afterwards. I've left in the additional columns showing the matched value and the matched index for illustration; drop them if you don't need them.

import pandas as pd

df = pd.merge_asof(df, df[['current']].reset_index(),
                   left_on='previous', right_on='current',
                   suffixes=['', '_match'],
                   allow_exact_matches=False  # Require strictly less than
                  )

df['numIndexes'] = df.index - df['index']

   previous  current  index  current_match  numIndexes
0         1        2    NaN            NaN         NaN
1         3        6    0.0            2.0         1.0
2         5        9    0.0            2.0         2.0
3         7       10    1.0            6.0         2.0
4         9       15    1.0            6.0         3.0
5        11       19    3.0           10.0         2.0
6        13       20    3.0           10.0         3.0
7        17       21    4.0           15.0         3.0
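
If the helper columns and the float dtype are unwanted, the merge result could be tidied along these lines. This is only a sketch: the names right, merged and result are illustrative, and it assumes a pandas version with the nullable 'Int64' dtype.

import pandas as pd

df = pd.DataFrame({'previous': [1, 3, 5, 7, 9, 11, 13, 17],
                   'current': [2, 6, 9, 10, 15, 19, 20, 21]})

# Right-hand table: the current values together with their original row index.
right = df[['current']].reset_index()

merged = pd.merge_asof(df, right,
                       left_on='previous', right_on='current',
                       suffixes=('', '_match'),
                       allow_exact_matches=False)  # strictly less than

# Nullable integers keep the unmatched first row as <NA> instead of floats.
merged['numIndexes'] = (merged.index - merged['index']).astype('Int64')

# Drop the helper columns brought along by the merge.
result = merged.drop(columns=['index', 'current_match'])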

CodePudding user response:

I would use np.subtract.outer with cumsum and argmax (this needs import numpy as np):

df.index - np.cumsum(np.subtract.outer(df['previous'].values,df['current'].values),axis=1).argmax(axis=1)
Out[278]: Int64Index([0, 1, 2, 2, 3, 2, 3, 3], dtype='int64')