Say I have a pandas DataFrame with two columns, previous and current. We can assume that the values in each column are non-decreasing and that each current value is greater than the previous value in the same row.
For each element of the previous column, I want to look up the index of the last value in the current column that is less than that element. I then want to subtract that index from the element's own index and store the result in a new column, say numIndexes.
Working but inefficient code is as follows:
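import pandas as pd

df = pd.DataFrame({'previous': [1,3,5,7,9,11,13,17],'current': [2,6,9,10,15,19,20,21]})
df['numIndexes'] = 1
for i in range(len(df)):
    # Boolean mask: which current values are below this row's previous value?
    x = df['previous'][i] > df['current']
    # On the reversed mask, idxmax returns the label of the last True (or the last row if none is True)
    df.loc[i, 'numIndexes'] = i - x[::-1].idxmax()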
OUTPUT
   previous  current  numIndexes
0         1        2          -7
1         3        6           1
2         5        9           2
3         7       10           2
4         9       15           3
5        11       19           2
6        13       20           3
7        17       21           3
Ignore the first negative value.
To explain my objective with the example above: at index 5, the previous value is 11. In the current column, the last index where the value is less than 11 is index 3. This gives numIndexes = 2 (5 - 3) for row 5.
For a large dataset, this method is extremely slow. Any help speeding up this logic is appreciated.
EDIT: The assumption of strictly increasing values is not correct; the values are non-decreasing. However, each previous value is strictly less than its corresponding current value.
CodePudding user response:
Since the values are non-decreasing, you can use NumPy broadcasting, via [:, None], to compare every current value with every previous value. We then sum along the first axis and subtract 1, since counting starts at 0, which gives the index position of the last row whose current value is less than the previous value, for every row in the DataFrame.
Then create your column, which is the index minus the index of this calculated row.
ilocs = (df['current'].to_numpy()[:, None] < df['previous'].to_numpy()).sum(0) - 1
df['numIndexes'] = df.index - df.index[ilocs]
   previous  current  numIndexes
0         1        2          -7
1         3        6           1
2         5        9           2
3         7       10           2
4         9       15           3
5        11       19           2
6        13       20           3
7        17       21           3
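To make the intermediate steps of the broadcasting comparison easier to follow, here is a minimal sketch on a hypothetical three-row example (the arrays below are illustrative and are not the data from the question). It only breaks the one-liner into named steps.

```python
import numpy as np

# Hypothetical three-row example, not the data from the question.
current = np.array([2, 6, 9])
previous = np.array([1, 3, 7])

# Rows follow `current`, columns follow `previous`:
# mask[j, i] is True when current[j] < previous[i].
mask = current[:, None] < previous

# Each column sum counts the current values below that previous value.
counts = mask.sum(axis=0)

# Because current is sorted, count - 1 is the position of the last such row.
# A value of -1 means "no match"; used as an index it wraps to the last row,
# which is where the -7 in the first output row comes from.
ilocs = counts - 1
print(mask)
print(ilocs)
```

The mask has one entry per (current, previous) pair, so its size grows with the square of the number of rows.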
The broadcasting approach above is memory intensive, since it builds an n x n comparison array. If that doesn't work, you can use an asof merge (pd.merge_asof) to match each row to the last row whose current value is less than its previous value. We bring along the index so you can perform the subtraction afterwards. I've left in the additional columns showing the value it matched and the index it matched, for illustration; drop them if you don't need them.
import pandas as pd

df = pd.merge_asof(df, df[['current']].reset_index(),
                   left_on='previous', right_on='current',
                   suffixes=['', '_match'],
                   allow_exact_matches=False  # Require strictly less than
                   )
df['numIndexes'] = df.index - df['index']
   previous  current  index  current_match  numIndexes
0         1        2    NaN            NaN         NaN
1         3        6    0.0            2.0         1.0
2         5        9    0.0            2.0         2.0
3         7       10    1.0            6.0         2.0
4         9       15    1.0            6.0         3.0
5        11       19    3.0           10.0         2.0
6        13       20    3.0           10.0         3.0
7        17       21    4.0           15.0         3.0
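As a possible lower-memory alternative to both approaches in this answer, the same count can be obtained with a binary search. This is a sketch using np.searchsorted, which is a swapped-in technique and not part of the original answer. It assumes that current is sorted in non-decreasing order (as stated in the question) and that the DataFrame has the default RangeIndex, so positions and index labels coincide.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'previous': [1, 3, 5, 7, 9, 11, 13, 17],
                   'current': [2, 6, 9, 10, 15, 19, 20, 21]})

current = df['current'].to_numpy()
previous = df['previous'].to_numpy()

# With side='left', searchsorted returns for each previous value the number
# of current values strictly less than it (requires current to be sorted).
counts = np.searchsorted(current, previous, side='left')
last_smaller = counts - 1  # -1 where no current value is smaller

# Assumption: default RangeIndex, so row positions equal index labels.
positions = np.arange(len(df))

# Leave NaN where there is no match, mirroring the merge_asof output.
df['numIndexes'] = np.where(last_smaller >= 0, positions - last_smaller, np.nan)
print(df)
```

This avoids building the n x n comparison array and does one binary search per row instead.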
CodePudding user response:
I am using np.subtract.outer with argmax:
df.index - np.cumsum(np.subtract.outer(df['previous'].values,df['current'].values),axis=1).argmax(axis=1)
Out[278]: Int64Index([0, 1, 2, 2, 3, 2, 3, 3], dtype='int64')
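To see why the cumulative sum followed by argmax lands on the right column, here is a small sketch that splits the one-liner into steps on a hypothetical three-row example (illustrative arrays, not the question's data). It relies on current being non-decreasing.

```python
import numpy as np

# Hypothetical three-row example, not the data from the question.
previous = np.array([1, 3, 7])
current = np.array([2, 6, 9])

# diffs[i, j] = previous[i] - current[j]; each row is non-increasing
# because current is non-decreasing.
diffs = np.subtract.outer(previous, current)

# The running sum along a row grows while the differences are positive,
# so it peaks at the last column where current[j] < previous[i].
running = np.cumsum(diffs, axis=1)
last_smaller = running.argmax(axis=1)
print(diffs)
print(last_smaller)
```

Note that a row with no smaller current value (the first row here) also yields 0, so it cannot be told apart from a genuine match at position 0. That is why the first entry of the result above is 0 rather than -7 or NaN. Like the broadcasting approach, this builds an n x n intermediate array.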