efficient multiple first valid index algorithms in Pandas

Time:01-19

One useful Pandas function is first_valid_index.

An obvious question: suppose I want to find multiple first_valid_index results for my DataFrame, given that a particular column is increasing (in this example, event_time):

df.where(df.event_time >= 83000000).first_valid_index()
df.where(df.event_time >= 90000000).first_valid_index()
df.where(df.event_time >= 93000000).first_valid_index()
df.where(df.event_time >= 100000000).first_valid_index()
df.where(df.event_time >= 103000000).first_valid_index()

This takes a very long time, since each call searches from the beginning of the DataFrame. There should be a more efficient way to do this.

Any comments or ideas are appreciated.

I have checked the pandas documentation but could not find a solution.

CodePudding user response:

Use numpy.searchsorted:

import numpy as np
import pandas as pd

np.random.seed(2022)

# sample data
df = pd.DataFrame({'event_time': np.random.randint(700, size=500000)}).cumsum()
df.loc[20, 'event_time'] = 83000000
df = df.sort_values('event_time', ignore_index=True)
print(df)

#original solution
a = df.where(df.event_time >= 83000000).first_valid_index()
b = df.where(df.event_time >= 90000000).first_valid_index()
c = df.where(df.event_time >= 93000000).first_valid_index()
d = df.where(df.event_time >= 100000000).first_valid_index()
e = df.where(df.event_time >= 103000000).first_valid_index()

print (a,b,c,d,e)
237709 257737 266255 286277 294890

arr = [83000000,90000000,93000000,100000000,103000000]
out = np.searchsorted(df['event_time'].to_numpy(), arr)
print (out)
[237709 257737 266255 286277 294890]
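A note on why this works: np.searchsorted with the default side='left' returns, for each target, the position of the first element >= target in the sorted array, which matches the first_valid_index semantics of `event_time >= target`. Two caveats apply: the result is a positional index, so it only matches the DataFrame's labels when the index is the default RangeIndex (as it is here after ignore_index=True), and a target larger than every value yields len(arr) rather than None. A small illustrative example (made-up data, not the question's DataFrame):

```python
import numpy as np
import pandas as pd

# Assumption: the column is monotonically increasing, as in the question.
df = pd.DataFrame({'event_time': [10, 20, 20, 30, 40]})

targets = [15, 20, 35]
# side='left' (the default) gives the position of the first element >= target.
out = np.searchsorted(df['event_time'].to_numpy(), targets, side='left')
print(out)  # [1 1 4]
```

Because the underlying algorithm is a binary search, each lookup costs O(log n) instead of the O(n) scan that where(...).first_valid_index() performs.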

Performance:

#original solution
%%timeit
a = df.where(df.event_time >= 83000000).first_valid_index()
b = df.where(df.event_time >= 90000000).first_valid_index()
c = df.where(df.event_time >= 93000000).first_valid_index()
d = df.where(df.event_time >= 100000000).first_valid_index()
e = df.where(df.event_time >= 103000000).first_valid_index()

86.5 ms ± 4.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

#Michael Cao solution
%%timeit
a = df.where(df.event_time >= 83000000).first_valid_index()
b = df.loc[a:].where(df.event_time >= 90000000).first_valid_index()
c = df.loc[b:].where(df.event_time >= 93000000).first_valid_index()
d = df.loc[c:].where(df.event_time >= 100000000).first_valid_index()
e = df.loc[d:].where(df.event_time >= 103000000).first_valid_index()

99.6 ms ± 608 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


%%timeit
arr = [83000000,90000000,93000000,100000000,103000000]
np.searchsorted(df['event_time'], arr)


16.9 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


%%timeit
np.searchsorted(df['event_time'].to_numpy(), arr)

8.93 µs ± 104 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

CodePudding user response:

Store the previous result and start searching from there using df.loc:

a = df.where(df.event_time >= 83000000).first_valid_index()
b = df.loc[a:].where(df.event_time >= 90000000).first_valid_index()
c = df.loc[b:].where(df.event_time >= 93000000).first_valid_index()
d = df.loc[c:].where(df.event_time >= 100000000).first_valid_index()
e = df.loc[d:].where(df.event_time >= 103000000).first_valid_index()
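For an arbitrary number of thresholds, the chained lookups above can be wrapped in a loop. This is a sketch, not code from either answer; first_valid_indices is a hypothetical helper name, and it assumes the thresholds are given in increasing order so each search can resume from the previous hit:

```python
import pandas as pd

# Small deterministic sample for illustration (not the question's data).
df = pd.DataFrame({'event_time': [5, 10, 15, 20, 25, 30]})

def first_valid_indices(frame, thresholds):
    """Chained lookup: each search resumes from the previous hit,
    so the whole pass is a single forward sweep over the frame."""
    out = []
    start = frame.index[0]
    for t in thresholds:
        idx = frame.loc[start:].where(frame.event_time >= t).first_valid_index()
        out.append(idx)
        if idx is not None:
            start = idx  # resume the next search from here
    return out

print(first_valid_indices(df, [12, 20, 28]))  # [2, 3, 5]
```

Note that, as the benchmarks above show, this chained approach still scans linearly and was not faster than the naive version on this data; np.searchsorted remains the better choice when the column is sorted.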