One useful function in Pandas is first_valid_index.
Suppose I want to find the first valid index for several thresholds in my DataFrame, given that the relevant column is increasing (in this example, event_time):
df.where(df.event_time >= 83000000).first_valid_index()
df.where(df.event_time >= 90000000).first_valid_index()
df.where(df.event_time >= 93000000).first_valid_index()
df.where(df.event_time >= 100000000).first_valid_index()
df.where(df.event_time >= 103000000).first_valid_index()
This takes a very long time, since each call searches from the beginning of the DataFrame. There should be a more efficient way to do this.
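In loop form, what I am doing is equivalent to the following, where each iteration still scans from the start:
thresholds = [83000000, 90000000, 93000000, 100000000, 103000000]
indices = [df.where(df.event_time >= t).first_valid_index() for t in thresholds]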
Any comments or ideas are appreciated.
I have checked the pandas documentation but could not find a solution.
CodePudding user response:
Use numpy.searchsorted:
import numpy as np
import pandas as pd

np.random.seed(2022)
#sample data: cumulative sums of random increments give an increasing event_time column
df = pd.DataFrame({'event_time':np.random.randint(700, size=500000)}).cumsum()
df.loc[20, 'event_time'] = 83000000
df = df.sort_values('event_time', ignore_index=True)
print (df)
#original solution
a = df.where(df.event_time >= 83000000).first_valid_index()
b = df.where(df.event_time >= 90000000).first_valid_index()
c = df.where(df.event_time >= 93000000).first_valid_index()
d = df.where(df.event_time >= 100000000).first_valid_index()
e = df.where(df.event_time >= 103000000).first_valid_index()
print (a,b,c,d,e)
237709 257737 266255 286277 294890
arr = [83000000,90000000,93000000,100000000,103000000]
out = np.searchsorted(df['event_time'].to_numpy(), arr)
print (out)
[237709 257737 266255 286277 294890]
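This works because event_time is sorted ascending and np.searchsorted performs a binary search: with the default side='left' it returns, for each value, the first position where that value could be inserted while keeping the array sorted, i.e. the first index with event_time >= value. Since the index was reset with ignore_index=True, these positions coincide with the labels returned by first_valid_index. A small sketch of the side parameter:
#side='left' (default): first i with event_time[i] >= v
#side='right': first i with event_time[i] > v
print (np.searchsorted(df['event_time'].to_numpy(), 83000000, side='left'))
print (np.searchsorted(df['event_time'].to_numpy(), 83000000, side='right'))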
Performance:
#original solution
%%timeit
a = df.where(df.event_time >= 83000000).first_valid_index()
b = df.where(df.event_time >= 90000000).first_valid_index()
c = df.where(df.event_time >= 93000000).first_valid_index()
d = df.where(df.event_time >= 100000000).first_valid_index()
e = df.where(df.event_time >= 103000000).first_valid_index()
86.5 ms ± 4.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#Michael Cao solution
%%timeit
a = df.where(df.event_time >= 83000000).first_valid_index()
b = df.loc[a:].where(df.event_time >= 90000000).first_valid_index()
c = df.loc[b:].where(df.event_time >= 93000000).first_valid_index()
d = df.loc[c:].where(df.event_time >= 100000000).first_valid_index()
e = df.loc[d:].where(df.event_time >= 103000000).first_valid_index()
99.6 ms ± 608 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
arr = [83000000,90000000,93000000,100000000,103000000]
np.searchsorted(df['event_time'], arr)
16.9 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
np.searchsorted(df['event_time'].to_numpy(), arr)
8.93 µs ± 104 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
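As a side note, the incremental df.loc variant is actually a bit slower than the original here, presumably because each where call still evaluates the comparison over the full column. pandas also exposes searchsorted as a Series method, which should give the same result; a minimal untimed sketch:
#same lookup via the Series method
out = df['event_time'].searchsorted(arr)
print (out)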
CodePudding user response:
Store the previous result and start looking from there using df.loc:
a = df.where(df.event_time >= 83000000).first_valid_index()
b = df.loc[a:].where(df.event_time >= 90000000).first_valid_index()
c = df.loc[b:].where(df.event_time >= 93000000).first_valid_index()
d = df.loc[c:].where(df.event_time >= 100000000).first_valid_index()
e = df.loc[d:].where(df.event_time >= 103000000).first_valid_index()
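The same idea as a loop; a minimal sketch, assuming the thresholds are increasing and each one is actually reached in the data (otherwise first_valid_index returns None and the next slice would fail):
thresholds = [83000000, 90000000, 93000000, 100000000, 103000000]
indices = []
start = df.index[0]
for t in thresholds:
    #resume from the previous hit instead of rescanning from the top
    start = df.loc[start:].where(df.event_time >= t).first_valid_index()
    indices.append(start)
print (indices)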