I have two dataframes: df2
Time(s)  vehicle_id
10       1
11       1
12       1
13       1
...      ...
78       2
79       2
80       2
And com:
index  v1  v2
1      1   2
I would like to find the indexes of the rows in com that match the following condition:
if abs(max(df2.loc[df2['vehicle_id'] == com['v1'][i], 'Time(s)']) - max(df2.loc[df2['vehicle_id'] == com['v2'][i], 'Time(s)'])) > 60.0:
    arr.append(i)
In this case the condition is true: the max time for v1 is 13 and for v2 is 80, and abs(13 - 80) = 67 > 60. It works; however, it has to compare against more than 40M rows and is processing only about 200k rows per hour, so it will take many hours. Is there a faster way to proceed? I just need to obtain the indexes from com that match the above condition.
Update: I check the condition at every position of com using a for loop:
arr = []
for i in range(len(com)):
    if abs(max(df2.loc[df2['vehicle_id'] == com['v1'][i], 'Time(s)']) - max(df2.loc[df2['vehicle_id'] == com['v2'][i], 'Time(s)'])) > 60.0:
        arr.append(i)
CodePudding user response:
It looks like you only need the max time for each vehicle id, so sort by time and drop duplicates to keep only the max time per vehicle_id. Then you can replace the vehicle ids in your comm df with those max times and apply your filter condition to the entire result at once, saving the resulting indexes to an array.
import pandas as pd

df2 = pd.DataFrame({'Time(s)': [10, 11, 12, 13, 78, 79, 80], 'vehicle_id': [1, 1, 1, 1, 2, 2, 2]})
comm = pd.DataFrame({'index': [1], 'v1': [1], 'v2': [2]})

# keep only the row with the max time for each vehicle_id
m = df2.sort_values(by='Time(s)').drop_duplicates(subset='vehicle_id', keep='last')

# replace each vehicle id in comm with that vehicle's max time
comm = comm.set_index('index').replace(m.set_index('vehicle_id')['Time(s)'].to_dict())

# vectorized filter: keep indexes where the absolute time difference exceeds 60
arr = comm.loc[(comm['v1'] - comm['v2']).abs().gt(60)].index.values
print(arr)  # [1]
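For comparison, here is a minimal alternative sketch (assuming the same column names as above) that computes the per-vehicle max with groupby and looks it up with map; map matches only against the vehicle_id index, whereas replace with a dict substitutes those values wherever they appear in the frame:

import pandas as pd

df2 = pd.DataFrame({'Time(s)': [10, 11, 12, 13, 78, 79, 80], 'vehicle_id': [1, 1, 1, 1, 2, 2, 2]})
com = pd.DataFrame({'index': [1], 'v1': [1], 'v2': [2]}).set_index('index')

# max time per vehicle_id, as a Series indexed by vehicle_id
max_t = df2.groupby('vehicle_id')['Time(s)'].max()

# map each id column to its vehicle's max time and filter in one vectorized pass
diff = com['v1'].map(max_t) - com['v2'].map(max_t)
arr = com.index[diff.abs() > 60].values
print(arr)  # [1]

Either way, the 40M-row df2 is reduced to one max time per vehicle up front, so the per-row Python loop over com disappears entirely.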