I need to improve the performance of the following dataframe slices matching. What I need to do is find the matching trips between 2 dataframes, according to the sequence column values with order conserved.
My 2 dataframes:
>>>df1
trips sequence
0 11 a
1 11 d
2 21 d
3 21 a
4 31 a
5 31 b
6 31 c
>>>df2
trips sequence
0 12 a
1 12 d
2 22 c
3 22 b
4 22 a
5 32 a
6 32 d
Expected output:
['11 match 12']
This is the following code I' m using:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'trips': [11, 11, 21, 21, 31, 31, 31], 'sequence': ['a', 'd', 'd', 'a', 'a', 'b', 'c']})
df2 = pd.DataFrame({'trips': [12, 12, 22, 22, 22, 32, 32], 'sequence': ['a', 'd', 'c', 'b', 'a', 'a', 'd']})
route_match = []
for trip1 in df1['trips'].drop_duplicates():
for trip2 in df2['trips'].drop_duplicates():
route1 = df1[df1['trips'] == trip1]['sequence']
route2 = df2[df2['trips'] == trip2]['sequence']
if np.array_equal(route1.values,route2.values):
route_match.append(str(trip1) ' match ' str(trip2))
break
else:
continue
Despite working, this is very time costly and unefficient as my real dataframes are longer. Any suggestions?
CodePudding user response:
You can aggregate each trip as tuple with groupby.agg
, then merge
the two outputs to identify the identical routes:
out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
df2.groupby('trips', as_index=False)['sequence'].agg(tuple),
on='sequence'
)
output:
trips_x sequence trips_y
0 11 (a, d) 12
1 11 (a, d) 32
If you only want the first match, drop_duplicates
the output of df2
aggregation to prevent unnecessary merging:
out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
df2.groupby('trips', as_index=False)['sequence'].agg(tuple)
.drop_duplicates(subset='sequence'),
on='sequence'
)
output:
trips_x sequence trips_y
0 11 (a, d) 12