I have to merge two dataframes into one, line by line (1 line from the first df and 1 line from the second df). Also, I need the fastest runtime possible.
Input:
df1:
timestamp radar_name
0 101 Front
1 102 Front
2 103 Front
3 104 Front
df2
timestamp radar_name
0 101 Rear
1 102 Rear
2 103 Rear
3 104 Rear
Output:
merged_df:
timestamp radar_name
0 101 Front
1 101 Rear
2 102 Front
3 102 Rear
4 103 Front
5 103 Rear
6 104 Front
7 104 Rear
Currently I have implemented two methods:
1.Iterating through the files with a for loop - running time is ~1min 50 secs
for row_cnt in range(len(first_half)):
merged_file.loc[merged_file_index] = first_half.loc[row_cnt]
merged_file_index = 1
merged_file.loc[merged_file_index] = second_half.loc[row_cnt]
merged_file_index = 1
2.Concat df1 and f2 than sort by timestamp - running time is ~1 min
frames=[df1,df2]
merged_file_2=pd.concat(frames)
merged_file_2.sort_values(by=['timestamp'],inplace=True)
merged_file_2.reset_index(inplace=True)
merged_file_2.drop(columns=['index'],inplace=True)
Per 1 file the times are manageable, but i have 100 of these file merging that need to be ran multiple times and in the end the time adds up.
Is there another way that I can use to make the merging faster?
CodePudding user response:
If you make your existing code a bit simpler, you should already get a really good improvement. Moreover, if you add the ignore_index
in the sort, you will not need to drop the index, which will speed things up.
You can also play around with the kind
argument of sort_values
. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
I would do this:
merged_file_3=pd.concat([df1,df2]).sort_values(by=['timestamp'], kind="stable",ignore_index=True)
CodePudding user response:
You can sort by index using a stable sort:
df3 = (pd.concat([df1, df2])
.sort_index(kind='stable')
.reset_index(drop=True)
)
output:
timestamp radar_name
0 101 Front
1 101 Rear
2 102 Front
3 102 Rear
4 103 Front
5 103 Rear
6 104 Front
7 104 Rear
Or use a pre-computed index slice:
import numpy as np
idx = np.argsort(np.r_[np.arange(df1.shape[0]), np.arange(df2.shape[0])])
# array([0, 4, 1, 5, 2, 6, 3, 7])
df3 = pd.concat([df1, df2]).iloc[idx]
variant for an arbitrary number of dataframes:
dfs = [df1, df2]
idx = np.argsort(np.concatenate([np.arange(d.shape[0]) for d in dfs]))
df3 = pd.concat(dfs, ignore_index=True).iloc[idx]
output:
timestamp radar_name
0 101 Front
0 101 Rear
1 102 Front
1 102 Rear
2 103 Front
2 103 Rear
3 104 Front
3 104 Rear