Home > Mobile >  Pandas DataFrame merge line by line
Pandas DataFrame merge line by line

Time:03-03

I have to merge two dataframes into one, line by line (1 line from the first df and 1 line from the second df). Also, I need the fastest runtime possible.

Input:

df1:
      timestamp   radar_name
0        101         Front
1        102         Front
2        103         Front
3        104         Front

df2
      timestamp    radar_name
0        101          Rear
1        102          Rear
2        103          Rear
3        104          Rear

Output:

merged_df:

      timestamp   radar_name
0        101         Front
1        101         Rear
2        102         Front
3        102         Rear
4        103         Front
5        103         Rear
6        104         Front
7        104         Rear

Currently I have implemented two methods:

1.Iterating through the files with a for loop - running time is ~1min 50 secs

for row_cnt in range(len(first_half)):
    merged_file.loc[merged_file_index] = first_half.loc[row_cnt]
    merged_file_index  = 1
    merged_file.loc[merged_file_index] = second_half.loc[row_cnt]
    merged_file_index  = 1

2.Concat df1 and f2 than sort by timestamp - running time is ~1 min

frames=[df1,df2]
merged_file_2=pd.concat(frames)
merged_file_2.sort_values(by=['timestamp'],inplace=True)
merged_file_2.reset_index(inplace=True)
merged_file_2.drop(columns=['index'],inplace=True)

Per 1 file the times are manageable, but i have 100 of these file merging that need to be ran multiple times and in the end the time adds up.

Is there another way that I can use to make the merging faster?

CodePudding user response:

If you make your existing code a bit simpler, you should already get a really good improvement. Moreover, if you add the ignore_index in the sort, you will not need to drop the index, which will speed things up.

You can also play around with the kind argument of sort_values. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html

I would do this:

merged_file_3=pd.concat([df1,df2]).sort_values(by=['timestamp'], kind="stable",ignore_index=True)

CodePudding user response:

You can sort by index using a stable sort:

df3 = (pd.concat([df1, df2])
         .sort_index(kind='stable')
         .reset_index(drop=True)
      )

output:

   timestamp radar_name
0        101      Front
1        101       Rear
2        102      Front
3        102       Rear
4        103      Front
5        103       Rear
6        104      Front
7        104       Rear

Or use a pre-computed index slice:

import numpy as np

idx = np.argsort(np.r_[np.arange(df1.shape[0]), np.arange(df2.shape[0])])
# array([0, 4, 1, 5, 2, 6, 3, 7])

df3 = pd.concat([df1, df2]).iloc[idx]

variant for an arbitrary number of dataframes:

dfs = [df1, df2]

idx = np.argsort(np.concatenate([np.arange(d.shape[0]) for d in dfs]))

df3 = pd.concat(dfs, ignore_index=True).iloc[idx]

output:

   timestamp radar_name
0        101      Front
0        101       Rear
1        102      Front
1        102       Rear
2        103      Front
2        103       Rear
3        104      Front
3        104       Rear
  • Related