Quickest way to merge two very large pandas dataframes using python

I have multiple sets of very large CSV files that I need to merge based on a unique ID. I set this unique ID as the index; it is a concatenation of my Origin and Destination columns.

Dataframe 1:

Origin Destination Value
70478 70 478 0.002779
70479 70 479 0.001673
70480 70 480 0.000427
70481 70 481 0.001503
70482 70 482 0.01215
70483 70 483 0.004507
70484 70 484 0.001871
70485 70 485 0.006522
70486 70 486 0.004786
70487 70 487 0.026566

Dataframe 2:

Origin Destination Value
70478 70 478 135.974365
70479 70 479 130.936752
70480 70 480 111.191734
70481 70 481 98.170746
70482 70 482 88.257645
70483 70 483 102.095566
70484 70 484 103.585373
70485 70 485 114.298431
70486 70 486 97.331055
70487 70 487 85.754776

My final table should be as follows (Demand = Value from df_1; Time = Value from df_2; Demand_Time = Demand * Time):

Origin Destination Demand Time Demand_Time
0 70 478 0.002779 135.974365 0.377858
1 70 479 0.001673 130.936752 0.219041
2 70 480 0.000427 111.191734 0.047494
3 70 481 0.001503 98.170746 0.147536
4 70 482 0.01215 88.257645 1.072321
5 70 483 0.004507 102.095566 0.460115
6 70 484 0.001871 103.585373 0.193806
7 70 485 0.006522 114.298431 0.74551
8 70 486 0.004786 97.331055 0.465854
9 70 487 0.026566 85.754776 2.278125

I do a .compare between df_1 and df_2, which produces the following new dataframe:

Origin Destination Value
self other self other self other
70478 70 70 478 478 0.002779 135.974365
70479 70 70 479 479 0.001673 130.936752
70480 70 70 480 480 0.000427 111.191734
70481 70 70 481 481 0.001503 98.170746
70482 70 70 482 482 0.01215 88.257645
70483 70 70 483 483 0.004507 102.095566
70484 70 70 484 484 0.001871 103.585373
70485 70 70 485 485 0.006522 114.298431
70486 70 70 486 486 0.004786 97.331055
70487 70 70 487 487 0.026566 85.754776

I then create a new empty pd.DataFrame, df_out, iterate over the compare table above, and .append each row to it.

The last part, which iterates and appends, takes a very long time on very large tables (a few hundred thousand records each): about 1.5 hours per run.

Is there a way to do this last part more efficiently?

Thank you.

Code sample:

import pandas as pd

# Replicating sample df1 (.read_csv from csv file 1)
df_1_data = [[70, 478, 0.0027788935694843],
             [70, 479, 0.0016728754853829],
             [70, 480, 0.0004271405050531],
             [70, 481, 0.0015028485795482],
             [70, 482, 0.0121498983353376],
             [70, 483, 0.0045067127794027],
             [70, 484, 0.0018709792057052],
             [70, 485, 0.0065224897116422],
             [70, 486, 0.0047862790524959],
             [70, 487, 0.0265655759721994]]
df_1 = pd.DataFrame(df_1_data, columns=['Origin', 'Destination', 'Value'])
df_1 = df_1.set_index(df_1['Origin'].astype(str) + df_1['Destination'].astype(str))
print(df_1)

# Replicating sample df2 (.read_csv from csv file 2)
df_2_data = [[70, 478, 135.9743652],
             [70, 479, 130.9367523],
             [70, 480, 111.1917343],
             [70, 481, 98.17074585],
             [70, 482, 88.25764465],
             [70, 483, 102.0955658],
             [70, 484, 103.5853729],
             [70, 485, 114.2984314],
             [70, 486, 97.33105469],
             [70, 487, 85.754776]]
df_2 = pd.DataFrame(df_2_data, columns=['Origin', 'Destination', 'Value'])
df_2 = df_2.set_index(df_2['Origin'].astype(str) + df_2['Destination'].astype(str))
print(df_2)

df_compare = df_1.compare(df_2, keep_shape=True, keep_equal=True)
print(df_compare)

df_out = pd.DataFrame(columns=['Origin', 'Destination', 'Demand', 'Time', 'Demand_Time'])
# This loop is the slow part: .append copies the entire frame on every
# iteration (and DataFrame.append was removed in pandas 2.0)
for index, row in df_compare.iterrows():
    df_out = df_out.append({'Origin': int(row['Origin']['self']), 'Destination': int(row['Destination']['self']),
                            'Demand': row['Value']['self'], 'Time': row['Value']['other'],
                            'Demand_Time': row['Value']['self'] * row['Value']['other']}, ignore_index=True)
print(df_out)

print('\nCOMPLETED')

CodePudding user response:

IIUC, you can use:

out = (df_1.rename(columns={'Value': 'Demand'})
           .assign(Time=df_2['Value'], Demand_Time=df_2['Value'] * df_1['Value'])
           .reset_index(drop=True))
print(out)

# Output
   Origin  Destination    Demand        Time  Demand_Time
0      70          478  0.002779  135.974365     0.377873
1      70          479  0.001673  130.936752     0.219057
2      70          480  0.000427  111.191734     0.047479
3      70          481  0.001503   98.170746     0.147551
4      70          482  0.012150   88.257645     1.072330
5      70          483  0.004507  102.095566     0.460145
6      70          484  0.001871  103.585373     0.193808
7      70          485  0.006522  114.298431     0.745454
8      70          486  0.004786   97.331055     0.465826
9      70          487  0.026566   85.754776     2.278161
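Note that the `.assign` approach above relies on the two frames sharing the same index. For the general case where the two CSVs may not be in the same row order, an explicit merge on the key columns also works. A minimal sketch with toy stand-in frames (not the asker's full data):

```python
import pandas as pd

# Toy frames standing in for the two CSVs; df_2 is deliberately in a
# different row order to show the merge still lines the rows up by key.
df_1 = pd.DataFrame({'Origin': [70, 70], 'Destination': [478, 479],
                     'Value': [0.002779, 0.001673]})
df_2 = pd.DataFrame({'Origin': [70, 70], 'Destination': [479, 478],
                     'Value': [130.936752, 135.974365]})

# Rename each Value column before merging so the result has Demand/Time.
out = (df_1.rename(columns={'Value': 'Demand'})
           .merge(df_2.rename(columns={'Value': 'Time'}),
                  on=['Origin', 'Destination']))
out['Demand_Time'] = out['Demand'] * out['Time']
print(out)
```

This avoids building the string index entirely, since the merge keys are the Origin/Destination columns themselves.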

CodePudding user response:

If I understood the request correctly, I would use a combination of pandas and numpy to get the results you want in a timely manner:

import pandas as pd

df_1_data = [[70, 478, 0.0027788935694843],
             [70, 479, 0.0016728754853829],
             [70, 480, 0.0004271405050531],
             [70, 481, 0.0015028485795482],
             [70, 482, 0.0121498983353376],
             [70, 483, 0.0045067127794027],
             [70, 484, 0.0018709792057052],
             [70, 485, 0.0065224897116422],
             [70, 486, 0.0047862790524959],
             [70, 487, 0.0265655759721994]]
df_1 = pd.DataFrame(df_1_data, columns=['Origin', 'Destination', 'Value'])
df_1 = df_1.set_index(df_1['Origin'].astype(str) + df_1['Destination'].astype(str))

# Replicating sample df2 (.read_csv from csv file 2)
df_2_data = [[70, 478, 135.9743652],
             [70, 479, 130.9367523],
             [70, 480, 111.1917343],
             [70, 481, 98.17074585],
             [70, 482, 88.25764465],
             [70, 483, 102.0955658],
             [70, 484, 103.5853729],
             [70, 485, 114.2984314],
             [70, 486, 97.33105469],
             [70, 487, 85.754776]]
df_2 = pd.DataFrame(df_2_data, columns=['Origin', 'Destination', 'Value'])
df_2 = df_2.set_index(df_2['Origin'].astype(str) + df_2['Destination'].astype(str))

df_1.columns = ['Origin', 'Destination', 'Demand']
df_2.columns = ['Origin', 'Destination', 'Time']
df_merge = df_1.merge(df_2, on=['Origin', 'Destination'], how='inner')
df_merge['Demand_Time'] = df_merge['Demand'].values * df_merge['Time'].values
print(df_merge)
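More generally, if per-row Python logic ever is truly unavoidable, a common pattern (a sketch with toy data, not the code above) is to accumulate plain dicts in a list and build the DataFrame once at the end. Calling `.append` inside the loop re-copies the entire frame on every iteration, which is what makes the original approach so slow:

```python
import pandas as pd

# Toy stand-in for the rows of the compare table.
compare_rows = [
    {'Origin': 70, 'Destination': 478, 'Demand': 0.002779, 'Time': 135.974365},
    {'Origin': 70, 'Destination': 479, 'Demand': 0.001673, 'Time': 130.936752},
]

# Appending to a Python list is cheap per row; the DataFrame is built once.
records = []
for row in compare_rows:
    row['Demand_Time'] = row['Demand'] * row['Time']
    records.append(row)

df_out = pd.DataFrame(records)
print(df_out)
```

Even so, vectorised operations like the merge above should be preferred whenever the per-row logic can be expressed as column arithmetic.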