I have multiple sets of very large CSV files that I need to merge based on a unique ID. I set this unique ID as the index; it is a concatenation of my Origin and Destination columns.
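For context, the index is built like this (a minimal sketch with stand-in data; the real dataframes come from `read_csv`):

```python
import pandas as pd

# Stand-in for one of the real CSV files
df = pd.DataFrame({'Origin': [70, 70],
                   'Destination': [478, 479],
                   'Value': [0.002779, 0.001673]})

# Unique ID: string concatenation of Origin and Destination
df = df.set_index(df['Origin'].astype(str) + df['Destination'].astype(str))
```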
Dataframe 1:
| | Origin | Destination | Value |
|---|---|---|---|
70478 | 70 | 478 | 0.002779 |
70479 | 70 | 479 | 0.001673 |
70480 | 70 | 480 | 0.000427 |
70481 | 70 | 481 | 0.001503 |
70482 | 70 | 482 | 0.01215 |
70483 | 70 | 483 | 0.004507 |
70484 | 70 | 484 | 0.001871 |
70485 | 70 | 485 | 0.006522 |
70486 | 70 | 486 | 0.004786 |
70487 | 70 | 487 | 0.026566 |
Dataframe 2:
| | Origin | Destination | Value |
|---|---|---|---|
70478 | 70 | 478 | 135.974365 |
70479 | 70 | 479 | 130.936752 |
70480 | 70 | 480 | 111.191734 |
70481 | 70 | 481 | 98.170746 |
70482 | 70 | 482 | 88.257645 |
70483 | 70 | 483 | 102.095566 |
70484 | 70 | 484 | 103.585373 |
70485 | 70 | 485 | 114.298431 |
70486 | 70 | 486 | 97.331055 |
70487 | 70 | 487 | 85.754776 |
My final table should be as follows (Demand = Value from df1; Time = Value from df2; Demand_Time = Demand * Time):
| | Origin | Destination | Demand | Time | Demand_Time |
|---|---|---|---|---|---|
0 | 70 | 478 | 0.002779 | 135.974365 | 0.377858 |
1 | 70 | 479 | 0.001673 | 130.936752 | 0.219041 |
2 | 70 | 480 | 0.000427 | 111.191734 | 0.047494 |
3 | 70 | 481 | 0.001503 | 98.170746 | 0.147536 |
4 | 70 | 482 | 0.01215 | 88.257645 | 1.072321 |
5 | 70 | 483 | 0.004507 | 102.095566 | 0.460115 |
6 | 70 | 484 | 0.001871 | 103.585373 | 0.193806 |
7 | 70 | 485 | 0.006522 | 114.298431 | 0.74551 |
8 | 70 | 486 | 0.004786 | 97.331055 | 0.465854 |
9 | 70 | 487 | 0.026566 | 85.754776 | 2.278125 |
I run a `.compare` between df_1 and df_2, which produces the following new dataframe:
| | Origin (self) | Origin (other) | Destination (self) | Destination (other) | Value (self) | Value (other) |
|---|---|---|---|---|---|---|
70478 | 70 | 70 | 478 | 478 | 0.002779 | 135.974365 |
70479 | 70 | 70 | 479 | 479 | 0.001673 | 130.936752 |
70480 | 70 | 70 | 480 | 480 | 0.000427 | 111.191734 |
70481 | 70 | 70 | 481 | 481 | 0.001503 | 98.170746 |
70482 | 70 | 70 | 482 | 482 | 0.01215 | 88.257645 |
70483 | 70 | 70 | 483 | 483 | 0.004507 | 102.095566 |
70484 | 70 | 70 | 484 | 484 | 0.001871 | 103.585373 |
70485 | 70 | 70 | 485 | 485 | 0.006522 | 114.298431 |
70486 | 70 | 70 | 486 | 486 | 0.004786 | 97.331055 |
70487 | 70 | 70 | 487 | 487 | 0.026566 | 85.754776 |
I then create a new, empty final DataFrame (`df_out`), iterate over the compare table above, and `.append` each row to it.
The last part that iterates and appends takes a very long time on very large tables (a few hundred thousand records each) - about 1.5 hours each time.
Is there a way to do this last part more efficiently?
Thank you.
Code sample:
import pandas as pd
# Replicating sample df1 (.read_csv from csv file 1)
df_1_data = [[70, 478, 0.0027788935694843],
[70, 479, 0.0016728754853829],
[70, 480, 0.0004271405050531],
[70, 481, 0.0015028485795482],
[70, 482, 0.0121498983353376],
[70, 483, 0.0045067127794027],
[70, 484, 0.0018709792057052],
[70, 485, 0.0065224897116422],
[70, 486, 0.0047862790524959],
[70, 487, 0.0265655759721994]]
df_1 = pd.DataFrame(df_1_data, columns=['Origin', 'Destination', 'Value'])
df_1 = df_1.set_index(df_1['Origin'].astype(str) + df_1['Destination'].astype(str))
print(df_1)
# Replicating sample df2 (.read_csv from csv file 2)
df_2_data = [[70, 478, 135.9743652],
[70, 479, 130.9367523],
[70, 480, 111.1917343],
[70, 481, 98.17074585],
[70, 482, 88.25764465],
[70, 483, 102.0955658],
[70, 484, 103.5853729],
[70, 485, 114.2984314],
[70, 486, 97.33105469],
[70, 487, 85.754776]]
df_2 = pd.DataFrame(df_2_data, columns=['Origin', 'Destination', 'Value'])
df_2 = df_2.set_index(df_2['Origin'].astype(str) + df_2['Destination'].astype(str))
print(df_2)
df_compare = df_1.compare(df_2, keep_shape=True, keep_equal=True)
print(df_compare)
df_out = pd.DataFrame(columns=['Origin', 'Destination', 'Demand', 'Time', 'Demand_Time'])
for index, row in df_compare.iterrows():
    df_out = df_out.append({'Origin': int(row['Origin']['self']), 'Destination': int(row['Destination']['self']),
                            'Demand': row['Value']['self'], 'Time': row['Value']['other'],
                            'Demand_Time': row['Value']['self'] * row['Value']['other']}, ignore_index=True)
print(df_out)
print('\nCOMPLETED')
CodePudding user response:
IIUC, you can use:
out = (df_1.rename(columns={'Value': 'Demand'})
           .assign(Time=df_2['Value'], Demand_Time=df_2['Value'] * df_1['Value'])
           .reset_index(drop=True))
print(out)
# Output
Origin Destination Demand Time Demand_Time
0 70 478 0.002779 135.974365 0.377873
1 70 479 0.001673 130.936752 0.219057
2 70 480 0.000427 111.191734 0.047479
3 70 481 0.001503 98.170746 0.147551
4 70 482 0.012150 88.257645 1.072330
5 70 483 0.004507 102.095566 0.460145
6 70 484 0.001871 103.585373 0.193808
7 70 485 0.006522 114.298431 0.745454
8 70 486 0.004786 97.331055 0.465826
9 70 487 0.026566 85.754776 2.278161
CodePudding user response:
If I understood the request correctly, I would use a combination of pandas and NumPy to get the results you want in a timely manner:
import pandas as pd
import numpy as np
df_1_data = [[70, 478, 0.0027788935694843],
[70, 479, 0.0016728754853829],
[70, 480, 0.0004271405050531],
[70, 481, 0.0015028485795482],
[70, 482, 0.0121498983353376],
[70, 483, 0.0045067127794027],
[70, 484, 0.0018709792057052],
[70, 485, 0.0065224897116422],
[70, 486, 0.0047862790524959],
[70, 487, 0.0265655759721994]]
df_1 = pd.DataFrame(df_1_data, columns=['Origin', 'Destination', 'Value'])
df_1 = df_1.set_index(df_1['Origin'].astype(str) + df_1['Destination'].astype(str))
# Replicating sample df2 (.read_csv from csv file 2)
df_2_data = [[70, 478, 135.9743652],
[70, 479, 130.9367523],
[70, 480, 111.1917343],
[70, 481, 98.17074585],
[70, 482, 88.25764465],
[70, 483, 102.0955658],
[70, 484, 103.5853729],
[70, 485, 114.2984314],
[70, 486, 97.33105469],
[70, 487, 85.754776]]
df_2 = pd.DataFrame(df_2_data, columns=['Origin', 'Destination', 'Value'])
df_2 = df_2.set_index(df_2['Origin'].astype(str) + df_2['Destination'].astype(str))
df_1.columns = ['Origin', 'Destination', 'Demand']
df_2.columns = ['Origin', 'Destination', 'Time']
df_merge = df_1.merge(df_2, how='inner')
df_merge['Demand_Time'] = df_merge['Demand'] * df_merge['Time']
df_merge
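Whichever merge you use, the main cost in the original code is calling `DataFrame.append` inside the loop, which copies the whole frame on every iteration. Even a plain loop becomes fast if you collect rows in a Python list and build the DataFrame once at the end (a sketch with stand-in data):

```python
import pandas as pd

# Stand-in rows; in the real code these would come from iterating the compare table
pairs = [(70, 478, 0.5, 100.0), (70, 479, 0.25, 200.0)]

rows = []
for origin, dest, demand, time in pairs:
    rows.append({'Origin': origin, 'Destination': dest,
                 'Demand': demand, 'Time': time,
                 'Demand_Time': demand * time})

# One DataFrame construction instead of repeated .append calls
df_out = pd.DataFrame(rows)
```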