The tables and code at the bottom will probably explain this much better than a description. I have a solution that works (see the bottom), but I think it's very inefficient.
Problem:
I have two data frames, df_1 and df_2. These dataframes share a match column, match_id.
df_2 has a date column that I am trying to write into df_1.
Every match_id in df_2 exists in df_1.
I need the dates to be written sequentially into df_1 based on the match_id.
Importantly, though, I only want each row of df_2 to be used once in the matching process.
If there are not enough rows in df_2 to fill all the matching rows in df_1, the remaining rows in df_1 should be left blank.
It's much easier if I show it:
df_1:
index | match_id | date |
---|---|---|
0 | 45 | |
1 | 45 | |
2 | 45 | |
3 | 45 | |
4 | 46 | |
5 | 46 | |
6 | 47 | |
df_2:
index | match_id | date |
---|---|---|
0 | 45 | 01/01/22 |
1 | 45 | 02/01/22 |
2 | 46 | 02/01/22 |
3 | 46 | 05/01/22 |
Output (updated df_1):
index | match_id | date |
---|---|---|
0 | 45 | 01/01/22 |
1 | 45 | 02/01/22 |
2 | 45 | |
3 | 45 | |
4 | 46 | 02/01/22 |
5 | 46 | 05/01/22 |
6 | 47 | |
I have a solution that works, but I am certain there must be a much more time/resource-efficient way of doing it (I'm still really new to Python and pretty new to coding), since in practice I'll be running it over much larger datasets:
import pandas as pd

data_1 = [[45, ""], [45, ""], [45, ""], [45, ""], [46, ""], [46, ""], [47, ""]]
df_1 = pd.DataFrame(data_1, columns=['match_id', 'date'])

data_2 = [[45, "01/01/22"], [45, "02/01/22"], [46, "02/01/22"], [46, "05/01/22"]]
df_2 = pd.DataFrame(data_2, columns=['match_id', 'date'])

for i_df_1, r_df_1 in df_1.iterrows():
    for i_df_2, r_df_2 in df_2.iterrows():
        if r_df_1["match_id"] == r_df_2["match_id"]:
            # Write the date into df_1
            df_1.loc[i_df_1, "date"] = r_df_2["date"]
            # Drop the used row from df_2 so it does not get used again
            df_2 = df_2.drop(i_df_2)
            break
CodePudding user response:
You can compute an extra key with groupby.cumcount and use it in merge:
df_3 = (df_1
        #.drop(columns='date')  # uncomment if df_1 already has an empty date column
        .merge(df_2,
               left_on=['match_id', df_1.groupby('match_id').cumcount()],
               right_on=['match_id', df_2.groupby('match_id').cumcount()],
               how='left'
               )
        #.drop(columns='key_1')  # uncomment if unwanted
        )
output:
   match_id  key_1      date
0        45      0  01/01/22
1        45      1  02/01/22
2        45      2       NaN
3        45      3       NaN
4        46      0  02/01/22
5        46      1  05/01/22
6        47      0       NaN
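If you want the dates written back into df_1 itself (as in the question) rather than kept in a new df_3, one option is to assign the merged column back afterwards. A minimal sketch, assuming df_1 starts without a date column:

```python
import pandas as pd

df_1 = pd.DataFrame({'match_id': [45, 45, 45, 45, 46, 46, 47]})
df_2 = pd.DataFrame({'match_id': [45, 45, 46, 46],
                     'date': ['01/01/22', '02/01/22', '02/01/22', '05/01/22']})

# Merge on match_id plus a per-group counter, so each row of df_2
# can only pair with one row of df_1
df_3 = df_1.merge(df_2,
                  left_on=['match_id', df_1.groupby('match_id').cumcount()],
                  right_on=['match_id', df_2.groupby('match_id').cumcount()],
                  how='left').drop(columns='key_1')

# A left merge keeps one output row per row of df_1, in the same order,
# so the date column lines up row for row with df_1
df_1['date'] = df_3['date'].to_numpy()
```

Rows of df_1 with no remaining partner in df_2 end up with NaN rather than an empty string; use fillna('') if you need blanks.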