Merging two data frames based on a common column with repeated values-CodePudding

I have 2 data frames with different lengths: len(df1) 10104 len(df2) 15560

I want to merge these based on a common column (taskID) on both data frames. The Task ID has repeated IDs each ID represent an item belongs to the same task.

example:

df1:

TaskID	task duration
task 45	2 mins
task 45	5 mins
task 45	7 mins
task 67	8 mins
task 67	9 mins

df2:

TaskID	gender
task 45	male
task 45	male
task 45	male
task 67	female
task 67	female

when I use Pandas merge function I get an output with a length of 471437 which is greater than the length of both data frames.

I think this is because of the repeated values of TaskID, and I can't remove the duplicates as each row represent different item.

I tried:

merged=pd.merge(df1,df2, on='TaskID', how='inner')

I get an output with a length of 471437

Can you please help with this issue ?

CodePudding user response：

You can use groupby.cumcount to deduplicate the TaskIDs and merge in order:

df1.merge(df2, left_on=['TaskID', df1.groupby('TaskID').cumcount()],
               right_on=['TaskID', df2.groupby('TaskID').cumcount()])

Output:

    TaskID  key_1 task duration  gender
0  task 45      0        2 mins    male
1  task 45      1        5 mins    male
2  task 45      2        7 mins    male
3  task 67      0        8 mins  female
4  task 67      1        9 mins  female

If the gender is unique per TaskID, you can update in place with:

df1['gender'] = df1['TaskID'].map(df2.drop_duplicates('TaskID').set_index('TaskID')['gender'])

CodePudding user response：

You can cross join the two datasets and mask & remove the duplicated entries:

df1 = pd.DataFrame(data=[["task 45","2 mins"],["task 45","5 mins"],["task 45","7 mins"],["task 67","8 mins"],["task 67","9 mins"]], columns=["TaskID","task duration"])
df2 = pd.DataFrame(data=[["task 45","male"],["task 45","male"],["task 45","male"],["task 67","female"],["task 67","female"]], columns=["TaskID","gender"])
result_df = df1.merge(df2)
cols = ['TaskID','task duration','gender']
result_df[['task duration','gender']] = result_df[cols].mask(result_df[cols].duplicated())[['task duration','gender']]
result_df = result_df.dropna().reset_index(drop=True)

[Out]:
    TaskID task duration  gender
0  task 45        2 mins    male
1  task 45        5 mins    male
2  task 45        7 mins    male
3  task 67        8 mins  female
4  task 67        9 mins  female