I have 2 data frames with different lengths: len(df1) 10104 len(df2) 15560
I want to merge these based on a common column (taskID) on both data frames. The Task ID has repeated IDs each ID represent an item belongs to the same task.
example:
df1:
TaskID | task duration |
---|---|
task 45 | 2 mins |
task 45 | 5 mins |
task 45 | 7 mins |
task 67 | 8 mins |
task 67 | 9 mins |
df2:
TaskID | gender |
---|---|
task 45 | male |
task 45 | male |
task 45 | male |
task 67 | female |
task 67 | female |
when I use Pandas merge function I get an output with a length of 471437 which is greater than the length of both data frames.
I think this is because of the repeated values of TaskID, and I can't remove the duplicates as each row represent different item.
I tried:
merged=pd.merge(df1,df2, on='TaskID', how='inner')
I get an output with a length of 471437
Can you please help with this issue ?
CodePudding user response:
You can use groupby.cumcount
to deduplicate the TaskIDs and merge in order:
df1.merge(df2, left_on=['TaskID', df1.groupby('TaskID').cumcount()],
right_on=['TaskID', df2.groupby('TaskID').cumcount()])
Output:
TaskID key_1 task duration gender
0 task 45 0 2 mins male
1 task 45 1 5 mins male
2 task 45 2 7 mins male
3 task 67 0 8 mins female
4 task 67 1 9 mins female
If the gender is unique per TaskID, you can update in place with:
df1['gender'] = df1['TaskID'].map(df2.drop_duplicates('TaskID').set_index('TaskID')['gender'])
CodePudding user response:
You can cross join the two datasets and mask & remove the duplicated entries:
df1 = pd.DataFrame(data=[["task 45","2 mins"],["task 45","5 mins"],["task 45","7 mins"],["task 67","8 mins"],["task 67","9 mins"]], columns=["TaskID","task duration"])
df2 = pd.DataFrame(data=[["task 45","male"],["task 45","male"],["task 45","male"],["task 67","female"],["task 67","female"]], columns=["TaskID","gender"])
result_df = df1.merge(df2)
cols = ['TaskID','task duration','gender']
result_df[['task duration','gender']] = result_df[cols].mask(result_df[cols].duplicated())[['task duration','gender']]
result_df = result_df.dropna().reset_index(drop=True)
[Out]:
TaskID task duration gender
0 task 45 2 mins male
1 task 45 5 mins male
2 task 45 7 mins male
3 task 67 8 mins female
4 task 67 9 mins female