I have 2 below dataframes. I would like to merge both dataframes on "Item" column and below is my expected output and Code so far. However, below pandas
code is displaying duplicate rows as shown below. Not sure what is wrong with the code ? I am thinking may be I have to do string search on the "Item"column and then merge the dataframe. However, I am not sure how to achieve this ? Please provide explanation with your code.
Thanks in advance for your time!
1st Dataframe
Item ID
ABC 1
ABC 2
ABC 3
CD 12
EF 11
2nd DataFrame
Name Item Price
Name1 ABC 123.00
Name2 ABC 110
Name 3 ABC 100
Name4 CD 50
Expected Output
Name Item ID Price
Name1 ABC 1 123.00
Name2 ABC 2 110
Name 3 ABC 3 100
Name4 CD 12 50
Code
pd.merge(df1,df2,on="Item",how='outer')
Above code displays below output
Item ID Name Price
0 ABC 1 Name1 123.0
1 ABC 1 Name2 110.0
2 ABC 1 Name3 100.0
3 ABC 2 Name1 123.0
4 ABC 2 Name2 110.0
5 ABC 2 Name3 100.0
6 ABC 3 Name1 123.0
7 ABC 3 Name2 110.0
8 ABC 3 Name3 100.0
9 CD 12 Name4 50.0
10 EF 11 NaN NaN
CodePudding user response:
The simplest method I can come up with is to add another column to your input dataframes which mimics the index [0, 1, 2, 3...]. This produces a unique entry even when the 'Item' is duplicated, and leads to a correct merge.
Example:
import pandas as pd
A = [[1, 2, 3, 4], ['A', 'A', 'B', 'C'], [0, 1, 2, 3]]
B = [[10, 20, 30, 40], ['A', 'A', 'B', 'C'], [0, 1, 2, 3]]
df = pd.DataFrame(A).T
df.columns = ['Val', 'ID', 'IDX']
print(df)
df2 = pd.DataFrame(B).T
df2.columns = ['Other', 'ID', 'IDX']
print(df2)
M = pd.merge(df, df2)
print(M)
CodePudding user response:
Enumerate Item
with groupby cumcount
in each DataFrame to allow for the positional alignment within the merge:
df3 = pd.merge(
df1, df2,
left_on=['Item', df1.groupby('Item').cumcount()],
right_on=['Item', df2.groupby('Item').cumcount()]
)
df3
:
Item key_1 ID Name Price
0 ABC 0 1 Name1 123.0
1 ABC 1 2 Name2 110.0
2 ABC 2 3 Name 3 100.0
3 CD 0 12 Name4 50.0
drop
can be added to remove this added merge key column:
df3 = pd.merge(
df1, df2,
left_on=['Item', df1.groupby('Item').cumcount()],
right_on=['Item', df2.groupby('Item').cumcount()]
).drop(columns='key_1')
df3
:
Item ID Name Price
0 ABC 1 Name1 123.0
1 ABC 2 Name2 110.0
2 ABC 3 Name 3 100.0
3 CD 12 Name4 50.0
Setup and imports:
import pandas as pd
df1 = pd.DataFrame({
'Item': ['ABC', 'ABC', 'ABC', 'CD', 'EF'],
'ID': [1, 2, 3, 12, 11]
})
df2 = pd.DataFrame({
'Name': ['Name1', 'Name2', 'Name 3', 'Name4'],
'Item': ['ABC', 'ABC', 'ABC', 'CD'],
'Price': [123.00, 110, 100, 50]
})