I have two dataframes that i need to compare on two columns based on one condition.
df1 looks like this;
Name | Value_1 | Value_2 |
---|---|---|
A | cat | pat |
B | mat | mat |
C | pat | sat |
df2 looks like this;
Name | Value_1 | Value_2 |
---|---|---|
D | pat | cat |
E | bat | mat |
F | tot | pat |
I would like to match Value_1 and Value_2 from both the dataframes and return the Names from both the data frames when there is match in either value_1 or value_2.
result should look like this- It should have the names from both df1 and df2 if there is a match for value_1 or value_2.
Name_x | Value | Name_y |
---|---|---|
A | pat | D |
B | bat | E |
C | tot | F |
I have tried pd.merge with 'on' condition and 'left_on' and 'right_on'. It works on matching one column. I cannot specify either/or condition. pd.merge only returns a result if there is a match in both value_1 and value_2. How do i get the Names when either value_1 or value_2 from df1 has a match with either value_1 or value_2 in df2?
CodePudding user response:
You can use pd.wide_to_long
to unpivot a DataFrame from wide to long format then merge the dataframe
df1 = (pd.wide_to_long(df1, stubnames='Value', i='Name', sep='_', j='i')
.reset_index(level='Name'))
df2 = (pd.wide_to_long(df2, stubnames='Value', i='Name', sep='_', j='i')
.reset_index(level='Name'))
out = pd.merge(df1, df2, on=['Value']).drop_duplicates()
print(out)
Name_x Value Name_y
0 A cat D
1 B mat E
3 C pat D
4 C pat F
5 A pat D
6 A pat F
CodePudding user response:
If you don't care if it is in Value_1 or Value_2, use melt to flatten each dataframe to a single column, and merge on that new column
>>> df1
Name Value_1 Value_2
0 A cat pat
1 B mat mat
2 C pat sat
>>> df2
Name Value_1 Value_2
0 D pat cat
1 E bat mat
2 F tot pat
>>> df1.melt('Name')
Name variable value
0 A Value_1 cat
1 B Value_1 mat
2 C Value_1 pat
3 A Value_2 pat
4 B Value_2 mat
5 C Value_2 sat
>>> df1.melt('Name').merge(df2.melt('Name'), on='value')
Name_x variable_x value Name_y variable_y
0 A Value_1 cat D Value_2
1 B Value_1 mat E Value_2
2 B Value_2 mat E Value_2
3 C Value_1 pat D Value_1
4 C Value_1 pat F Value_2
5 A Value_2 pat D Value_1
6 A Value_2 pat F Value_2
>>>
CodePudding user response:
From your OP I take it you are happy if you get repeats. For example, B value_1 and value_2 both match E value_2 to create two rows. If so, melting and merging will get you there:
newdf = df1.melt(id_vars = 'Name').merge(df2.melt(id_vars='Name'), how='inner', on='value')
Output:
Name_x variable_x value Name_y variable_y
0 A Value_1 cat D Value_2
1 B Value_1 mat E Value_2
2 B Value_2 mat E Value_2
3 C Value_1 pat D Value_1
4 C Value_1 pat F Value_2
5 A Value_2 pat D Value_1
6 A Value_2 pat F Value_2
Then if you want to just get the columns shown in the OP you can just extract the columns needed:
newdf = newdf[['Name_x', 'value', 'Name_y']]
Output:
Name_x value Name_y
0 A cat D
1 B mat E
2 B mat E
3 C pat D
4 C pat F
5 A pat D
6 A pat F
Then to get rid of index you can set Name_x to index if you like. Hope this helps.