Home > Software engineering >  Compare two different columns from two dataframes
Compare two different columns from two dataframes

Time:05-24

I have two dataframes that i need to compare on two columns based on one condition.

df1 looks like this;

Name Value_1 Value_2
A cat pat
B mat mat
C pat sat

df2 looks like this;

Name Value_1 Value_2
D pat cat
E bat mat
F tot pat

I would like to match Value_1 and Value_2 from both the dataframes and return the Names from both the data frames when there is match in either value_1 or value_2.

result should look like this- It should have the names from both df1 and df2 if there is a match for value_1 or value_2.

Name_x Value Name_y
A pat D
B bat E
C tot F

I have tried pd.merge with 'on' condition and 'left_on' and 'right_on'. It works on matching one column. I cannot specify either/or condition. pd.merge only returns a result if there is a match in both value_1 and value_2. How do i get the Names when either value_1 or value_2 from df1 has a match with either value_1 or value_2 in df2?

CodePudding user response:

You can use pd.wide_to_long to unpivot a DataFrame from wide to long format then merge the dataframe

df1 = (pd.wide_to_long(df1, stubnames='Value', i='Name', sep='_', j='i')
       .reset_index(level='Name'))
df2 = (pd.wide_to_long(df2, stubnames='Value', i='Name', sep='_', j='i')
       .reset_index(level='Name'))

out = pd.merge(df1, df2, on=['Value']).drop_duplicates()
print(out)

  Name_x Value Name_y
0      A   cat      D
1      B   mat      E
3      C   pat      D
4      C   pat      F
5      A   pat      D
6      A   pat      F

CodePudding user response:

If you don't care if it is in Value_1 or Value_2, use melt to flatten each dataframe to a single column, and merge on that new column

>>> df1
  Name Value_1 Value_2
0    A     cat     pat
1    B     mat     mat
2    C     pat     sat
>>> df2
  Name Value_1 Value_2
0    D     pat     cat
1    E     bat     mat
2    F     tot     pat

>>> df1.melt('Name')
  Name variable value
0    A  Value_1   cat
1    B  Value_1   mat
2    C  Value_1   pat
3    A  Value_2   pat
4    B  Value_2   mat
5    C  Value_2   sat

>>> df1.melt('Name').merge(df2.melt('Name'), on='value')
  Name_x variable_x value Name_y variable_y
0      A    Value_1   cat      D    Value_2
1      B    Value_1   mat      E    Value_2
2      B    Value_2   mat      E    Value_2
3      C    Value_1   pat      D    Value_1
4      C    Value_1   pat      F    Value_2
5      A    Value_2   pat      D    Value_1
6      A    Value_2   pat      F    Value_2
>>>

CodePudding user response:

From your OP I take it you are happy if you get repeats. For example, B value_1 and value_2 both match E value_2 to create two rows. If so, melting and merging will get you there:

newdf = df1.melt(id_vars = 'Name').merge(df2.melt(id_vars='Name'), how='inner', on='value')

Output:

    Name_x  variable_x  value   Name_y  variable_y
0   A        Value_1    cat     D       Value_2
1   B        Value_1    mat     E       Value_2
2   B        Value_2    mat     E       Value_2
3   C        Value_1    pat     D       Value_1
4   C        Value_1    pat     F       Value_2
5   A        Value_2    pat     D       Value_1
6   A        Value_2    pat     F       Value_2

Then if you want to just get the columns shown in the OP you can just extract the columns needed:

newdf = newdf[['Name_x', 'value', 'Name_y']]

Output:

    Name_x  value   Name_y
0   A        cat    D
1   B        mat    E
2   B        mat    E
3   C        pat    D
4   C        pat    F
5   A        pat    D
6   A        pat    F

Then to get rid of index you can set Name_x to index if you like. Hope this helps.

  • Related