Merging two pandas dataframe by a string type column allowing not exact match-CodePudding

I have two pandas dataframes, df1 and df2

df1 looks like this:

|Name         |Age  |Country  |Industry|Score  |
-------------------------------------------------
|BIMBOA MM    |8     |CO      |Paper   |5      |
|AGUA* MM     |13    |CH      |Finance |7      |
|ENTEL CI     |5     |PE      |Paper   |2      |
|AXTELCPO MM  |4     |CO      |Oil     |1      |
|ALPEKA MM    |2     |BR      |Oil     |9      |
|KIMBERA MM   |12    |AR      |Finance |1      |

And df2 looks like this:

|Name             |Age  |Country  |Industry|Research Level |
-------------------------------------------------------------
|BIMBOA.MX        |8     |CO      |Paper   |4              |
|AGUA.MX          |13    |CH      |Finance |0              |
|86964WAJ1=1M     |5     |PE      |Paper   |2              |
|USP9810XAA9=1M   |4     |CO      |Oil     |1              |
|KIMBERA.MX       |2     |BR      |Oil     |0              |
|AXTELCPO.MX      |12    |AR      |Finance |3              |

And I want to merge them by the column "Name". But, I want to allow to merge with not exact coincidence of the strings in the column. So I coul get somethin like this:

|Name         |Age  |Country  |Industry|Score  |Research Level |
-----------------------------------------------------------------
|BIMBOA MM    |8     |CO      |Paper   |5      |4              |
|AGUA* MM     |13    |CH      |Finance |7      |0              |
|ENTEL CI     |5     |PE      |Paper   |2      |NaN            |
|AXTELCPO MM  |4     |CO      |Oil     |1      |3              |
|ALPEKA MM    |2     |BR      |Oil     |9      |NaN            |
|KIMBERA MM   |12    |AR      |Finance |1      |0              |

I tried with this:

merged_df1_df2 = pd.merge_asof(df1, df2, on='Name')

But I got this error:

MergeError: Incompatible merge dtype, dtype('O') and dtype('O'), both sides must have numeric dtype

Do you guys any idea of how can I do this?

CodePudding user response：

Given the provided data, it looks like a simple merge on the cleaned-up Name columns should do the trick:

df1.merge(df2[['Research Level']], how='left',
          left_on=df1['Name'].str.extract('(\w )', expand=False),
          right_on=df2['Name'].str.extract('(\w )', expand=False),
         ).drop(columns='key_0')

output:

          Name  Age Country Industry  Score  Research Level
0    BIMBOA MM    8      CO    Paper      5             4.0
1     AGUA* MM   13      CH  Finance      7             0.0
2     ENTEL CI    5      PE    Paper      2             NaN
3  AXTELCPO MM    4      CO      Oil      1             3.0
4    ALPEKA MM    2      BR      Oil      9             NaN
5   KIMBERA MM   12      AR  Finance      1             0.0

CodePudding user response：

import pandas as pd

df1 = pd.DataFrame({'Name': ['BIMBOA MM', 'AGUA* MM', 'ENTEL CI', 'AXTELCPO MM', 'ALPEKA MM', 'KIMBERA MM'],
                    'Age': [8, 13, 5, 4, 2, 12],
                    'Country': ['CO','CH','PE','CO','BR','AR'],
                    'Industry': ['Paper','Finance','Paper','Oil','Oil','Finance'],
                    'Score': [5,7,2,1,9,1]
                    })


df2 = pd.DataFrame({'Name': ['BIMBOA MM', 'AGUA* MM', '86964WAJ1=1M', 'USP9810XAA9=1M', 'KIMBERA MM','AXTELCPO MM'],
                    'Age': [8, 13, 5, 4, 2, 12],
                    'Country': ['CO','CH','PE','CO','BR','AR'],
                    'Industry': ['Paper','Finance','Paper','Oil','Oil','Finance'],
                    'Research_Level': [4,0,2,1,0,3]
                    })



df_left_merge = pd.merge(df1, df2, how='left')

print(df_left_merge)

CodePudding user response：

Use difflib, but got to find the cutoff value that meets your request.

import difflib

def get_closest_match(x, other, cutoff):
    matches = difflib.get_close_matches(x, other, cutoff=cutoff)
    return matches[0] if matches else None


def fuzzy_merge(df1, df2, left_on, right_on, how='inner', cutoff=0.6):
    df_other= df2.copy()
    df_other[left_on] = [get_closest_match(x, df1[left_on], cutoff) 
                         for x in df_other[right_on]]
    return df1.merge(df_other, on=left_on, how=how)

fuzzy_merge(df1, df2, 'Name', 'Name', how='left', cutoff=0.5)

Reference: is it possible to do fuzzy match merge with python pandas?

CodePudding user response：

To merge string columns without exact match you need to perform fuzzy merge

You can check different versions of this merge on this question: is it possible to do fuzzy match merge with python pandas?