Home > Software engineering >  Merging two pandas dataframe by a string type column allowing not exact match
Merging two pandas dataframe by a string type column allowing not exact match

Time:08-12

I have two pandas dataframes, df1 and df2

df1 looks like this:

|Name         |Age  |Country  |Industry|Score  |
-------------------------------------------------
|BIMBOA MM    |8     |CO      |Paper   |5      |
|AGUA* MM     |13    |CH      |Finance |7      |
|ENTEL CI     |5     |PE      |Paper   |2      |
|AXTELCPO MM  |4     |CO      |Oil     |1      |
|ALPEKA MM    |2     |BR      |Oil     |9      |
|KIMBERA MM   |12    |AR      |Finance |1      |

And df2 looks like this:

|Name             |Age  |Country  |Industry|Research Level |
-------------------------------------------------------------
|BIMBOA.MX        |8     |CO      |Paper   |4              |
|AGUA.MX          |13    |CH      |Finance |0              |
|86964WAJ1=1M     |5     |PE      |Paper   |2              |
|USP9810XAA9=1M   |4     |CO      |Oil     |1              |
|KIMBERA.MX       |2     |BR      |Oil     |0              |
|AXTELCPO.MX      |12    |AR      |Finance |3              |

And I want to merge them by the column "Name". But, I want to allow to merge with not exact coincidence of the strings in the column. So I coul get somethin like this:

|Name         |Age  |Country  |Industry|Score  |Research Level |
-----------------------------------------------------------------
|BIMBOA MM    |8     |CO      |Paper   |5      |4              |
|AGUA* MM     |13    |CH      |Finance |7      |0              |
|ENTEL CI     |5     |PE      |Paper   |2      |NaN            |
|AXTELCPO MM  |4     |CO      |Oil     |1      |3              |
|ALPEKA MM    |2     |BR      |Oil     |9      |NaN            |
|KIMBERA MM   |12    |AR      |Finance |1      |0              |

I tried with this:

merged_df1_df2 = pd.merge_asof(df1, df2, on='Name')

But I got this error:

MergeError: Incompatible merge dtype, dtype('O') and dtype('O'), both sides must have numeric dtype

Do you guys any idea of how can I do this?

CodePudding user response:

Given the provided data, it looks like a simple merge on the cleaned-up Name columns should do the trick:

df1.merge(df2[['Research Level']], how='left',
          left_on=df1['Name'].str.extract('(\w )', expand=False),
          right_on=df2['Name'].str.extract('(\w )', expand=False),
         ).drop(columns='key_0')

output:

          Name  Age Country Industry  Score  Research Level
0    BIMBOA MM    8      CO    Paper      5             4.0
1     AGUA* MM   13      CH  Finance      7             0.0
2     ENTEL CI    5      PE    Paper      2             NaN
3  AXTELCPO MM    4      CO      Oil      1             3.0
4    ALPEKA MM    2      BR      Oil      9             NaN
5   KIMBERA MM   12      AR  Finance      1             0.0

CodePudding user response:

import pandas as pd

df1 = pd.DataFrame({'Name': ['BIMBOA MM', 'AGUA* MM', 'ENTEL CI', 'AXTELCPO MM', 'ALPEKA MM', 'KIMBERA MM'],
                    'Age': [8, 13, 5, 4, 2, 12],
                    'Country': ['CO','CH','PE','CO','BR','AR'],
                    'Industry': ['Paper','Finance','Paper','Oil','Oil','Finance'],
                    'Score': [5,7,2,1,9,1]
                    })


df2 = pd.DataFrame({'Name': ['BIMBOA MM', 'AGUA* MM', '86964WAJ1=1M', 'USP9810XAA9=1M', 'KIMBERA MM','AXTELCPO MM'],
                    'Age': [8, 13, 5, 4, 2, 12],
                    'Country': ['CO','CH','PE','CO','BR','AR'],
                    'Industry': ['Paper','Finance','Paper','Oil','Oil','Finance'],
                    'Research_Level': [4,0,2,1,0,3]
                    })



df_left_merge = pd.merge(df1, df2, how='left')

print(df_left_merge)

CodePudding user response:

Use difflib, but got to find the cutoff value that meets your request.

import difflib

def get_closest_match(x, other, cutoff):
    matches = difflib.get_close_matches(x, other, cutoff=cutoff)
    return matches[0] if matches else None


def fuzzy_merge(df1, df2, left_on, right_on, how='inner', cutoff=0.6):
    df_other= df2.copy()
    df_other[left_on] = [get_closest_match(x, df1[left_on], cutoff) 
                         for x in df_other[right_on]]
    return df1.merge(df_other, on=left_on, how=how)
fuzzy_merge(df1, df2, 'Name', 'Name', how='left', cutoff=0.5)

enter image description here

Reference: is it possible to do fuzzy match merge with python pandas?

CodePudding user response:

To merge string columns without exact match you need to perform fuzzy merge

You can check different versions of this merge on this question: is it possible to do fuzzy match merge with python pandas?

  • Related