Python merge two dataframe based on text similarity of their columns-CodePudding

I am working with two dataframes which look like this: df1

country_1                               column1
united states of america                   abcd
Ireland (Republic of Ireland)              efgh
Korea Rep Of                               fsdf
Switzerland (Swiss Confederation)          dsaa

df2

country_2                               column2
united states                              cdda
Ireland                                    ddgd
South Korea                                rewt
Switzerland                                tuut

desired output:

country_1                               column1              country_2                column2
united states of america                   abcd              united states            cdda
Ireland (Republic of Ireland)              efgh              Ireland                  ddgd
Korea Rep Of                               fsdf              South Korea              rewt
Switzerland (Swiss Confederation)          dsaa              Switzerland              tuut

I am not that familiar with text analytics hence unable to understand any method to tackle this problem. I have tried string matching and regex but its not able to solve this problem.

CodePudding user response：

You can use difflib.

Data:

data1 = {
    "country_1": ["united states of america", "Ireland (Republic of Ireland)", "Korea Rep Of", "Switzerland (Swiss Confederation)"],
    "column1": ["abcd", "efgh", "fsdf", "dsaa"]
}
df1 = pd.DataFrame(data1)

data2 = {
    "country_2": ["united states", "Ireland", "Korea", "Switzerland"],
    "column2": ["cdda", "ddgd", "rewt", "tuut"]
}
df2 = pd.DataFrame(data2)

Code:

import difflib
from dataclasses import dataclass

import pandas as pd



@dataclass()
class FuzzyMerge:
    """
        Works like pandas merge except also merges on approximate matches.
    """
    left: pd.DataFrame
    right: pd.DataFrame
    left_on: str
    right_on: str
    how: str = "inner"
    cutoff: float = 0.3

    def main(self) -> pd.DataFrame:
        temp = self.right.copy()
        temp[self.left_on] = [
            self.get_closest_match(x, self.left[self.left_on]) for x in temp[self.right_on]
        ]

        return self.left.merge(temp, on=self.left_on, how=self.how)

    def get_closest_match(self, left: pd.Series, right: pd.Series) -> str or None:
        matches = difflib.get_close_matches(left, right, cutoff=self.cutoff)

        return matches[0] if matches else None

Call the class:

merged = FuzzyMerge(left=df1, right=df2, left_on="country_1", right_on="country_2").main().drop(columns="column2")
print(merged)

Output:

                           country_1 column1      country_2
0           united states of america    abcd  united states
1      Ireland (Republic of Ireland)    efgh        Ireland
2                       Korea Rep Of    fsdf          Korea
3  Switzerland (Swiss Confederation)    dsaa    Switzerland

CodePudding user response：

you can solve this problem by using pandas operations i.e using join,merge and concat: but I suggest you go through concat first as it is easy to start with

ps: make sure this is in form of Dataframe to convert it into DataFrame

data1 = pd.DataFrame(data1)
data2 = pd.DataFrame(data2)

using concat

data = pd.concat([data1, data2], axis=1)