I am working with two dataframes which look like this: df1
country_1 column1
united states of america abcd
Ireland (Republic of Ireland) efgh
Korea Rep Of fsdf
Switzerland (Swiss Confederation) dsaa
df2
country_2 column2
united states cdda
Ireland ddgd
South Korea rewt
Switzerland tuut
desired output:
country_1 column1 country_2 column2
united states of america abcd united states cdda
Ireland (Republic of Ireland) efgh Ireland ddgd
Korea Rep Of fsdf South Korea rewt
Switzerland (Swiss Confederation) dsaa Switzerland tuut
I am not that familiar with text analytics hence unable to understand any method to tackle this problem. I have tried string matching and regex but its not able to solve this problem.
CodePudding user response:
You can use difflib.
Data:
data1 = {
"country_1": ["united states of america", "Ireland (Republic of Ireland)", "Korea Rep Of", "Switzerland (Swiss Confederation)"],
"column1": ["abcd", "efgh", "fsdf", "dsaa"]
}
df1 = pd.DataFrame(data1)
data2 = {
"country_2": ["united states", "Ireland", "Korea", "Switzerland"],
"column2": ["cdda", "ddgd", "rewt", "tuut"]
}
df2 = pd.DataFrame(data2)
Code:
import difflib
from dataclasses import dataclass
import pandas as pd
@dataclass()
class FuzzyMerge:
"""
Works like pandas merge except also merges on approximate matches.
"""
left: pd.DataFrame
right: pd.DataFrame
left_on: str
right_on: str
how: str = "inner"
cutoff: float = 0.3
def main(self) -> pd.DataFrame:
temp = self.right.copy()
temp[self.left_on] = [
self.get_closest_match(x, self.left[self.left_on]) for x in temp[self.right_on]
]
return self.left.merge(temp, on=self.left_on, how=self.how)
def get_closest_match(self, left: pd.Series, right: pd.Series) -> str or None:
matches = difflib.get_close_matches(left, right, cutoff=self.cutoff)
return matches[0] if matches else None
Call the class:
merged = FuzzyMerge(left=df1, right=df2, left_on="country_1", right_on="country_2").main().drop(columns="column2")
print(merged)
Output:
country_1 column1 country_2
0 united states of america abcd united states
1 Ireland (Republic of Ireland) efgh Ireland
2 Korea Rep Of fsdf Korea
3 Switzerland (Swiss Confederation) dsaa Switzerland
CodePudding user response:
you can solve this problem by using pandas operations i.e using join,merge and concat: but I suggest you go through concat first as it is easy to start with
ps: make sure this is in form of Dataframe to convert it into DataFrame
data1 = pd.DataFrame(data1)
data2 = pd.DataFrame(data2)
using concat
data = pd.concat([data1, data2], axis=1)