I want to merge these example dataframes:
- How to get the closest matches in a new df?
df1:
name age department
DJ Griffin 27 FD
Harris Smith 33 RD
df2:
name age department
D.J. Griffin III 27 FD
Harris Smith 33 RD
Miles Jones 58 RD
The result should look like:
df3:
name age department name_y
DJ Griffin 27 FD D.J. Griffin III
Harris Smith 33 RD Harris Smith
Used Difflib but got an error, cause of the different length of the dfs.
import pandas as pd
import difflib
df1 = pd.DataFrame([["DJ Griffin", 27, "FD"], ["Harris Smith", 33, "RD"]], columns=["name", "age", "department"])
df2 = pd.DataFrame([["D.J. Griffin III", 27, "FD"], ["Harris Smith", 33, "RD"], ["Miles Jones", 58, "RD"]], columns=["name", "age", "department"])
df2['name_y'] = df2['name']
df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
Result:
IndexError: list index out of range
- How to find the closest match by two columns when there is another Harris Smith with the age of 45?
For the duplicate Harris Smith case
df1:
name age department
DJ Griffin 27 FD
Harris Smith 33 RD
Harris Smith 45 BA
df2:
name age department
D.J. Griffin III 27 FD
Harris Smith 33 RD
Harris Smith 45 BA
Miles Jones 58 RD
The result should look like:
df3:
name age department name_y
DJ Griffin 27 FD D.J. Griffin III
Harris Smith 33 RD Harris Smith
Harris Smith 45 BA Harris Smith
import pandas as pd
import difflib
df1 = pd.DataFrame([["DJ Griffin", 27, "FD"], ["Harris Smith", 33, "RD"], ["Harris Smith", 45, "BA"]], columns=["name", "age", "department"])
df2 = pd.DataFrame([["D.J. Griffin III", 27, "FD"], ["Harris Smith", 33, "RD"], ["Harris Smith", 45, "BA"], ["Miles Jones", 58, "RD"]], columns=["name", "age", "department"])
df2['name_y'] = df2['name']
Thanks for your help.
CodePudding user response:
The issue arises when you have zero match, slicing [0]
is not possible.
You could use instead:
df2['name'].apply(lambda x: next(iter(difflib.get_close_matches(x, df1['name'])), pd.NA))
or
df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])).str[0]
output:
0 DJ Griffin
1 Harris Smith
2 NaN
Name: name, dtype: object
update:
df1.merge(df2[['name', 'age']]
.assign(name_y=df2['name'],
name=df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])))
.explode('name')
.drop_duplicates(),
on=['name', 'age']
)
output:
name age department name_y
0 DJ Griffin 27 FD D.J. Griffin III
1 Harris Smith 33 RD Harris Smith
2 Harris Smith 45 BA Harris Smith