I have a CSV file that looks as follows:
ID; name1; name2
1; John Doe; John Does
2; Mike Johnson; Mike Jonson
3; Leon Mill; Leon Miller
4; Jack Jo; Jack Joe
Now I want to calculate the Levenshtein distance for each pair of names: compare "John Doe" to "John Does" and put the result into a new column, then do the same for "Mike Johnson" and "Mike Jonson", and so on. The output would be as follows:
ID; name1; name2;ld
1; John Doe; John Does;1
2; Mike Johnson; Mike Jonson;1
3; Leon Mill; Leon Miller;2
4; Jack Jo; Jack Joe;1
I tried it as follows (see "How do I calculate the Levenshtein distance between two Pandas DataFrame columns?"):
from rapidfuzz.distance import Levenshtein
import pandas as pd
df = pd.read_csv(r'C:\Users\myuser\Downloads\Testfile.csv', sep=";")
print(df)
df['ld']=df.apply(lambda x: Levenshtein.distance(df['name1'], df['name2']), axis=1)
But I am getting an error:
KeyError: 'name1'
Where is my mistake?
CodePudding user response:
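Inside the lambda, use the row variable x instead of df. With axis=1, apply passes each row to the lambda as x, so x['name1'] and x['name2'] are the two strings for that row, whereas df['name1'] refers to the whole column.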
df['ld']=df.apply(lambda x: Levenshtein.distance(x['name1'], x['name2']), axis=1)
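One more thing to check if the KeyError persists: the header in the sample has a space after each semicolon, so with sep=";" the columns are probably named ' name1' and ' name2' (with a leading space). Here is a minimal sketch, assuming the file really looks like the sample above. It reads an inline copy of the data through io.StringIO instead of Testfile.csv and passes skipinitialspace=True to read_csv to drop those spaces. The levenshtein name and the pure-Python fallback are my own additions, so that the sketch also runs where rapidfuzz is not installed.

```python
import io

import pandas as pd

try:
    # Preferred: the same function used in the question
    from rapidfuzz.distance import Levenshtein
    levenshtein = Levenshtein.distance
except ImportError:
    # Fallback (illustrative helper, not part of rapidfuzz):
    # classic dynamic-programming edit distance
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (ca != cb),  # substitution (free if equal)
                ))
            prev = curr
        return prev[-1]

# Inline copy of the sample data, standing in for Testfile.csv
csv_text = """ID; name1; name2
1; John Doe; John Does
2; Mike Johnson; Mike Jonson
3; Leon Mill; Leon Miller
4; Jack Jo; Jack Joe
"""

# Without skipinitialspace the header keeps the space after each ';'
raw = pd.read_csv(io.StringIO(csv_text), sep=";")
print(list(raw.columns))

# skipinitialspace=True strips the space after the delimiter,
# in the header as well as in the values
df = pd.read_csv(io.StringIO(csv_text), sep=";", skipinitialspace=True)

# Row-wise comparison: the lambda receives one row at a time
df["ld"] = df.apply(lambda row: levenshtein(row["name1"], row["name2"]), axis=1)
print(df)
```

Printing the column list first is a quick way to see whether stray whitespace in the header is what triggers the KeyError in your real file.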