Apply Levenshtein distance from rapidfuzz.distance to dataframe with two columns

Time:07-11

I have a csv file that looks as follows:

ID; name1; name2
1; John Doe; John Does
2; Mike Johnson; Mike Jonson
3; Leon Mill; Leon Miller
4; Jack Jo; Jack Joe

Now I want to calculate the Levenshtein distance for each pair of names. So "John Doe" is compared to "John Does" and the result is put into a new column. The next comparison is then made for "Mike Johnson" and "Mike Jonson". The output would be as follows:

ID; name1; name2;ld
1; John Doe; John Does;1
2; Mike Johnson; Mike Jonson;1
3; Leon Mill; Leon Miller;2
4; Jack Jo; Jack Joe;1

I tried it (see How do I calculate the Levenshtein distance between two Pandas DataFrame columns?) as follows:

from rapidfuzz.distance import Levenshtein
import pandas as pd

df = pd.read_csv(r'C:\Users\myuser\Downloads\Testfile.csv', sep=";")
print(df)

df['ld']=df.apply(lambda x: Levenshtein.distance(df['name1'], df['name2']), axis=1)

But I am getting an error:

KeyError: 'name1'

Where is my mistake?

Answer:

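Inside the lambda, index the row variable x instead of the whole DataFrame df. With axis=1, apply passes each row to the lambda as x, so x['name1'] and x['name2'] are the two strings for that row: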

df['ld']=df.apply(lambda x: Levenshtein.distance(x['name1'], x['name2']), axis=1)