Hope you are having a good day. I am currently working with an extremely dirty dataframe containing First Name, Last Name, and Middle Name. One of the issues that I am trying to resolve looks like this:
First Name | Last Name |
---|---|
James Agnew | Bond |
James | Bond |
Another, similar issue that I am trying to resolve looks as follows:
First Name | Last Name |
---|---|
Jam | Bond |
James | Bond |
Looking forward to your ideas.
Thanks!
Edit: FYI, to make life simpler, I already have data grouped by address which is unique. So, two rows will have one address, another two or three rows will have another address, and so on.
CodePudding user response:
If we just want to check that both elements in row2 are contained in the respective elements of row1, we just need one if statement
row1 = ["James", "Bond"]
row2 = ["Jam", "Bo"]
if row2[0] in row1[0] and row2[1] in row1[1]:
    print("Similar!")
else:
    print("Not Similar!")
If you want to check the opposite case (that row1 is contained in row2), just add a second if statement with the 'row1' and 'row2' terms swapped.
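The two directions can also be folded into one helper. The following is only a sketch: the function names `is_contained` and `similar` are made up for illustration, and lowercasing and stripping whitespace before comparing is an assumption about how dirty the data is.

```python
def is_contained(short_row, long_row):
    """True if every field of short_row is a non-empty substring
    of the matching field of long_row (case-insensitive)."""
    for short, long_ in zip(short_row, long_row):
        short, long_ = short.strip().lower(), long_.strip().lower()
        # An empty string is a substring of everything, so reject it explicitly.
        if not short or short not in long_:
            return False
    return True


def similar(row_a, row_b):
    """Check containment in both directions."""
    return is_contained(row_a, row_b) or is_contained(row_b, row_a)


# Sample rows taken from the question, plus one unrelated name.
print(similar(["James Agnew", "Bond"], ["James", "Bond"]))
print(similar(["Jam", "Bond"], ["James", "Bond"]))
print(similar(["James", "Bond"], ["Felix", "Leiter"]))
```

Since the rows are already grouped by address, this check would only need to run on pairs of rows within the same group. Note that plain containment will not catch misspellings such as "Jmaes".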
CodePudding user response:
This is not such a simple problem. To check whether two strings are 'similar' you need a string distance algorithm (an edit distance such as Levenshtein or Jaro-Winkler, rather than a Euclidean one). In other words, you must define a similarity function and decide how much distance between two strings you are willing to accept.
jellyfish is a library built to solve these problems.
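To illustrate the idea of a similarity function without installing anything, here is a rough sketch using `difflib.SequenceMatcher` from the standard library. This is a stand-in, not jellyfish's API, and its ratio is not the same metric as Levenshtein or Jaro-Winkler. The helper names and the 0.6 threshold are assumptions chosen to fit the sample rows in the question; real data would need tuning.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity score in [0, 1]; 1.0 means identical (ignoring case)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def likely_same(row_a, row_b, threshold=0.6):
    """Treat two rows as the same person if every field pair
    scores at or above the (assumed) threshold."""
    return all(name_similarity(a, b) >= threshold for a, b in zip(row_a, row_b))


# Sample rows taken from the question, plus one unrelated name.
print(likely_same(["Jam", "Bond"], ["James", "Bond"]))
print(likely_same(["James Agnew", "Bond"], ["James", "Bond"]))
print(likely_same(["James", "Bond"], ["Felix", "Leiter"]))
```

A threshold that is too low will merge different people living at the same address, so it is worth eyeballing the matched pairs before deduplicating.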
Another approach is to collect all the names and match them against a thesaurus of names like this
With some searching, I've found this
Hope this helps.