How to determine if two rows are identical (similar) if row 2 contains part of the info from row 1?

Hope you are having a good day. I am currently working with an extremely dirty dataframe containing First Name, Last Name, and Middle Name. One of the issues that I am trying to resolve looks like this:

First Name	Last Name
James Agnew	Bond
James	Bond

Another similar issue that I am trying to resolve looks as follows:

First Name	Last Name
Jam	Bond
James	Bond

Looking forward to your ideas.

Thanks!

Edit: FYI, to make life simpler, I already have the data grouped by address, and each address is unique. So two rows will share one address, another two or three rows will share another address, and so on.

CodePudding user response：

If we just want to check that both elements in row2 are contained in the respective elements of row1, we only need one if statement:

row1 = ["James", "Bond"]
row2 = ["Jam", "Bo"]

# "in" does a substring test: is each field of row2 part of the same field in row1?
if row2[0] in row1[0] and row2[1] in row1[1]:
    print("Similar!")
else:
    print("Not Similar!")

If you want to check the opposite case (that row1 is in row2), just create a second if statement with the 'row1' and 'row2' terms swapped.
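Both directions can also be folded into one helper. The following is only a minimal sketch: the function names contains_either_way and rows_similar are made up for illustration, and lowercasing and stripping whitespace is an assumption about how dirty the data is.

```python
def contains_either_way(a: str, b: str) -> bool:
    """True if one string is a substring of the other (case-insensitive)."""
    a, b = a.strip().lower(), b.strip().lower()
    # An empty string is a substring of everything, so treat it as no match.
    if not a or not b:
        return False
    return a in b or b in a


def rows_similar(row1, row2) -> bool:
    """True if every pair of fields matches by containment in either direction."""
    return all(contains_either_way(x, y) for x, y in zip(row1, row2))


print(rows_similar(["James Agnew", "Bond"], ["James", "Bond"]))  # True
print(rows_similar(["Jam", "Bond"], ["James", "Bond"]))          # True
print(rows_similar(["James", "Bond"], ["John", "Bond"]))         # False
```

Since the rows are already grouped by address, this check would only need to run on pairs of rows within the same address group. Note that plain containment is permissive: a one-letter first name would match any name containing that letter.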

CodePudding user response：

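This is not such a simple problem. To check whether two strings are 'similar', you need a string distance algorithm (for example, an edit distance) rather than a plain equality test. That is, you must define a similarity function and decide how small the distance between two strings has to be for them to count as the same name.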

jellyfish is a library built to solve this kind of problem.
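To give an idea of what a similarity function looks like, here is a rough sketch that uses difflib from the Python standard library as a stand-in, so it runs without installing anything. This is not jellyfish's API; the function names and the 0.6 threshold are assumptions that would need tuning on the real data.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity score in [0, 1]; 1.0 means the normalized strings are equal."""
    a, b = a.strip().lower(), b.strip().lower()
    return SequenceMatcher(None, a, b).ratio()


def is_similar(a: str, b: str, threshold: float = 0.6) -> bool:
    """Treat two names as the same when the score reaches the threshold.

    The default threshold is an assumed value, not a recommendation.
    """
    return name_similarity(a, b) >= threshold


for left, right in [("Jam", "James"), ("James Agnew", "James"), ("James", "John")]:
    print(left, "|", right, "|", round(name_similarity(left, right), 3))
```

A dedicated string-metric library offers measures that are better suited to names, but the overall shape stays the same: compute a score for each pair of rows within an address group and compare it against a threshold.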

Another approach is to collect all names and bind them to a thesaurus of names like this

With some searching, I've found this

Hope this helps.