Home > Blockchain >  How to isolate duplicates based on partial match in a pandas dataframe
How to isolate duplicates based on partial match in a pandas dataframe

Time:11-18

I have a pandas data frame, which looks like the following:

email                   col2  col3
[email protected]       John  Doe
[email protected]    John  Doe
[email protected]     John  Doe
[email protected]  John  Doe
[email protected]   Jane  Doe

I want to go through each email address starting with at least two 'x's and check whether the same email address exists without those 'x's.

Required result:

email                   col2  col3  exists_in_valid_form
[email protected]       John  Doe   False
[email protected]    John  Doe   True
[email protected]     John  Doe   True
[email protected]  John  Doe   True
[email protected]   Jane  Doe   False

I was able to get a sub-data frame containing all of those rows with the emails starting with 'xx' using df[df['email'].str.contains("xx")], and was also able to get the email addresses without the 'x's using str.lstrip('x'), but neither does not seem to help me get whether this email appears somewhere else without those x's or not.

CodePudding user response:

You can use duplicated() to get whether a value is existing in other row.

df['exists_in_valid_form'] = df.email.str.lstrip('x').duplicated(keep=False) & df.email.str.startswith('xx')

I added df.email.str.startswith('xx') to make sure it should start with at least 2 "x" and return False for "[email protected]".

  • Related