I have a pandas data frame, which looks like the following:
email col2 col3
[email protected] John Doe
[email protected] John Doe
[email protected] John Doe
[email protected] John Doe
[email protected] Jane Doe
I want to go through each email address starting with at least two 'x's and check whether the same email address exists without those 'x's.
Required result:
email col2 col3 exists_in_valid_form
[email protected] John Doe False
[email protected] John Doe True
[email protected] John Doe True
[email protected] John Doe True
[email protected] Jane Doe False
I was able to get a sub-data frame containing all of the rows whose emails start with 'xx' using df[df['email'].str.contains("xx")], and I was also able to get the email addresses without the 'x's using str.lstrip('x'), but neither seems to help me determine whether the same email appears somewhere else without those 'x's.
CodePudding user response:
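You can use duplicated() to check whether a value also appears in another row. With keep=False, every occurrence of a repeated value is marked True, not just the later ones.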
df['exists_in_valid_form'] = df.email.str.lstrip('x').duplicated(keep=False) & df.email.str.startswith('xx')
I added df.email.str.startswith('xx') to make sure the email starts with at least two 'x's, so the result is False for "[email protected]".
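Here is a minimal runnable sketch of the above. The addresses in the question are redacted, so the sample emails below are made up purely for illustration. The second column, exists_strict, is not part of the answer above; it is a hypothetical, stricter variant that only counts a match when the stripped address appears among the rows that do not start with 'xx'.

```python
import pandas as pd

# Hypothetical sample data: the real addresses are redacted in the question,
# so these values are invented for illustration only.
df = pd.DataFrame({
    "email": [
        "john@example.com",
        "xxjohn@example.com",
        "xxxjohn@example.com",
        "xxxxjohn@example.com",
        "xxjane@example.com",
    ],
    "col2": ["John", "John", "John", "John", "Jane"],
    "col3": ["Doe"] * 5,
})

# True for rows whose email starts with at least two 'x's.
starts_xx = df["email"].str.startswith("xx")

# Email with all leading 'x' characters removed.
stripped = df["email"].str.lstrip("x")

# Approach from the answer: the stripped value occurs in more than one row
# AND the original email starts with 'xx'.
df["exists_in_valid_form"] = stripped.duplicated(keep=False) & starts_xx

# Stricter variant (an assumption, not from the answer): the stripped value
# must match an email from a row that does not start with 'xx'.
valid_emails = set(df.loc[~starts_xx, "email"])
df["exists_strict"] = starts_xx & stripped.isin(valid_emails)

print(df)
```

On this sample both columns agree. They would differ if, say, two 'xx'-prefixed rows shared the same stripped address but no unprefixed row existed: duplicated(keep=False) would still mark them True, while the isin check would not.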