Based on the code below, I would expect the first element of the 'duplicate' column to be True, since 'juli' exists in 'df_set'. The real data set is much larger, which is why I convert the column to a set.
What am I doing wrong that causes the first element of 'duplicate' to be False?
import numpy as np
import pandas as pd
data = [
['tom', 'juli'],
['nick', 'heather'],
['juli', 'john'],
['dustin', 'tracy']
]
columns = ['Name', 'Name2']
df = pd.DataFrame(data, columns = columns)
df_set = set(df['Name'])
df['duplicate'] = np.isin(df['Name2'], df_set, assume_unique=True)
print(df)
Output:
     Name    Name2  duplicate
0     tom     juli      False
1    nick  heather      False
2    juli     john      False
3  dustin    tracy      False
CodePudding user response:
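np.isin does not treat a set as a collection of values. It converts its second argument with np.asarray, and a set becomes a zero-dimensional object array holding the set itself, so every name is compared with the whole set and never matches. Convert the set back to a list: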
df['duplicate'] = np.isin(df['Name2'], list(df_set), assume_unique=True)
Output:
>>> df
     Name    Name2  duplicate
0     tom     juli       True
1    nick  heather      False
2    juli     john      False
3  dustin    tracy      False
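To see why the set fails, here is a minimal sketch that inspects what NumPy makes of it. The variable names (names, candidates, wrapped) are only for illustration, and the candidates are stored as an object array to mimic the pandas column from the question.

```python
import numpy as np

names = {'tom', 'nick', 'juli', 'dustin'}
# Object dtype mimics what a pandas string column hands to NumPy.
candidates = np.array(['juli', 'heather', 'john', 'tracy'], dtype=object)

# A set is not a sequence, so NumPy wraps it as a single 0-d object.
wrapped = np.asarray(names)
print(wrapped.shape, wrapped.dtype)  # () object

# Each candidate is compared with the set as a whole, never its members.
mask_set = np.isin(candidates, names)
print(mask_set.tolist())  # [False, False, False, False]

# A list converts to a proper 1-D array, so membership works as expected.
mask_list = np.isin(candidates, list(names))
print(mask_list.tolist())  # [True, False, False, False]
```

The NumPy documentation for np.isin mentions the same pitfall and also recommends casting the set to a list.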
CodePudding user response:
Another way is to stay within pandas: Series.isin accepts a set directly.
df['duplicate'] = df['Name2'].isin(set(df['Name']))
Output:
     Name    Name2  duplicate
0     tom     juli       True
1    nick  heather      False
2    juli     john      False
3  dustin    tracy      False
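Since the real data set is larger and may contain repeated names, the following sketch compares both approaches on hypothetical data with duplicates (the rows are made up for illustration). It leaves assume_unique at its default, because NumPy documents that flag as assuming both inputs are unique.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'tom' and 'juli' each appear more than once.
df = pd.DataFrame({
    'Name':  ['tom', 'nick', 'juli', 'tom'],
    'Name2': ['juli', 'heather', 'juli', 'tracy'],
})

name_set = set(df['Name'])

# pandas: Series.isin accepts the set as-is.
via_pandas = df['Name2'].isin(name_set)

# NumPy: convert the set to a list first; assume_unique stays at its default.
via_numpy = np.isin(df['Name2'], list(name_set))

print(via_pandas.tolist())  # [True, False, True, False]
print(via_numpy.tolist())   # [True, False, True, False]
```

Both give the same mask here. The pandas version avoids the extra list conversion and returns a Series that keeps the DataFrame's index.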