I have the following dataframe:
x=df[['PROJECT','ID']]
print(x)
PROJECT ID
0 1 10
1 1 32
2 1 32
3 1 9
4 1 32
5 1 10
6 2 10
7 2 33
8 2 54
9 2 9
10 2 31
11 2 10
12 3 10
13 3 54
14 3 53
15 3 11
16 3 33
17 3 54
I want to add a third column titled 'Unique' with values 'Yes' or 'No'. If the value of an index in the column with label 'ID' is present more than once AND they have the same value for 'PROJECT', than the value must be attributed a 'No', otherwise 'Yes'. So the outcome I'd like to have would look like:
PROJECT ID Unique
0 1 10 No
1 1 32 No
2 1 32 No
3 1 9 Yes
4 1 32 No
5 1 10 No
6 2 10 No
7 2 33 Yes
8 2 54 Yes
9 2 9 Yes
10 2 31 Yes
11 2 10 No
12 3 10 Yes
13 3 54 No
14 3 53 Yes
15 3 11 Yes
16 3 33 Yes
17 3 54 No
You basically have to see rows with the same PROJECT value as a subset for which you have to determine the uniqueness of the corresponding ID values. Rows with different values for 'PROJECT' have no connection. I tried several things with combining either np.groupby
, np.isin
or np.unique
with some sort of loop but it hasn't worked yet.
CodePudding user response:
Group by 'Project' and for each group mark the 'ID' duplicates with True using Series.duplicated
. Then map the True values (i.e. duplicates) to 'No', and False to 'Yes' using Series.map
.
df['Unique'] = (
df.groupby('PROJECT')['ID']
.apply(lambda g: g.duplicated(keep=False))
.map({True:'No', False:'Yes'})
)
>>> df
PROJECT ID Unique
0 1 10 No
1 1 32 No
2 1 32 No
3 1 9 Yes
4 1 32 No
5 1 10 No
6 2 10 No
7 2 33 Yes
8 2 54 Yes
9 2 9 Yes
10 2 31 Yes
11 2 10 No
12 3 10 Yes
13 3 54 No
14 3 53 Yes
15 3 11 Yes
16 3 33 Yes
17 3 54 No
CodePudding user response:
We can use DataFrame.duplicated
with the subset
parameter set to only consider certain columns. Then we can convert the boolean values to 'Yes'
/'No'
with np.where
:
df['Unique'] = np.where(
df.duplicated(subset=['PROJECT', 'ID'], keep=False), 'No', 'Yes'
)
df
:
PROJECT ID Unique
0 1 10 No
1 1 32 No
2 1 32 No
3 1 9 Yes
4 1 32 No
5 1 10 No
6 2 10 No
7 2 33 Yes
8 2 54 Yes
9 2 9 Yes
10 2 31 Yes
11 2 10 No
12 3 10 Yes
13 3 54 No
14 3 53 Yes
15 3 11 Yes
16 3 33 Yes
17 3 54 No
Setup:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'PROJECT': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
'ID': [10, 32, 32, 9, 32, 10, 10, 33, 54, 9, 31, 10, 10, 54, 53, 11, 33, 54]
})