Home > Software engineering >  Labeling a dataframe based on unique values
Labeling a dataframe based on unique values

Time:11-12

I have the following dataframe:

x=df[['PROJECT','ID']] 
print(x)



        PROJECT       ID
0           1         10
1           1         32
2           1         32
3           1          9
4           1         32
5           1         10
6           2         10
7           2         33
8           2         54
9           2          9
10          2         31
11          2         10
12          3         10
13          3         54
14          3         53
15          3         11
16          3         33
17          3         54

I want to add a third column titled 'Unique' with values 'Yes' or 'No'. If the value of an index in the column with label 'ID' is present more than once AND they have the same value for 'PROJECT', than the value must be attributed a 'No', otherwise 'Yes'. So the outcome I'd like to have would look like:

         PROJECT       ID     Unique
0           1         10      No
1           1         32      No
2           1         32      No
3           1          9      Yes
4           1         32      No
5           1         10      No
6           2         10      No
7           2         33      Yes
8           2         54      Yes
9           2          9      Yes
10          2         31      Yes
11          2         10      No
12          3         10      Yes
13          3         54      No
14          3         53      Yes
15          3         11      Yes
16          3         33      Yes
17          3         54      No

You basically have to see rows with the same PROJECT value as a subset for which you have to determine the uniqueness of the corresponding ID values. Rows with different values for 'PROJECT' have no connection. I tried several things with combining either np.groupby , np.isin or np.unique with some sort of loop but it hasn't worked yet.

CodePudding user response:

Group by 'Project' and for each group mark the 'ID' duplicates with True using Series.duplicated. Then map the True values (i.e. duplicates) to 'No', and False to 'Yes' using Series.map.

df['Unique'] = (
    df.groupby('PROJECT')['ID']
      .apply(lambda g: g.duplicated(keep=False))
      .map({True:'No', False:'Yes'})
)

>>> df

    PROJECT  ID Unique
0         1  10     No
1         1  32     No
2         1  32     No
3         1   9    Yes
4         1  32     No
5         1  10     No
6         2  10     No
7         2  33    Yes
8         2  54    Yes
9         2   9    Yes
10        2  31    Yes
11        2  10     No
12        3  10    Yes
13        3  54     No
14        3  53    Yes
15        3  11    Yes
16        3  33    Yes
17        3  54     No

CodePudding user response:

We can use DataFrame.duplicated with the subset parameter set to only consider certain columns. Then we can convert the boolean values to 'Yes'/'No' with np.where:

df['Unique'] = np.where(
    df.duplicated(subset=['PROJECT', 'ID'], keep=False), 'No', 'Yes'
)

df:

    PROJECT  ID Unique
0         1  10     No
1         1  32     No
2         1  32     No
3         1   9    Yes
4         1  32     No
5         1  10     No
6         2  10     No
7         2  33    Yes
8         2  54    Yes
9         2   9    Yes
10        2  31    Yes
11        2  10     No
12        3  10    Yes
13        3  54     No
14        3  53    Yes
15        3  11    Yes
16        3  33    Yes
17        3  54     No

Setup:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'PROJECT': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
    'ID': [10, 32, 32, 9, 32, 10, 10, 33, 54, 9, 31, 10, 10, 54, 53, 11, 33, 54]
})
  • Related