Add column with number of duplicated values within groups and number of unique values within groups


I have a dataframe such as

Groups Names
G1     A
G1     A
G1     B
G1     B
G1     C
G1     C
G1     C
G1     D
G2     A
G2     B
G2     C
G3     A
G3     A
G4     F
G4     F
G4     E

For each group in Groups, I would like to count the number of Names values that are duplicated (appear at least twice) and add this count in a new column called Nb_duplicated. I would also like to add another column called Number_unique_names holding the number of unique Names values within each group.
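For reference, here is the sample frame as a reproducible pandas DataFrame (a minimal sketch of the data above):

import pandas as pd

df = pd.DataFrame({
    'Groups': ['G1']*8 + ['G2']*3 + ['G3']*2 + ['G4']*3,
    'Names': ['A', 'A', 'B', 'B', 'C', 'C', 'C', 'D',
              'A', 'B', 'C', 'A', 'A', 'F', 'F', 'E'],
})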

I should then get the following (for example, in G1 the names A, B and C each appear at least twice, hence Nb_duplicated = 3, and there are 4 distinct names A, B, C, D, hence Number_unique_names = 4):

Groups Names Nb_duplicated Number_unique_names
G1     A     3             4
G1     A     3             4
G1     B     3             4
G1     B     3             4
G1     C     3             4
G1     C     3             4
G1     C     3             4
G1     D     3             4
G2     A     0             3
G2     B     0             3
G2     C     0             3
G3     A     1             1
G3     A     1             1
G4     F     1             2
G4     F     1             2
G4     E     1             2

CodePudding user response:

You can compute the number of unique names and the number of non-duplicated names (with GroupBy.transform), then subtract the two to get the number of duplicated names:

# set up group
g = df.groupby('Groups')
# get unique values
df['unique'] = g['Names'].transform('nunique')
# get non-duplicates
non_dup = g['Names'].transform(lambda x: (~x.duplicated(keep=False)).sum())
# duplicates = unique - non-duplicates
df['duplicated'] = df['unique'] - non_dup

NB. I used an intermediate variable "non_dup" here for clarity, but you can use a one-liner.
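For instance (a sketch collapsing the two steps above into a single statement):

df['duplicated'] = df['unique'] - g['Names'].transform(lambda x: (~x.duplicated(keep=False)).sum())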

Output (with the intermediate non_dup added as a column for clarity):

   Groups Names  unique  duplicated           non_dup
0      G1     A       4           3                 1
1      G1     A       4           3                 1
2      G1     B       4           3                 1
3      G1     B       4           3                 1
4      G1     C       4           3                 1
5      G1     C       4           3                 1
6      G1     C       4           3                 1
7      G1     D       4           3                 1
8      G2     A       3           0                 3
9      G2     B       3           0                 3
10     G2     C       3           0                 3
11     G3     A       1           1                 0
12     G3     A       1           1                 0
13     G4     F       2           1                 1
14     G4     F       2           1                 1
15     G4     E       2           1                 1

CodePudding user response:

Build a mask of first occurrences by inverting DataFrame.duplicated, and a mask of all duplicated rows with keep=False; combining the two flags each duplicated name exactly once per group, and summing the True values with GroupBy.transform('sum') gives the counts:

# m: True on the first occurrence of each (Groups, Names) pair
m = ~df.duplicated(['Groups','Names'])
# m1: True on every row whose (Groups, Names) pair occurs more than once
m1 = df.duplicated(['Groups','Names'], keep=False)
# first occurrences of duplicated pairs, counted per group
df['Nb_duplicated'] = (m & m1).groupby(df['Groups']).transform('sum')
# first occurrences of all pairs = number of distinct names per group
df['Number_unique_names'] = m.groupby(df['Groups']).transform('sum')
print(df)
   Groups Names  Nb_duplicated  Number_unique_names
0      G1     A              3                    4
1      G1     A              3                    4
2      G1     B              3                    4
3      G1     B              3                    4
4      G1     C              3                    4
5      G1     C              3                    4
6      G1     C              3                    4
7      G1     D              3                    4
8      G2     A              0                    3
9      G2     B              0                    3
10     G2     C              0                    3
11     G3     A              1                    1
12     G3     A              1                    1
13     G4     F              1                    2
14     G4     F              1                    2
15     G4     E              1                    2
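To see why m & m1 flags each duplicated name exactly once, you can print the masks next to the data (a quick sanity check, assuming the snippet above has just been run and pandas is imported as pd):

# m is True on the first occurrence of each (Groups, Names) pair,
# m1 on every row whose pair occurs more than once, so m & m1 is
# True exactly once per duplicated name
print(pd.concat([df[['Groups', 'Names']],
                 m.rename('first_occurrence'),
                 m1.rename('any_duplicate')], axis=1).head(8))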

Performance is better on this sample data, since the mask approach avoids the per-group Python-level lambda; please also test on your real data:

import numpy as np
import pandas as pd

np.random.seed(2022)

df = pd.DataFrame({'Groups': np.random.randint(1000, size=10000),
                   'Names': np.random.choice(list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'), size=10000)})


In [104]: %%timeit
     ...: m = ~df.duplicated(['Groups','Names'])
     ...: m1 = df.duplicated(['Groups','Names'], keep=False)
     ...: df['Nb_duplicated'] = (m & m1).groupby(df['Groups']).transform('sum')
     ...: df['Number_unique_names'] = m.groupby(df['Groups']).transform('sum')
     ...: 
6.29 ms ± 50.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [105]: %%timeit
     ...: # set up group
     ...: g = df.groupby('Groups')
     ...: # get unique values
     ...: df['unique'] = g['Names'].transform('nunique')
     ...: # get non-duplicates
     ...: non_dup = g['Names'].transform(lambda x: (~x.duplicated(False)).sum())
     ...: # duplicates = unique - non-duplicates
     ...: df['duplicated'] = df['unique'] - non_dup
     ...: 
344 ms ± 8.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)