I have a dataframe such as
Groups Names
G1 A
G1 A
G1 B
G1 B
G1 C
G1 C
G1 C
G1 D
G2 A
G2 B
G2 C
G3 A
G3 A
G4 F
G4 F
G4 E
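For reference, this frame can be reconstructed with the following snippet (my addition, assuming pandas):
import pandas as pd

df = pd.DataFrame({
    'Groups': ['G1'] * 8 + ['G2'] * 3 + ['G3'] * 2 + ['G4'] * 3,
    'Names': list('AABBCCCD') + list('ABC') + list('AA') + list('FFE'),
})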
For each group, I would like to count the number of distinct Names values that are duplicated (i.e. appear at least twice) within that group, and store this in a new column called Nb_duplicated. I would also like to add another column called Number_unique_names, holding the number of unique Names values within each group.
I should then get:
Groups Names Nb_duplicated Number_unique_names
G1 A 3 4
G1 A 3 4
G1 B 3 4
G1 B 3 4
G1 C 3 4
G1 C 3 4
G1 C 3 4
G1 D 3 4
G2 A 0 3
G2 B 0 3
G2 C 0 3
G3 A 1 1
G3 A 1 1
G4 F 1 2
G4 F 1 2
G4 E 1 2
CodePudding user response:
You can compute the number of unique names and the number of non-duplicated names (with GroupBy.transform), then subtract the two to get the number of duplicated ones:
# set up group
g = df.groupby('Groups')
# get unique values
df['unique'] = g['Names'].transform('nunique')
# get non-duplicates
non_dup = g['Names'].transform(lambda x: (~x.duplicated(False)).sum())
# duplicates = unique - non-duplicates
df['duplicated'] = df['unique'] - non_dup
NB. I used an intermediate variable non_dup here for clarity, but you can also write it as a one-liner.
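A compact equivalent would be (a sketch chaining the two transforms, not from the original answer):
# one-liner: duplicated = unique names minus non-duplicated names, per group
g = df.groupby('Groups')
df['duplicated'] = (g['Names'].transform('nunique')
                    - g['Names'].transform(lambda x: (~x.duplicated(keep=False)).sum()))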
Output (with the intermediate non_dup shown as an extra column for clarity):
Groups Names unique duplicated non_dup
0 G1 A 4 3 1
1 G1 A 4 3 1
2 G1 B 4 3 1
3 G1 B 4 3 1
4 G1 C 4 3 1
5 G1 C 4 3 1
6 G1 C 4 3 1
7 G1 D 4 3 1
8 G2 A 3 0 3
9 G2 B 3 0 3
10 G2 C 3 0 3
11 G3 A 1 1 0
12 G3 A 1 1 0
13 G4 F 2 1 1
14 G4 F 2 1 1
15 G4 E 2 1 1
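If you need the exact column names from the question, a final rename step works (hypothetical, not part of the original answer):
# rename to the column names requested in the question
df = df.rename(columns={'unique': 'Number_unique_names', 'duplicated': 'Nb_duplicated'})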
CodePudding user response:
Get a mask of the first occurrence of each (Groups, Names) pair by inverting DataFrame.duplicated, and a second mask with keep=False that flags every row of a duplicated pair (so names appearing only once are excluded); then use sum in GroupBy.transform to count the Trues per group:
m = ~df.duplicated(['Groups','Names'])
m1 = df.duplicated(['Groups','Names'], keep=False)
df['Nb_duplicated'] = (m & m1).groupby(df['Groups']).transform('sum')
df['Number_unique_names'] = m.groupby(df['Groups']).transform('sum')
print(df)
Groups Names Nb_duplicated Number_unique_names
0 G1 A 3 4
1 G1 A 3 4
2 G1 B 3 4
3 G1 B 3 4
4 G1 C 3 4
5 G1 C 3 4
6 G1 C 3 4
7 G1 D 3 4
8 G2 A 0 3
9 G2 B 0 3
10 G2 C 0 3
11 G3 A 1 1
12 G3 A 1 1
13 G4 F 1 2
14 G4 F 1 2
15 G4 E 1 2
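To see what the two masks contribute, here is a small inspection on one group (illustrative sketch, not part of the original answer):
# inspect the intermediate masks for group G1
sub = df[df['Groups'] == 'G1']
m = ~sub.duplicated(['Groups', 'Names'])               # True on the first occurrence of each name
m1 = sub.duplicated(['Groups', 'Names'], keep=False)   # True on every row of a name that repeats
print((m & m1).sum())  # 3 -> distinct names in G1 appearing at least twice
print(m.sum())         # 4 -> distinct names in G1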
Performance of this approach is better on the sample data below; please also test on your real data:
import numpy as np
import pandas as pd

np.random.seed(2022)
df = pd.DataFrame({'Groups': np.random.randint(1000, size=10000),
                   'Names': np.random.choice(list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'), size=10000)})
In [104]: %%timeit
...: m = ~df.duplicated(['Groups','Names'])
...: m1 = df.duplicated(['Groups','Names'], keep=False)
...: df['Nb_duplicated'] = (m & m1).groupby(df['Groups']).transform('sum')
...: df['Number_unique_names'] = m.groupby(df['Groups']).transform('sum')
...:
6.29 ms ± 50.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [105]: %%timeit
...: # set up group
...: g = df.groupby('Groups')
...: # get unique values
...: df['unique'] = g['Names'].transform('nunique')
...: # get non-duplicates
...: non_dup = g['Names'].transform(lambda x: (~x.duplicated(False)).sum())
...: # duplicates = unique - non-duplicates
...: df['duplicated'] = df['unique'] - non_dup
...:
344 ms ± 8.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)