Home > OS >  Output number of times a duplicate value appears within grouped data
Output number of times a duplicate value appears within grouped data

Time:09-26

I have a dataframe that looks like the following:

d = {'id_1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], 
     'id_2': [50, 50, 25, 25, 25, 4, 4, 4, 5, 5],
    'name': ['Kim', 'Kim', 'Kim', 'Kim', 'Mike', 'Kim', 'Saul', 'Saul', 'Kim', 'Kim']}
df = pd.DataFrame(data=d)
df

| id_1 | id_2 | name |
|------|------|------|
| 1    | 50   | Kim  | 
| 1    | 50   | Kim  | 
| 1    | 25   | Kim  | 
| 1    | 25   | Kim  | 
| 1    | 25   | Mike | 
| 2    | 4    | Kim  | 
| 2    | 4    | Saul | 
| 2    | 4    | Saul | 
| 2    | 5    | Kim  | 
| 2    | 5    | Kim  |

I would like to group the dataframe by the two ids and generate a new column showing how many times the name value appears within each group.

Sample output:

| id_1 | id_2 | name | output |
|------|------|------|--------|
| 1    | 50   | Kim  | 2      |
| 1    | 50   | Kim  | 2      |
| 1    | 25   | Kim  | 2      |
| 1    | 25   | Kim  | 2      |
| 1    | 25   | Mike | 1      |
| 2    | 4    | Kim  | 1      |
| 2    | 4    | Saul | 2      |
| 2    | 4    | Saul | 2      |
| 2    | 5    | Kim  | 2      |
| 2    | 5    | Kim  | 2      |

In order to group the data, I am starting something along the lines of

df.groupby(['id_1', 'id_2']).size() 
   .sort_values(ascending=False) 
   .reset_index(name='count')

but not sure to actually count and output the number of duplicates into a new column.

CodePudding user response:

df['output'] = df.groupby(['id_1','id_2','name'])['name'].transform(len)
print(df)

Result

   id_1  id_2  name  output
0     1    50   Kim       2
1     1    50   Kim       2
2     1    25   Kim       2
3     1    25   Kim       2
4     1    25  Mike       1
5     2     4   Kim       1
6     2     4  Saul       2
7     2     4  Saul       2
8     2     5   Kim       2
9     2     5   Kim       2
  • Related