I have a dataframe that looks like the following:
d = {'id_1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
'id_2': [50, 50, 25, 25, 25, 4, 4, 4, 5, 5],
'name': ['Kim', 'Kim', 'Kim', 'Kim', 'Mike', 'Kim', 'Saul', 'Saul', 'Kim', 'Kim']}
df = pd.DataFrame(data=d)
df
| id_1 | id_2 | name |
|------|------|------|
| 1 | 50 | Kim |
| 1 | 50 | Kim |
| 1 | 25 | Kim |
| 1 | 25 | Kim |
| 1 | 25 | Mike |
| 2 | 4 | Kim |
| 2 | 4 | Saul |
| 2 | 4 | Saul |
| 2 | 5 | Kim |
| 2 | 5 | Kim |
I would like to group the dataframe by the two ids and generate a new column showing how many times the name
value appears within each group.
Sample output:
| id_1 | id_2 | name | output |
|------|------|------|--------|
| 1 | 50 | Kim | 2 |
| 1 | 50 | Kim | 2 |
| 1 | 25 | Kim | 2 |
| 1 | 25 | Kim | 2 |
| 1 | 25 | Mike | 1 |
| 2 | 4 | Kim | 1 |
| 2 | 4 | Saul | 2 |
| 2 | 4 | Saul | 2 |
| 2 | 5 | Kim | 2 |
| 2 | 5 | Kim | 2 |
In order to group the data, I am starting something along the lines of
df.groupby(['id_1', 'id_2']).size()
.sort_values(ascending=False)
.reset_index(name='count')
but not sure to actually count and output the number of duplicates into a new column.
CodePudding user response:
df['output'] = df.groupby(['id_1','id_2','name'])['name'].transform(len)
print(df)
Result
id_1 id_2 name output
0 1 50 Kim 2
1 1 50 Kim 2
2 1 25 Kim 2
3 1 25 Kim 2
4 1 25 Mike 1
5 2 4 Kim 1
6 2 4 Saul 2
7 2 4 Saul 2
8 2 5 Kim 2
9 2 5 Kim 2