I have a dataframe df
like below:
import pandas as pd
data = {'A': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'PQR', 'XYZ', 'XYZ', 'ABC', 'XYZ', 'ABC'], 'B': ['2022-02-16 14:00:31', '2022-02-16 16:11:26', '2022-02-16 17:31:26',
'2022-02-16 22:47:46', '2022-02-17 07:11:11', '2022-02-17 10:43:36',
'2022-02-17 15:05:11', '2022-02-17 15:07:25', '2022-02-17 15:08:35',
'2022-02-17 15:09:46'], 'C': [1,0,0,0,1,0,0,1,0,0]}
df = pd.DataFrame(data)
df['B'] = pd.to_datetime(df['B'])
df
| A | B | C |
------- ---------------------- ------------
| XYZ | 2022-02-16 14:00:31 | 1 |
| XYZ | 2022-02-16 16:11:26 | 0 |
| XYZ | 2022-02-16 17:31:26 | 0 |
| XYZ | 2022-02-16 22:47:46 | 0 |
| PQR | 2022-02-17 07:11:11 | 1 |
| XYZ | 2022-02-17 10:43:36 | 0 |
| XYZ | 2022-02-17 15:05:11 | 0 |
| ABC | 2022-02-17 15:07:25 | 1 |
| XYZ | 2022-02-17 15:08:35 | 0 |
| ABC | 2022-02-17 15:09:46 | 0 |
------- ---------------------- ------------
What I want to achieve is that I want to count the number of duplicates for each of the occurences of XYZ
, PQR
, ABC
such that I get an output like below.
Expected Output :
| A | B | C | Count |
------- ---------------------- ------------ ----------
| XYZ | 2022-02-16 14:00:31 | 1 | 7 |
| XYZ | 2022-02-16 16:11:26 | 0 | |
| XYZ | 2022-02-16 17:31:26 | 0 | |
| XYZ | 2022-02-16 22:47:46 | 0 | |
| PQR | 2022-02-17 07:11:11 | 1 | 1 |
| XYZ | 2022-02-17 10:43:36 | 0 | |
| XYZ | 2022-02-17 15:05:11 | 0 | |
| ABC | 2022-02-17 15:07:25 | 1 | 2 |
| XYZ | 2022-02-17 15:08:35 | 0 | |
| ABC | 2022-02-17 15:09:45 | 0 | |
------- ---------------------- ------------ ----------
Currently, I'm trying to achieve the same by using the code below but I'm unable to get expected/desired results.
one_index = df[df['C'] == 1].index
zero_index = df[df['C'] == 0].index
df.loc[0, 'Count'] = len(df)
| A | B | C | Count |
------- ---------------------- ------------ ----------
| XYZ | 2022-02-16 14:00:31 | 1 | 10 |
| XYZ | 2022-02-16 16:11:26 | 0 | |
| XYZ | 2022-02-16 17:31:26 | 0 | |
| XYZ | 2022-02-16 22:47:46 | 0 | |
| PQR | 2022-02-17 07:11:11 | 1 | |
| XYZ | 2022-02-17 10:43:36 | 0 | |
| XYZ | 2022-02-17 15:05:11 | 0 | |
| ABC | 2022-02-17 15:07:25 | 1 | |
| XYZ | 2022-02-17 15:08:35 | 0 | |
| ABC | 2022-02-17 15:09:45 | 0 | |
------- ---------------------- ------------ ----------
So, how can I get count of duplicates for each of the values of column A
as mentioned ?
EDIT (OPTIONAL):
I would also like to have ID assigned to the groups for the df
after the count values has been assigned. So, my final dataframe should look like below after ID has been assigned:
| A | B | C | Count | ID |
------- ---------------------- ------------ ---------- -------
| XYZ | 2022-02-16 14:00:31 | 1 | 7 | ABC_1 |
| XYZ | 2022-02-16 16:11:26 | 0 | | |
| XYZ | 2022-02-16 17:31:26 | 0 | | |
| XYZ | 2022-02-16 22:47:46 | 0 | | |
| PQR | 2022-02-17 07:11:11 | 1 | 1 | ABC_2 |
| XYZ | 2022-02-17 10:43:36 | 0 | | |
| XYZ | 2022-02-17 15:05:11 | 0 | | |
| ABC | 2022-02-17 15:07:25 | 1 | 2 | ABC_3 |
| XYZ | 2022-02-17 15:08:35 | 0 | | |
| ABC | 2022-02-17 15:09:45 | 0 | | |
------- ---------------------- ------------ ---------- -------
CodePudding user response:
Use GroupBy.transform
with set empty strings for non 1
values in C
by Series.where
:
df['B'] = pd.to_datetime(df['B'])
m = df.C.eq(1)
df['Count'] = df.groupby('A')['C'].transform('size').where(m, '')
df.loc[m, 'ID'] = 'ABC_' pd.RangeIndex(1, m.sum() 1).astype(str)
df['ID'] = df['ID'].fillna('')
print (df)
A B C Count ID
0 XYZ 2022-02-16 14:00:31 1 7 ABC_1
1 XYZ 2022-02-16 16:11:26 0
2 XYZ 2022-02-16 17:31:26 0
3 XYZ 2022-02-16 22:47:46 0
4 PQR 2022-02-17 07:11:11 1 1 ABC_2
5 XYZ 2022-02-17 10:43:36 0
6 XYZ 2022-02-17 15:05:11 0
7 ABC 2022-02-17 15:07:25 1 2 ABC_3
8 XYZ 2022-02-17 15:08:35 0
9 ABC 2022-02-17 15:09:46 0
Or:
df['B'] = pd.to_datetime(df['B'])
m = df.C.eq(1)
df['Count'] = df.groupby('A')['C'].transform('size').where(m, '')
df['ID'] = ('ABC_' df['C'].cumsum().astype(str)).where(m, '')
print (df)
A B C Count ID
0 XYZ 2022-02-16 14:00:31 1 7 ABC_1
1 XYZ 2022-02-16 16:11:26 0
2 XYZ 2022-02-16 17:31:26 0
3 XYZ 2022-02-16 22:47:46 0
4 PQR 2022-02-17 07:11:11 1 1 ABC_2
5 XYZ 2022-02-17 10:43:36 0
6 XYZ 2022-02-17 15:05:11 0
7 ABC 2022-02-17 15:07:25 1 2 ABC_3
8 XYZ 2022-02-17 15:08:35 0
9 ABC 2022-02-17 15:09:46 0
CodePudding user response:
Use value_counts
and map
with boolean indexing:
df['Count'] = df.loc[df['C'].eq(1), 'A'].map(df['A'].value_counts())
output:
A B C Count
0 XYZ 2022-02-16 14:00:31 1 7.0
1 XYZ 2022-02-16 16:11:26 0 NaN
2 XYZ 2022-02-16 17:31:26 0 NaN
3 XYZ 2022-02-16 22:47:46 0 NaN
4 PQR 2022-02-17 07:11:11 1 1.0
5 XYZ 2022-02-17 10:43:36 0 NaN
6 XYZ 2022-02-17 15:05:11 0 NaN
7 ABC 2022-02-17 15:07:25 1 2.0
8 XYZ 2022-02-17 15:08:35 0 NaN
9 ABC 2022-02-17 15:09:46 0 NaN