I have a dataframe df like below:
import pandas as pd
data = {
    'A': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'PQR', 'PQR', 'PQR', 'PQR', 'CVB', 'CVB', 'CVB', 'CVB'],
    'B': ['2022-02-16 14:00:31', '2022-02-16 16:11:26', '2022-02-16 17:31:26',
          '2022-02-16 22:47:46', '2022-02-17 07:11:11', '2022-02-17 10:43:36',
          '2022-02-17 15:05:11', '2022-02-18 18:06:12', '2022-02-19 09:05:46',
          '2022-02-19 13:02:16', '2022-02-19 18:05:26', '2022-02-19 22:05:26'],
    'C': [1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1],
}
df = pd.DataFrame(data)
df['B'] = pd.to_datetime(df['B'])
df
| A | B | C |
------- ---------------------- ------------
| XYZ | 2022-02-16 14:00:31 | 1 |
| XYZ | 2022-02-16 16:11:26 | 0 |
| XYZ | 2022-02-16 17:31:26 | 0 |
| XYZ | 2022-02-16 22:47:46 | 0 |
| PQR | 2022-02-17 07:11:11 | 1 |
| PQR | 2022-02-17 10:43:36 | 0 |
| PQR | 2022-02-17 15:05:11 | 1 |
------- ---------------------- ------------
I want to count the occurrences of 1 and 0 and assign the counts to a new column Count of the dataframe df, and also add ID as a new column, so that the output looks like the table below. For instance, in column C the pattern 1,0,0,0 in the first four rows has a count of 4, and similarly the last row contains only the value 1, for which the count is 1.
Expected output:
| A | B | C | Count | ID |
------- ---------------------- ------------ ---------- ---------
| XYZ | 2022-02-16 14:00:31 | 1 | 4 | ABC_1 |
| XYZ | 2022-02-16 16:11:26 | 0 | NaN | |
| XYZ | 2022-02-16 17:31:26 | 0 | NaN | |
| XYZ | 2022-02-16 22:47:46 | 0 | NaN | |
| PQR | 2022-02-17 07:11:11 | 1 | 2 | ABC_2 |
| PQR | 2022-02-17 10:43:36 | 0 | NaN | |
| PQR | 2022-02-17 15:05:11 | 1 | 1 | ABC_3 |
------- ---------------------- ------------ ---------- ---------
Currently, I'm trying to achieve this with the code below, but it does not give the expected results.
one_index = df[df['C'] == 1].index
zero_index = df[df['C'] == 0].index
df.loc[0, 'Count'] = len(df)
df.loc[one_index, 'ID'] = "ABC_1"
Actual output:
| A | B | C | Count | ID |
------- ---------------------- ------------ ---------- --------
| XYZ | 2022-02-16 14:00:31 | 1 | 7 | ABC_1 |
| XYZ | 2022-02-16 16:11:26 | 0 | NaN | |
| XYZ | 2022-02-16 17:31:26 | 0 | NaN | |
| XYZ | 2022-02-16 22:47:46 | 0 | NaN | |
| PQR | 2022-02-17 07:11:11 | 1 | NaN | ABC_1 |
| PQR | 2022-02-17 10:43:36 | 0 | NaN | |
| PQR | 2022-02-17 15:05:11 | 1 | NaN | ABC_1 |
------- ---------------------- ------------ ---------- --------
How can I count the occurrences of 1 and 0 in the pandas dataframe?
CodePudding user response:
IIUC, you can do it with cumsum on C to create the groups and then groupby.transform to assign the size of each group. Then use the cumsum again to build ABC_X, where X is the group number, and use where to blank out the rows with 0 in C.
gr = df['C'].cumsum()
m = df['C'].eq(1)
df.loc[m, 'Count'] = df.groupby(gr).transform('size')
df['ID'] = ('ABC_' + gr.astype(str)).where(m, '')
print(df)
# A B C Count ID
# 0 XYZ 2022-02-16 14:00:31 1 4.0 ABC_1
# 1 XYZ 2022-02-16 16:11:26 0 NaN
# 2 XYZ 2022-02-16 17:31:26 0 NaN
# 3 XYZ 2022-02-16 22:47:46 0 NaN
# 4 PQR 2022-02-17 07:11:11 1 2.0 ABC_2
# 5 PQR 2022-02-17 10:43:36 0 NaN
# 6 PQR 2022-02-17 15:05:11 1 5.0 ABC_3
# 7 PQR 2022-02-18 18:06:12 0 NaN
# 8 CVB 2022-02-19 09:05:46 0 NaN
# 9 CVB 2022-02-19 13:02:16 0 NaN
# 10 CVB 2022-02-19 18:05:26 0 NaN
# 11 CVB 2022-02-19 22:05:26 1 1.0 ABC_4
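To see why the cumulative sum works as a group key, it can help to look at the intermediate values. This is a minimal sketch on just the C column from the question; the names gr and m follow the answer above, while c and sizes are names introduced here for illustration.

```python
import pandas as pd

# Column C from the question, as a standalone Series
c = pd.Series([1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1], name='C')

# Each 1 bumps the running total, so a 1 and the 0s that follow it share one label
gr = c.cumsum()
m = c.eq(1)

print(gr.tolist())     # [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4]

# Size of each labelled run, broadcast back to every row of that run
sizes = gr.groupby(gr).transform('size')
print(sizes.tolist())  # [4, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 1]

# Only the rows where C == 1 keep their size; the rest become NaN
print(sizes.where(m))
```

If C started with one or more 0s, those rows would get label 0, but the mask m would still leave their Count empty.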
NOTE: OP had an issue using groupby.transform on the full dataframe, so
df.loc[m, 'Count'] = df.groupby(gr).transform('size')
can be replaced by:
df.loc[m, 'Count'] = gr.groupby(gr).transform('size')  # or 'count' instead of 'size'
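For reference, here is a self-contained sketch that avoids groupby entirely. It is an alternative to the approach above, not a replacement for it: it maps each group label to its frequency with value_counts, and it assumes you would rather see integer counts (nullable Int64) than 4.0-style floats. The name run_length is introduced here for illustration.

```python
import pandas as pd

data = {
    'A': ['XYZ'] * 4 + ['PQR'] * 4 + ['CVB'] * 4,
    'B': ['2022-02-16 14:00:31', '2022-02-16 16:11:26', '2022-02-16 17:31:26',
          '2022-02-16 22:47:46', '2022-02-17 07:11:11', '2022-02-17 10:43:36',
          '2022-02-17 15:05:11', '2022-02-18 18:06:12', '2022-02-19 09:05:46',
          '2022-02-19 13:02:16', '2022-02-19 18:05:26', '2022-02-19 22:05:26'],
    'C': [1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1],
}
df = pd.DataFrame(data)
df['B'] = pd.to_datetime(df['B'])

gr = df['C'].cumsum()   # group label: increases by one at every 1
m = df['C'].eq(1)       # True on the first row of each group

# value_counts gives the number of rows per label; map broadcasts it back to the rows
run_length = gr.map(gr.value_counts())

# Keep the count only on rows where C == 1; Int64 allows missing values without floats
df['Count'] = run_length.where(m).astype('Int64')
df['ID'] = ('ABC_' + gr.astype(str)).where(m, '')

print(df)
```

On the sample data this should put counts 4, 2, 5 and 1 on the rows where C is 1, matching the output shown above apart from the integer dtype.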