Pandas: Get the count of values and ID from a pattern of 1 and 0 in a pandas dataframe

I have a dataframe df like below:


import pandas as pd

data = {
    'A': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'PQR', 'PQR', 'PQR', 'PQR', 'CVB', 'CVB', 'CVB', 'CVB'],
    'B': ['2022-02-16 14:00:31', '2022-02-16 16:11:26', '2022-02-16 17:31:26',
          '2022-02-16 22:47:46', '2022-02-17 07:11:11', '2022-02-17 10:43:36',
          '2022-02-17 15:05:11', '2022-02-18 18:06:12', '2022-02-19 09:05:46',
          '2022-02-19 13:02:16', '2022-02-19 18:05:26', '2022-02-19 22:05:26'],
    'C': [1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1],
}
df = pd.DataFrame(data)
df['B'] = pd.to_datetime(df['B'])
df


     |   A   |          B           |     C      |
     +-------+----------------------+------------+
     |  XYZ  |  2022-02-16 14:00:31 |     1      |
     |  XYZ  |  2022-02-16 16:11:26 |     0      |
     |  XYZ  |  2022-02-16 17:31:26 |     0      |
     |  XYZ  |  2022-02-16 22:47:46 |     0      |
     |  PQR  |  2022-02-17 07:11:11 |     1      |
     |  PQR  |  2022-02-17 10:43:36 |     0      |
     |  PQR  |  2022-02-17 15:05:11 |     1      |
     +-------+----------------------+------------+

I want to count the rows in each run of column C that starts with a 1 and continues through the 0s that follow it, store that count in a new column Count on the row where the run starts, and add a new column ID that labels each run, as in the expected output below. For instance, the pattern 1,0,0,0 in the first four rows of column C has a count of 4, and the last row holds a single 1, so its count is 1.


Expected Output :

     |   A   |          B           |     C      |  Count   |    ID   |
     +-------+----------------------+------------+----------+---------+
     |  XYZ  |  2022-02-16 14:00:31 |     1      |     4    |  ABC_1  |
     |  XYZ  |  2022-02-16 16:11:26 |     0      |     NaN  |         |
     |  XYZ  |  2022-02-16 17:31:26 |     0      |     NaN  |         |
     |  XYZ  |  2022-02-16 22:47:46 |     0      |     NaN  |         |
     |  PQR  |  2022-02-17 07:11:11 |     1      |     2    |  ABC_2  |
     |  PQR  |  2022-02-17 10:43:36 |     0      |     NaN  |         |
     |  PQR  |  2022-02-17 15:05:11 |     1      |     1    |  ABC_3  |
     +-------+----------------------+------------+----------+---------+

Currently, I'm trying to achieve this with the code below, but I'm unable to get the expected results.


one_index = df[df['C'] == 1].index
zero_index = df[df['C'] == 0].index

df.loc[0, 'Count'] = len(df)
df.loc[one_index, 'ID'] = "ABC_1"

Actual Output :

     |   A   |          B           |     C      |  Count   |   ID   |
     +-------+----------------------+------------+----------+--------+
     |  XYZ  |  2022-02-16 14:00:31 |     1      |     7    |  ABC_1 |
     |  XYZ  |  2022-02-16 16:11:26 |     0      |    NaN   |        |
     |  XYZ  |  2022-02-16 17:31:26 |     0      |    NaN   |        |
     |  XYZ  |  2022-02-16 22:47:46 |     0      |    NaN   |        |
     |  PQR  |  2022-02-17 07:11:11 |     1      |    NaN   |  ABC_1 |
     |  PQR  |  2022-02-17 10:43:36 |     0      |    NaN   |        |
     |  PQR  |  2022-02-17 15:05:11 |     1      |    NaN   |  ABC_1 |
     +-------+----------------------+------------+----------+--------+

How can I count the occurrences of 1 and 0 in the pandas dataframe?

CodePudding user response:

IIUC, you can take the cumsum of C to create the groups (each 1 starts a new group), then use groupby.transform to get the size of each group. Reuse the same cumsum to build the ABC_X labels, where X is the group number, and use where to blank out the rows that have 0 in C.

gr = df['C'].cumsum()
m = df['C'].eq(1)
df.loc[m, 'Count'] = df.groupby(gr).transform('size')
df['ID'] = ('ABC_' + gr.astype(str)).where(m, '')
print(df)
#       A                   B  C  Count     ID
# 0   XYZ 2022-02-16 14:00:31  1    4.0  ABC_1
# 1   XYZ 2022-02-16 16:11:26  0    NaN       
# 2   XYZ 2022-02-16 17:31:26  0    NaN       
# 3   XYZ 2022-02-16 22:47:46  0    NaN       
# 4   PQR 2022-02-17 07:11:11  1    2.0  ABC_2
# 5   PQR 2022-02-17 10:43:36  0    NaN       
# 6   PQR 2022-02-17 15:05:11  1    5.0  ABC_3
# 7   PQR 2022-02-18 18:06:12  0    NaN       
# 8   CVB 2022-02-19 09:05:46  0    NaN       
# 9   CVB 2022-02-19 13:02:16  0    NaN       
# 10  CVB 2022-02-19 18:05:26  0    NaN       
# 11  CVB 2022-02-19 22:05:26  1    1.0  ABC_4
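
To see why the cumsum works as a group key, here is a minimal sketch that prints the intermediate values for the same C column. The names c and sizes are only for illustration and are not part of the answer above.

```python
import pandas as pd

# Same C column as in the question's data
c = pd.Series([1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1], name='C')

# Every 1 bumps the running total, so each 1 starts a new group label
gr = c.cumsum()
print(gr.tolist())
# [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4]

# Size of each group, broadcast back to every row of that group
sizes = c.groupby(gr).transform('size')
print(sizes.tolist())
# [4, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 1]
```

Keeping only the rows where C equals 1 then leaves one size per group, which is what the df.loc[m, 'Count'] assignment does.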

NOTE: OP had an issue using groupby.transform on the full dataframe, so

df.loc[m, 'Count'] = df.groupby(gr).transform('size')

can be replaced by:

df.loc[m, 'Count'] = gr.groupby(gr).transform('size')  # or 'count' instead of 'size'
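
For reuse, the Series-based variant could be wrapped in a small function. This is only a sketch: add_count_and_id, its col and prefix parameters, and the demo frame are hypothetical names introduced here, and the function assumes the column holds only 0 and 1.

```python
import pandas as pd

def add_count_and_id(df, col='C', prefix='ABC_'):
    """Hypothetical helper: add Count and ID for each run that starts with a 1."""
    out = df.copy()                 # leave the caller's frame untouched
    gr = out[col].cumsum()          # running group number, bumped by every 1
    m = out[col].eq(1)              # rows that start a group
    # Group the labels by themselves so only a single Series is involved
    out.loc[m, 'Count'] = gr.groupby(gr).transform('size')
    # Build the label for every row, then blank the rows that hold a 0
    out['ID'] = (prefix + gr.astype(str)).where(m, '')
    return out

# Same C column as in the question's data
demo = pd.DataFrame({'C': [1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1]})
res = add_count_and_id(demo)
print(res)
```

Count comes out as a float column because the rows without a 1 hold NaN, which matches the 4.0, 2.0, 5.0, 1.0 values in the printed output above.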