How to add sequential counter column on groups using Pandas groupby?-CodePudding

My question is generated from here.

Instead of creating a column "seq":

    c1 c2  seq
0   A  X   1
1   A  X   2
2   A  Y   1
3   A  Y   2
4   B  X   1
5   B  X   2
6   B  X   3
7   B  Y   1
8   C  X   1
9   C  Y   1
10  C  Y   2
11  C  Y   6

I want to create two columns to count X and Y separately, the count should refresh according to c1 column, like below.

    c1 c2  Ct_X Ct_Y
0   A  X   1    0
1   A  X   2    0
2   A  Y   2    1
3   A  Y   2    2
4   B  X   1    0
5   B  X   2    0
6   B  X   3    0
7   B  Y   3    1
8   C  X   1    0
9   C  Y   1    1
10  C  Y   1    2
11  C  Y   1    3

CodePudding user response：

Try groupby with transform:

x = df.groupby('c1')['c2']
df['Ct_X'] = x.transform(lambda x: x.eq('X').cumsum())
df['Ct_Y'] = x.transform(lambda x: x.eq('Y').cumsum())
print(df)

Output:

   c1 c2  seq  Ct_X  Ct_Y
0   A  X    1     1     0
1   A  X    2     2     0
2   A  Y    1     2     1
3   A  Y    2     2     2
4   B  X    1     1     0
5   B  X    2     2     0
6   B  X    3     3     0
7   B  Y    1     3     1
8   C  X    1     1     0
9   C  Y    1     1     1
10  C  Y    2     1     2
11  C  Y    6     1     3

CodePudding user response：

A similar approach but slightly more generic, we can encode c2 with pd.get_dummies then take the groupby cumsum relative to c1 then we can join the newly created columns back to the original DataFrame:

df = df.join(
    pd.get_dummies(df, columns=['c2'], prefix='Ct').groupby('c1').cumsum()
)

Alternatively str.get_dummies can be used instead and pass the grouping Series directly to groupby:

df = df.join(
    df['c2'].str.get_dummies().groupby(df['c1']).cumsum().add_prefix('Ct_')
)

df:

   c1 c2  Ct_X  Ct_Y
0   A  X     1     0
1   A  X     2     0
2   A  Y     2     1
3   A  Y     2     2
4   B  X     1     0
5   B  X     2     0
6   B  X     3     0
7   B  Y     3     1
8   C  X     1     0
9   C  Y     1     1
10  C  Y     1     2
11  C  Y     1     3

How does this work?

pd.get_dummies produces "dummy/indicator variables" for all values in c2, resulting in:

pd.get_dummies(df, columns=['c2'], prefix='Ct')

   c1  Ct_X  Ct_Y
0   A     1     0
1   A     1     0
2   A     0     1
3   A     0     1
4   B     1     0
5   B     1     0
6   B     1     0
7   B     0     1
8   C     1     0
9   C     0     1
10  C     0     1
11  C     0     1

Now since the indicators are already in 1/0 format we can simply take the cumulative total for each group in c1 resulting in the desired new columns:

pd.get_dummies(df, columns=['c2'], prefix='Ct').groupby('c1').cumsum()

    Ct_X  Ct_Y
0      1     0
1      2     0
2      2     1
3      2     2
4      1     0
5      2     0
6      3     0
7      3     1
8      1     0
9      1     1
10     1     2
11     1     3

Notes

This approach will calculate the cumulative total for all unique values in c2 per c1 group.
If using pd.get_dummies on a DataFrame with more columns than just c1 and c2 the DataFrame may need subset like:

pd.get_dummies(df[['c1', 'c2']], columns=['c2'], prefix='Ct').groupby('c1').cumsum()

Setup:

import pandas as pd

df = pd.DataFrame({
    'c1': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'c2': ['X', 'X', 'Y', 'Y', 'X', 'X', 'X', 'Y', 'X', 'Y', 'Y', 'Y']
})