My question is generated from here.
Instead of creating a column "seq":
c1 c2 seq
0 A X 1
1 A X 2
2 A Y 1
3 A Y 2
4 B X 1
5 B X 2
6 B X 3
7 B Y 1
8 C X 1
9 C Y 1
10 C Y 2
11 C Y 6
I want to create two columns to count X and Y separately, the count should refresh according to c1 column, like below.
c1 c2 Ct_X Ct_Y
0 A X 1 0
1 A X 2 0
2 A Y 2 1
3 A Y 2 2
4 B X 1 0
5 B X 2 0
6 B X 3 0
7 B Y 3 1
8 C X 1 0
9 C Y 1 1
10 C Y 1 2
11 C Y 1 3
CodePudding user response:
Try groupby
with transform
:
x = df.groupby('c1')['c2']
df['Ct_X'] = x.transform(lambda x: x.eq('X').cumsum())
df['Ct_Y'] = x.transform(lambda x: x.eq('Y').cumsum())
print(df)
Output:
c1 c2 seq Ct_X Ct_Y
0 A X 1 1 0
1 A X 2 2 0
2 A Y 1 2 1
3 A Y 2 2 2
4 B X 1 1 0
5 B X 2 2 0
6 B X 3 3 0
7 B Y 1 3 1
8 C X 1 1 0
9 C Y 1 1 1
10 C Y 2 1 2
11 C Y 6 1 3
CodePudding user response:
A similar approach but slightly more generic, we can encode c2
with pd.get_dummies
then take the groupby cumsum
relative to c1
then we can join
the newly created columns back to the original DataFrame:
df = df.join(
pd.get_dummies(df, columns=['c2'], prefix='Ct').groupby('c1').cumsum()
)
Alternatively str.get_dummies
can be used instead and pass the grouping Series directly to groupby:
df = df.join(
df['c2'].str.get_dummies().groupby(df['c1']).cumsum().add_prefix('Ct_')
)
df
:
c1 c2 Ct_X Ct_Y
0 A X 1 0
1 A X 2 0
2 A Y 2 1
3 A Y 2 2
4 B X 1 0
5 B X 2 0
6 B X 3 0
7 B Y 3 1
8 C X 1 0
9 C Y 1 1
10 C Y 1 2
11 C Y 1 3
How does this work?
pd.get_dummies
produces "dummy/indicator variables" for all values in c2
, resulting in:
pd.get_dummies(df, columns=['c2'], prefix='Ct')
c1 Ct_X Ct_Y
0 A 1 0
1 A 1 0
2 A 0 1
3 A 0 1
4 B 1 0
5 B 1 0
6 B 1 0
7 B 0 1
8 C 1 0
9 C 0 1
10 C 0 1
11 C 0 1
Now since the indicators are already in 1/0 format we can simply take the cumulative total for each group in c1
resulting in the desired new columns:
pd.get_dummies(df, columns=['c2'], prefix='Ct').groupby('c1').cumsum()
Ct_X Ct_Y
0 1 0
1 2 0
2 2 1
3 2 2
4 1 0
5 2 0
6 3 0
7 3 1
8 1 0
9 1 1
10 1 2
11 1 3
Notes
- This approach will calculate the cumulative total for all unique values in
c2
perc1
group. - If using
pd.get_dummies
on a DataFrame with more columns than justc1
andc2
the DataFrame may need subset like:
pd.get_dummies(df[['c1', 'c2']], columns=['c2'], prefix='Ct').groupby('c1').cumsum()
Setup:
import pandas as pd
df = pd.DataFrame({
'c1': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
'c2': ['X', 'X', 'Y', 'Y', 'X', 'X', 'X', 'Y', 'X', 'Y', 'Y', 'Y']
})