I have 2 columns in my dataframe, Col A and Col B. I want to create Col C.
The logic of Col C - For each unique value in Col A, when Col B has a '1', '2' or '5' then keep incrementing the number in Col C. For all other numbers store a value of 0 in Col C.
Col A | Col B | Col C |
---|---|---|
A1 | 0 | 0 |
A1 | 1 | 1 |
A1 | 5 | 2 |
A1 | 1 | 3 |
A1 | 2 | 4 |
B1 | 1 | 1 |
B1 | 2 | 2 |
B1 | 3 | 0 |
C1 | 1 | 1 |
C1 | 0 | 0 |
I was able to identify if Col A row 1 is same as row 2 using this
df['Col A'].shift(-1)==df['Col A']
and i can use an if condition on Col B to identify if its a 1 or a 2 or 5
df.loc[(df["Col B"] == 1 | (df["Col B"] == 2) | (df["Col B"] == 5)]
but don't know how to combine these 2 to get the output in Col C.
CodePudding user response:
IIUC, if you want to cumcount except when a non 1/2/5:
m = df['Col B'].isin((1,2,5))
df['Col C'] = m.groupby([df['Col A'], ~m]).cumsum()
Or if you want to systematically reset when a non 1/2/5 is found:
df['Col C'] = m.groupby([df['Col A'], (~m).cumsum()]).cumsum()
output:
Col A Col B Col C
0 A1 0 0
1 A1 1 1
2 A1 5 2
3 A1 1 3
4 A1 2 4
5 B1 1 1
6 B1 2 2
7 B1 3 0
8 C1 1 1
9 C1 0 0
comparison of the two options
As pointed out by @jezrael the question is ambiguous on this point, so you can pick the alternative you need:
m = df['Col B'].isin((1,2,5))
df['Col C (continue)'] = m.groupby([df['Col A'], ~m]).cumsum()
df['Col C (reset)'] = m.groupby([df['Col A'], (~m).cumsum()]).cumsum()
output:
Col A Col B Col C (continue) Col C (reset)
0 A1 0 0 0
1 A1 1 1 1
2 A1 5 2 2
3 A1 1 3 3
4 A1 2 4 4
5 B1 1 1 1
6 B1 2 2 2
7 B1 3 0 0
8 B1 2 3 1 # difference
9 C1 1 1 1
10 C1 0 0 0