I have the following problem. I need to compute a cumcount, but I would like the counter to reset every time the series is interrupted. See the example:
import pandas as pd
data = {'col_1': ['a', 'a', 'b', 'b', 'a'], 'col_2': [3, 2, 1, 0, -3]}
df = pd.DataFrame.from_dict(data)
I tried this, but it gives the wrong output (the last row gets 2 instead of 0, because all 'a' rows are counted together):
df["seq"] = df.groupby(["col_1"]).cumcount()
What I want is:
data = { 'col_1': ['a', 'a', 'b', 'b', 'a'], 'col_2': [3, 2, 1, 0, -3], 'seq': [0, 1, 0, 1, 0]}
How can I do it, please?
CodePudding user response:
Try:
df["seq"] = df.groupby((df["col_1"] != df["col_1"].shift()).cumsum())["col_1"].cumcount()
print(df)
Output
  col_1  col_2  seq
0     a      3    0
1     a      2    1
2     b      1    0
3     b      0    1
4     a     -3    0
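To see why this works, it can help to split the one-liner into its intermediate steps. The sketch below uses the question's data; the names `changed` and `run_id` are introduced here for illustration only and do not appear in the original answer.

```python
import pandas as pd

df = pd.DataFrame({"col_1": ["a", "a", "b", "b", "a"], "col_2": [3, 2, 1, 0, -3]})

# True wherever the value differs from the previous row. The first row is
# always True, because shift() leaves a missing value in that position.
changed = df["col_1"] != df["col_1"].shift()

# The cumulative sum of those flags gives each consecutive run its own id:
# 1, 1, 2, 2, 3
run_id = changed.cumsum()

# Grouping by the run id makes cumcount() restart at every run,
# rather than counting per distinct value of col_1.
df["seq"] = df.groupby(run_id).cumcount()
print(df)
```

Grouping by `run_id` instead of `col_1` is what separates the two runs of 'a', so the last row starts again at 0.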
CodePudding user response:
Note that since you are interested in runs (as in run-length encoding), itertools.groupby
might be better suited to this task. Consider the following example:
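import itertools
import pandas as pd
df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'a']})
# itertools.groupby yields one group per run of consecutive equal values
df['seq'] = [i for k, g in itertools.groupby(df['col1']) for i in range(len(list(g)))]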
print(df)
output
  col1  seq
0    a    0
1    a    1
2    b    0
3    b    1
4    a    0
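If the same logic is needed in more than one place, it could be wrapped in a small helper. This is only a sketch: `run_counter` is a hypothetical name chosen here, not a function from pandas or itertools, and it accepts any iterable rather than only a DataFrame column.

```python
import itertools


def run_counter(values):
    """Return the 0-based position of each item within its run of consecutive equal values."""
    seq = []
    for _, group in itertools.groupby(values):
        # enumerate() numbers the items of one run without building a temporary list
        seq.extend(i for i, _ in enumerate(group))
    return seq


print(run_counter(["a", "a", "b", "b", "a"]))  # [0, 1, 0, 1, 0]
```

With a DataFrame, the result could be assigned as `df['seq'] = run_counter(df['col1'])`. This loops at the Python level, so for large frames the shift/cumsum approach is likely to be faster.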