I want to compute the streak since the value in a specific column last changed:
datetime             val1  val2  val1-streak  val2-streak
2018-04-01 00:00:00     4     1            0            0
2018-05-01 00:00:00     5     2            0            0
2018-06-01 00:00:00     5     2            1            1
2018-07-01 00:00:00     6     2            0            2
2018-08-01 00:00:00     7     2            0            3
2018-09-01 00:00:00     7     3            1            0
2018-10-01 00:00:00     7     3            2            1
2018-11-01 00:00:00     5     2            0            0
For now it is OK to assume that the datetime column is equally spaced. That way I can just count the "periods" since the value in a target column changed. Ideally the function could additionally report the difference in datetime since the value last changed.
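For reference, here is the sample data as a DataFrame (a minimal sketch; I'm assuming monthly timestamps, as the table suggests):

import pandas as pd

df = pd.DataFrame({
    'datetime': pd.date_range('2018-04-01', periods=8, freq='MS'),  # month starts
    'val1': [4, 5, 5, 6, 7, 7, 7, 5],
    'val2': [1, 2, 2, 2, 2, 3, 3, 2],
})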
I found a few solutions on this site, but when I looked more closely they did not really match my problem:
Pandas - count since last transaction
Counting changes in pandas row values?
In pandas, how do you find the time since a column by value changes, when grouped by another column?
Pythonic way to calculate streaks in pandas dataframe
finding streaks in pandas dataframe
https://joshdevlin.com/blog/calculate-streaks-in-pandas/
… at least I was not able to transfer the given answers to my problem.
CodePudding user response:
Let's try the following:
(i) Create a stacked Series using set_index and unstack (so we don't have to repeat the same operations on two columns separately).
(ii) Use groupby and diff to find the differences between consecutive elements: tmp.
(iii) We're interested in the locations where the differences are 0; we do another groupby and cumsum to get the desired outcome.
(iv) Use pivot to get the output back into the shape of df.
# stack val1 and val2 into a single Series (MultiIndex: column name, datetime)
# and diff within each column
tmp = df.set_index('datetime')[['val1', 'val2']].unstack().groupby(level=0).diff()

# mark zero differences (no change), label streaks with a cumulative sum of
# the change points, count within each streak, then pivot back to wide format
streaks = (tmp.eq(0)
           .groupby([tmp.index.get_level_values(0),
                     tmp.ne(0).cumsum()])
           .cumsum()
           .reset_index())
df[['val1-streak', 'val2-streak']] = streaks.pivot(
    index='datetime', columns='level_0', values=0).to_numpy()
Output:
             datetime  val1  val2  val1-streak  val2-streak
0 2018-04-01 00:00:00     4     1            0            0
1 2018-05-01 00:00:00     5     2            0            0
2 2018-06-01 00:00:00     5     2            1            1
3 2018-07-01 00:00:00     6     2            0            2
4 2018-08-01 00:00:00     7     2            0            3
5 2018-09-01 00:00:00     7     3            1            0
6 2018-10-01 00:00:00     7     3            2            1
7 2018-11-01 00:00:00     5     2            0            0
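The question also asks for the elapsed time since the last change. A minimal sketch along the same change-point grouping lines, assuming datetime is a real datetime column (the '-since' column names are illustrative, not from the question):

for col in ['val1', 'val2']:
    groups = df[col].diff().ne(0).cumsum()   # one label per streak
    # elapsed time = current timestamp minus the first timestamp of the streak
    df[col + '-since'] = df['datetime'] - df.groupby(groups)['datetime'].transform('first')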
CodePudding user response:
Let's start with how to do it for a single column. First, we need to find each point where the value changes:
diffs = df['val1'].diff(1)      # difference from the previous row
change_points = diffs != 0     # True where the value changed (incl. the first row, where diff is NaN)
Since previously encountered values can show up again, we need a proper way to distinguish different streaks of the same value. We'll achieve that with a cumulative sum of the change-point array. For convenience, we'll create a temporary DataFrame to hold those results (you could add a new column to your DataFrame instead, but that's a bit messy):
change_points_cumsum = change_points.cumsum()
tmp_merged = pd.concat([df['val1'], change_points_cumsum],
                       axis=1, keys=['val1', 'change_points_cumsum'])
Every streak is assigned a different change_points_cumsum value, so now a simple groupby and cumulative count can be applied to get the final result:
tmp_merged["val1-streak"] = tmp_merged.groupby(change_points_cumsum).cumcount()
print(tmp_merged)
   val1  change_points_cumsum  val1-streak
0     4                     1            0
1     5                     2            0
2     5                     2            1
3     6                     3            0
4     7                     4            0
5     7                     4            1
6     7                     4            2
7     5                     5            0
If you are looking for a shorter, more compact solution:
change_points_cumsum = df['val1'].diff(1).ne(0).cumsum()
change_points_cumsum.groupby(change_points_cumsum).cumcount()
0    0
1    0
2    1
3    0
4    0
5    1
6    2
7    0
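To get both val1-streak and val2-streak with the compact form, the same pattern can be looped over the columns (a sketch building on the above, not part of the original answer):

for col in ['val1', 'val2']:
    key = df[col].diff(1).ne(0).cumsum()    # one label per streak
    df[col + '-streak'] = df.groupby(key).cumcount()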
CodePudding user response:
Use a custom function to generate a counter over consecutive values: compute the differences per column (for the columns specified in the list) with Series.diff, compare for not-equal with Series.ne, take the cumulative sum, and finally pass it to GroupBy.cumcount:
vals = ['val1', 'val2']

def f(x):
    # label each streak of consecutive equal values...
    x = x.diff().ne(0).cumsum()
    # ...and count the rows within each streak, starting at 0
    return x.groupby(x).cumcount()

# '-streak' suffix to match the column names in the question
df = df.join(df[vals].apply(f).add_suffix('-streak'))
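Applied to the sample data, this reproduces the streak columns expected in the question (values taken from the question's table):

print(df['val1-streak'].tolist())   # [0, 0, 1, 0, 0, 1, 2, 0]
print(df['val2-streak'].tolist())   # [0, 0, 1, 2, 3, 0, 1, 0]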