Pandas - count streak since last value change

I want to get the streak since a value in a specific column changed:

datetime             val1  val2  val1-streak  val2-streak
2018-04-01 00:00:00     4     1            0            0
2018-05-01 00:00:00     5     2            0            0
2018-06-01 00:00:00     5     2            1            1
2018-07-01 00:00:00     6     2            0            2
2018-08-01 00:00:00     7     2            0            3
2018-09-01 00:00:00     7     3            1            0
2018-10-01 00:00:00     7     3            2            1
2018-11-01 00:00:00     5     2            0            0

For now it is OK to assume that the datetime column is equally spaced. That way I can just count the "periods" since the value in a target column changed. Ideally the function could additionally compute the difference in datetime since the value last changed.
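For reference, the example above can be rebuilt as follows (a minimal sketch; the dates are month starts, matching the table):

import pandas as pd

df = pd.DataFrame({
    'datetime': pd.date_range('2018-04-01', periods=8, freq='MS'),
    'val1': [4, 5, 5, 6, 7, 7, 7, 5],
    'val2': [1, 2, 2, 2, 2, 3, 3, 2],
})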

I found a few solutions on this site, but when I looked more closely they did not really describe my problem:

Pandas - count since last transaction

Counting changes in pandas row values?

In pandas, how do you find the time since a column by value changes, when grouped by another column?

Pythonic way to calculate streaks in pandas dataframe

finding streaks in pandas dataframe

https://joshdevlin.com/blog/calculate-streaks-in-pandas/

... at least I was not able to transfer the given answers to my problem.

CodePudding user response:

Let's try the following:

(i) Create a stacked Series using set_index and unstack (this is so we don't repeat the same operations on the two columns separately).

(ii) Use groupby and diff to find the differences between consecutive elements; this is tmp in the code below.

(iii) We're interested in the locations where the differences are 0; we do another groupby and cumsum to get the desired outcome.

(iv) Use pivot to get the output back into the shape of df.

# stack val1/val2 into one Series and diff within each column group
tmp = df.set_index('datetime')[['val1','val2']].unstack().groupby(level=0).diff()

# count runs of zero-differences within each (column, streak) group, then
# pivot back to one column per variable (keyword args required in pandas >= 2.0)
df[['val1-streak','val2-streak']] = pd.pivot(tmp.eq(0)
                                             .groupby([tmp.index.get_level_values(0), 
                                                       tmp.ne(0).cumsum()])
                                             .cumsum()
                                             .reset_index(), 
                                             index='datetime', columns='level_0',
                                             values=0).to_numpy()

Output:

              datetime  val1  val2  val1-streak  val2-streak
0  2018-04-01 00:00:00     4     1            0            0
1  2018-05-01 00:00:00     5     2            0            0
2  2018-06-01 00:00:00     5     2            1            1
3  2018-07-01 00:00:00     6     2            0            2
4  2018-08-01 00:00:00     7     2            0            3
5  2018-09-01 00:00:00     7     3            1            0
6  2018-10-01 00:00:00     7     3            2            1
7  2018-11-01 00:00:00     5     2            0            0
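If the nested one-liner is hard to follow, the same pipeline can be unrolled into named intermediates (the names tmp, streaks and wide are just for illustration):

# (ii) stack val1/val2 into one Series and diff within each column group
tmp = df.set_index('datetime')[['val1', 'val2']].unstack().groupby(level=0).diff()

# (iii) a new streak starts wherever the difference is non-zero;
# cumulatively count the zero-differences inside each (column, streak) group
streaks = tmp.eq(0).groupby([tmp.index.get_level_values(0),
                             tmp.ne(0).cumsum()]).cumsum()

# (iv) reshape back to one column per variable, rows indexed by datetime
wide = streaks.reset_index().pivot(index='datetime', columns='level_0', values=0)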

CodePudding user response:

Let's start with how to do it for a single column. First off, we need to find each point where the value changes:

diffs = df['val1'].diff(1)
change_points = diffs != 0  # first diff is NaN, and NaN != 0 is True, so row 0 starts a streak

Since previously encountered values can show up again, we need a way to distinguish different streaks of the same value. We'll achieve that with a cumulative sum of the change-point array. For ease we'll create a temporary DataFrame to hold those results (you could create a new column in your DataFrame instead, but that's a bit messy).

change_points_cumsum = change_points.cumsum()
tmp_merged = pd.concat([df['val1'], change_points_cumsum], axis=1,
                       keys=['val1', 'change_points_cumsum'])

Every streak is assigned a different change_points_cumsum value, so now a simple groupby and cumulative count can be applied to get the final result:

tmp_merged["val1-streak"] = tmp_merged.groupby(change_points_cumsum).cumcount()
print(tmp_merged)

   val1  change_points_cumsum  val1-streak 
0     4                     1             0
1     5                     2             0
2     5                     2             1
3     6                     3             0
4     7                     4             0
5     7                     4             1
6     7                     4             2
7     5                     5             0

If you are looking for a shorter, more compact solution:

change_points_cumsum = df['val1'].diff(1).ne(0).cumsum()
change_points_cumsum.groupby(change_points_cumsum).cumcount()

0    0
1    0
2    1
3    0
4    0
5    1
6    2
7    0
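To write this back into the frame for both columns in one go (column names as in the question), a short loop works:

for col in ['val1', 'val2']:
    key = df[col].diff().ne(0).cumsum()
    df[f'{col}-streak'] = key.groupby(key).cumcount()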

CodePudding user response:

Use a custom function to generate a counter over consecutive values: compute the per-column differences with Series.diff, compare against zero with Series.ne, take the cumulative sum to label each streak, and finally pass the labels to GroupBy.cumcount:

vals = ['val1','val2']

def f(x):
    # label each streak: the label increments whenever the value changes
    x = x.diff().ne(0).cumsum()
    # position within each streak, starting at 0
    return x.groupby(x).cumcount()

df = df.join(df[vals].apply(f).add_suffix('-streak'))
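The second part of the question, the elapsed datetime since the last change, follows the same pattern: group by the streak labels and subtract the first timestamp of each streak. A sketch, assuming df['datetime'] is a datetime64 column (apply pd.to_datetime first if it is not) and with the '-elapsed' suffix chosen here for illustration:

def elapsed(x):
    # same streak labels as in f above
    grp = x.diff().ne(0).cumsum()
    # time since the first row of the current streak (0 at the change itself)
    return df['datetime'] - df['datetime'].groupby(grp).transform('first')

df = df.join(df[vals].apply(elapsed).add_suffix('-elapsed'))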