Find max number of consecutive months a customer has been active in pandas-CodePudding

I am trying to find the max number of consecutive months a customer is active with the store. Here is the data I have.

 ----- --------- -------------- -------- ------------- 
| Id  | t_year  | t_month_prx  | Store  | diff_months |
 ----- --------- -------------- -------- ------------- 
|  1  |   2021  |        10.0  | A001   |         1.0 |
|  1  |   2022  |         1.0  | A001   |         1.0 |
|  1  |   2022  |         2.0  | A001   |         1.0 |
|  2  |   2021  |         1.0  | A001   |         1.0 |
|  2  |   2021  |         2.0  | A001   |         1.0 |
|  2  |   2021  |         3.0  | A001   |         1.0 |
|  2  |   2021  |         6.0  | A001   |         3.0 |
|  2  |   2021  |         7.0  | A001   |         1.0 |
|  2  |   2021  |         8.0  | A001   |         1.0 |
|  2  |   2021  |         9.0  | A001   |         1.0 |
|  2  |   2021  |        10.0  | A001   |         1.0 |
|  2  |   2022  |         1.0  | A001   |         1.0 |
|  2  |   2022  |         2.0  | A001   |         1.0 |
|  2  |   2021  |         1.0  | A002   |         1.0 |
|  2  |   2021  |         2.0  | A002   |         1.0 |
|  2  |   2021  |         3.0  | A002   |         1.0 |
|  2  |   2021  |         6.0  | A002   |         3.0 |
|  2  |   2021  |         7.0  | A002   |         1.0 |
|  2  |   2021  |         8.0  | A002   |         1.0 |
|  2  |   2021  |         9.0  | A002   |         1.0 |
|  2  |   2021  |        10.0  | A002   |         1.0 |
|  3  |   2021  |        10.0  | A002   |         1.0 |
|  3  |   2022  |         1.0  | A002   |         1.0 |
|  3  |   2022  |         2.0  | A002   |         1.0 |
 ----- --------- -------------- -------- -------------

Original problem involved skipping two months so I made a month proxy. So, instead of 12 months there are 10 months in a year.

what I have tried so far is:

df = df.sort_values(by=['Id','t_year','t_month_prx'], ascending = True).reset_index(drop=True)
df['diff_months'] = df.groupby(['Id', 't_year'])['t_month_prx'].diff()
df['diff_months'].fillna(method='bfill', inplace=True)

and I get this result

df_result = pd.DataFrame({
'Id': {0: 1,1: 1,2: 1,3: 2,4: 2,5: 2,6: 2,7: 2, 8: 2,9: 2, 10: 2, 11: 2, 12: 2},
't_year': {0: 2021, 1: 2022, 2: 2022, 3: 2021,4: 2021,5: 2021,6: 2021,7: 2021,8: 2021,9: 2021,10: 2021,11: 2022,12: 2022},
 't_month_prx': {0: 10.0,1: 1.0,2: 2.0,3: 1.0,4: 2.0,5: 3.0,6: 6.0,7: 7.0,
 8: 8.0,9: 9.0,10: 10.0,11: 1.0,12: 2.0},
'diff_months': {0: 1.0, 1: 1.0, 2: 1.0,3: 1.0,4: 1.0,5: 1.0,6: 3.0,7: 1.0,8: 1.0,9: 1.0,10: 1.0,11: 1.0, 12: 1.0}
})

then finally I tired to count all consecutive 1s

df.groupby([df['Id'], df['diff_months'].ne(df.groupby('Id')['diff_months'].shift(1)).cumsum()])['diff_months'].sum().groupby(level=0).max().reset_index(name='consecutive_month')

It gives me following results

pd.DataFrame({
'Id': {0: 1,1: 2},
'counts': {0: 3.0,1: 6.0}
})

but desired output is:

pd.DataFrame({'Id': [1,2, 2, 3], 'Store': ['A001','A001', 'A002', 'A002'], 'counts': [3, 7, 5, 3]})

which looks like:

   Id Store  counts
0   1  A001       3
1   2  A001       7
2   2  A002       5
3   3  A002       3

So, for 2nd customer it should be 7 months, since I am counting only 1s it skips 3. similarly there can be multiple smaller sequences of 1's and in that case will have to select the max count of 1's. Is my approach good? Any idea how I can count consecutive months that can span over different years?

CodePudding user response：

You could first create groups using groupby diff (in any given year, to be "consecutive", the difference has to be 1; across years, it has to be -9). Then use the groups in another groupby size to find the consecutive counts; then do yet another groupby max to find the maximum consecutive counts per Id.

cols = ['Id', 'Store']
g = df.groupby(cols)
month_diff = g['t_month_prx'].diff()
year_diff = g['t_year'].diff()
nonconsecutive = ~((year_diff.eq(0) & month_diff.eq(1)) | (year_diff.eq(1) & month_diff.eq(-9)))
out = df.groupby([*cols, nonconsecutive.cumsum()]).size().droplevel(-1).groupby(cols).max().reset_index(name='counts')

Output:

   Id Store  counts
0   1  A001       3
1   2  A001       7
2   2  A002       5
3   3  A002       3

CodePudding user response：

this code works with the given example dataset:

df = df.sort_values(['Id','t_year','t_month_prx'])

gr = (df.groupby(['Store','Id']).
      apply(lambda x: (~x['t_month_prx'].diff().isin([1,-9])).cumsum()).
      reset_index(name='num'))
res = (gr.groupby(['Store','Id']).
       apply(lambda x: x.num.value_counts().max()).
       reset_index(name='counts'))

print(res)
'''
  Store  Id  counts
0  A001   1       3
1  A001   2       7
2  A002   2       5
3  A002   3       3