How to get the number of max consecutive values per month?-CodePudding

I have this df:

            DATE  CODE  YEAR_MONTH  PP  CONDITION
10958 1991-01-01  100015  1991-01  1.2      True
10959 1991-01-02  100015  1991-01  0.0      False
10960 1991-01-03  100015  1991-01  0.7      False
10961 1991-01-04  100015  1991-01  1.0      True
10962 1991-01-05  100015  1991-01  1.0      True
10962 1991-01-06  100015  1991-01  2.0      True
10962 1991-01-07  100015  1991-01  3.0      True
10962 1991-01-08  100015  1991-01  2.0      True
10962 1991-01-09  100015  1991-01  0.5      False
10962 1991-01-10  100015  1991-01  0.1      False
10962 1991-01-11  100015  1991-01  1.2      True
10962 1991-01-12  100015  1991-01  1.0      True
         ...     ...      ...  ...        ...
21911 1991-01-01  100016  1991-01  0.0      False
21912 1991-01-02  100016  1991-01  0.0      False
21913 1991-01-03  100016  1991-01  1.0      True
21914 1991-01-04  100016  1991-01  2.0      True
21915 1991-01-05  100016  1991-01  1.4      True
         ...     ...      ...  ...        ...

I want to obtain the maximum group of consecutive True values per YEAR_MONTH and CODE.

For example if i have 3 groups of consecutive True values in a specific YEAR_MONTH and CODE with 2, 3 and 5 consecutive True values, i want only the max consecutive group of True values. In this case 5.

PostData: df['CONDITION'] == True when df['PP'] >= 1

Expected result:

CODE  YEAR_MONTH MAX_CONSEC_VALUE  
100015  1991-01       5
100015  1991-02       3
100015  1991-03       4
100016  1991-01       3
100016  1991-02       2
100016  1991-03       2
...     ...          ...

I tried this code:

    #Generate the CONDITION column 
    df['CONDITION']=(df['PP']>=1)
    #Cumulative consecutive values
    group=df.groupby(['CODE','YEAR_MONTH'])['CONDITION'].cumsum()
    
    #Creating the group dataframe
    group=pd.DataFrame({'index':group.index, 'PP':group.values})
    
    
    #Merging the CODE and YEAR_MONTH
    df.reset_index(drop=False,inplace=True)
consecutives=pd.merge(group,df[['index','CODE','YEAR_MONTH']],on='index', how='left')    
    
#Getting the max value per CODE and YEAR_MONTH    
consecutives_final = consecutives.groupby(['CODE','YEAR_MONTH']).max()

But this code gets the total True consecutive values by CODE and YEAR_MONTH.

Would you mind to help me?

Thanks in advance.

CodePudding user response：

Try groupby on the key along with the cumsum of the negate condition (for the consecutive True blocks):

# this assumes the data is sorted by `CODE, YEAR_MONTH` already
(df.groupby(['CODE','YEAR_MONTH', (-df['CONDITION']).cumsum()])['CONDITION'].sum()
   .groupby(['CODE','YEAR_MONTH']).max()
)

Output (from the sample data):

CODE    YEAR_MONTH
100015  1991-01       5
100016  1991-01       3
Name: CONDITION, dtype: int64

CodePudding user response：

Let's try

out = (df.groupby(['CODE','YEAR_MONTH'])
       .apply(lambda g: (g['CONDITION'].ne(g['CONDITION'].shift()).cumsum() # Group consecutive value
                         [g['CONDITION']]                                   # Keep True
                         .value_counts().max()))                            # Find max count of True values among groups
       .to_frame('MAX_CONSEC_VALUE').reset_index())

print(out)

     CODE YEAR_MONTH  MAX_CONSEC_VALUE
0  100015    1991-01                 5
1  100016    1991-01                 3