I have this df:
DATE CODE YEAR_MONTH PP CONDITION
10958 1991-01-01 100015 1991-01 1.2 True
10959 1991-01-02 100015 1991-01 0.0 False
10960 1991-01-03 100015 1991-01 0.7 False
10961 1991-01-04 100015 1991-01 1.0 True
10962 1991-01-05 100015 1991-01 1.0 True
10962 1991-01-06 100015 1991-01 2.0 True
10962 1991-01-07 100015 1991-01 3.0 True
10962 1991-01-08 100015 1991-01 2.0 True
10962 1991-01-09 100015 1991-01 0.5 False
10962 1991-01-10 100015 1991-01 0.1 False
10962 1991-01-11 100015 1991-01 1.2 True
10962 1991-01-12 100015 1991-01 1.0 True
... ... ... ... ...
21911 1991-01-01 100016 1991-01 0.0 False
21912 1991-01-02 100016 1991-01 0.0 False
21913 1991-01-03 100016 1991-01 1.0 True
21914 1991-01-04 100016 1991-01 2.0 True
21915 1991-01-05 100016 1991-01 1.4 True
... ... ... ... ...
I want to obtain the maximum group of consecutive True values per YEAR_MONTH and CODE.
For example if i have 3 groups of consecutive True values in a specific YEAR_MONTH and CODE with 2, 3 and 5 consecutive True values, i want only the max consecutive group of True values. In this case 5.
PostData: df['CONDITION'] == True when df['PP'] >= 1
Expected result:
CODE YEAR_MONTH MAX_CONSEC_VALUE
100015 1991-01 5
100015 1991-02 3
100015 1991-03 4
100016 1991-01 3
100016 1991-02 2
100016 1991-03 2
... ... ...
I tried this code:
#Generate the CONDITION column
df['CONDITION']=(df['PP']>=1)
#Cumulative consecutive values
group=df.groupby(['CODE','YEAR_MONTH'])['CONDITION'].cumsum()
#Creating the group dataframe
group=pd.DataFrame({'index':group.index, 'PP':group.values})
#Merging the CODE and YEAR_MONTH
df.reset_index(drop=False,inplace=True)
consecutives=pd.merge(group,df[['index','CODE','YEAR_MONTH']],on='index', how='left')
#Getting the max value per CODE and YEAR_MONTH
consecutives_final = consecutives.groupby(['CODE','YEAR_MONTH']).max()
But this code gets the total True consecutive values by CODE and YEAR_MONTH.
Would you mind to help me?
Thanks in advance.
CodePudding user response:
Try groupby on the key along with the cumsum of the negate condition (for the consecutive True
blocks):
# this assumes the data is sorted by `CODE, YEAR_MONTH` already
(df.groupby(['CODE','YEAR_MONTH', (-df['CONDITION']).cumsum()])['CONDITION'].sum()
.groupby(['CODE','YEAR_MONTH']).max()
)
Output (from the sample data):
CODE YEAR_MONTH
100015 1991-01 5
100016 1991-01 3
Name: CONDITION, dtype: int64
CodePudding user response:
Let's try
out = (df.groupby(['CODE','YEAR_MONTH'])
.apply(lambda g: (g['CONDITION'].ne(g['CONDITION'].shift()).cumsum() # Group consecutive value
[g['CONDITION']] # Keep True
.value_counts().max())) # Find max count of True values among groups
.to_frame('MAX_CONSEC_VALUE').reset_index())
print(out)
CODE YEAR_MONTH MAX_CONSEC_VALUE
0 100015 1991-01 5
1 100016 1991-01 3