How can I figure out the average consecutive duration of "True" values in pandas df, per g-CodePudding

With the following data, I think I want a column (DESIRED_DURATION_COL) to work out the duration (according to start_datetime) of consecutive Truths:

project_id	start_datetime	diag_local_code	DESIRED_DURATION_COL
1	2017-01-18	False	0
1	2019-04-14	True	0
1	2019-04-17	True	3
1	2019-04-19	False	0
1	2019-04-23	True	0
1	2019-04-25	True	2
1	2019-04-30	True	7
1	2019-05-21	False	0

This is so I can get the average truth duration (mean), per project_id, such that I get a df like:

project_id	avg_duration
1	5
2	8
3	2

Can't work out how to do this, thanks in advance!

CodePudding user response：

Solution for calculating duration:

m = df['diag_local_code']
dt = df[m].groupby(['project_id', (~m).cumsum()])['start_datetime'].transform('first')
df['duration'] = df['start_datetime'].sub(dt).dt.days.fillna(0)

How this works?

Use cumsum on inverted diag_local_code to identify groups of consecutive ones per project_id, then filter the rows where diag_local_code is True then group the filtered dataframe and transform start_datetime with first to broadcast first date value across each group, finally subtract the broadcasted date value from start_datetime to calculate the desired duration

Result

   project_id start_datetime  diag_local_code  duration
0           1     2017-01-18            False       0.0
1           1     2019-04-14             True       0.0
2           1     2019-04-17             True       3.0
3           1     2019-04-19            False       0.0
4           1     2019-04-23             True       0.0
5           1     2019-04-25             True       2.0
6           1     2019-04-30             True       7.0
7           1     2019-05-21            False       0.0

Solution for calculating average consecutive duration of True values

m = df['diag_local_code']
(
    df[m].groupby(['project_id', (~m).cumsum()])['start_datetime']
         .agg(np.ptp).dt.days.groupby(level=0).mean().reset_index(name='avg_duration')
)

Result:

   project_id  avg_duration
0           1           5.0

CodePudding user response：

You can group by project_id column and split each group into consecutive value groups. Then check the groups value is all True.

def avg_duration(group):
    subgroup = group.groupby(group['diag_local_code'].diff().ne(0).cumsum())
    true_count = subgroup.apply(lambda g: g['diag_local_code'].all()).sum()
    true_last_sum = subgroup.apply(lambda g: g.iloc[-1]['DESIRED_DURATION_COL'] if g['diag_local_code'].all() else 0).sum()
    return true_last_sum/true_count

out = df.groupby('project_id').apply(avg_duration).to_frame('avg_duration').reset_index()

print(out)

   project_id  avg_duration
0           1           5.0