For a given month, find whether there is at least one 8 day period where 7 of those 8 days are marke-CodePudding

OK, so this is a tricky question to describe.

I am putting together a compliance report for a study, whereby the team needs to know if each user performed a particular task on 7 out of 8 days at least once per month.

So I need a way to:

For a given user and a given month, search the month for an 8-day period where at least 7 of those 8 days were marked "TRUE".
Return a 1 if the requirement was met, 0 if the requirement was not met.

Here is an example of the data structure:

import pandas as pd

ids = 1
req_met = ['TRUE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'FALSE']
date = ['2018-01-01', '2018-01-03', '2018-01-04', '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08', '2018-01-09', '2018-01-10', '2018-01-11',
'2018-01-12',  '2018-01-13', '2018-01-14', '2018-01-15', '2018-01-16', '2018-01-17', '2018-01-18', '2018-01-19', '2018-01-21', '2018-01-23']

df = pd.DataFrame({'id':ids,
                   'Date':date,
                   'req_met':req_met})
print(df)

    id        Date req_met
0    1  2018-01-01    TRUE
1    1  2018-01-03    TRUE
2    1  2018-01-04   FALSE
3    1  2018-01-05   FALSE
4    1  2018-01-06    TRUE
5    1  2018-01-07    TRUE
6    1  2018-01-08    TRUE
7    1  2018-01-09   FALSE
8    1  2018-01-10    TRUE
9    1  2018-01-11    TRUE
10   1  2018-01-12    TRUE
11   1  2018-01-13    TRUE
12   1  2018-01-14   FALSE
13   1  2018-01-15   FALSE
14   1  2018-01-16   FALSE
15   1  2018-01-17   FALSE
16   1  2018-01-18    TRUE
17   1  2018-01-19    TRUE
18   1  2018-01-21    TRUE
19   1  2018-01-23   FALSE

For this user, the answer returned would be '1' because they do have an 8-day period in which 7 of the days were 'TRUE' (2018-01-06 to 2018-01-13). Note that the dates are not always consecutive, which is an added complication.

My desired output would be the most efficient function that could take this data and return either a '1' (requirement met) or a '0' (requirement not met)
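As a minimal sketch of the kind of function being asked for, assuming the flags for one user and one month are already in date order and that a "period" means 8 consecutive records (not necessarily 8 calendar days). The name `requirement_met` and its `window` and `needed` parameters are made up for illustration:

```python
import pandas as pd

def requirement_met(flags, window=8, needed=7):
    """Return 1 if any `window` consecutive records contain at least `needed` True values, else 0."""
    s = pd.Series(flags, dtype=bool).astype(int)
    # Rolling sum over row-based windows; incomplete leading windows are NaN and never qualify.
    hits = s.rolling(window).sum()
    return int((hits >= needed).any())

req_met = ['TRUE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'TRUE',
           'TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'FALSE']
flags = [value == 'TRUE' for value in req_met]
print(requirement_met(flags))  # 1
```

For the sample data, the window covering 2018-01-06 to 2018-01-13 holds 7 TRUE values, so the function returns 1.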

Thanks in advance for any assistance.

CodePudding user response：

First convert types to bool and datetime and calculate Month column:

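# Map the 'TRUE'/'FALSE' strings to real booleans
df['req_met'] = df['req_met'].map({'TRUE': True, 'FALSE': False})
# Month key such as '2018-01', used for grouping
df['Month'] = pd.to_datetime(df['Date']).dt.strftime("%Y-%m")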

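Then use the rolling method, which here counts TRUE values over windows of 8 consecutive rows for each user and month:

df_result = (df.groupby(['id','Month'])
             .rolling(8)['req_met'].sum().ge(7)   # True where an 8-row window has at least 7 TRUE
             .groupby(['id','Month'])
             .max()                               # True if any window in the month qualifies
             .reset_index()
             )

The result is: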

   id    Month  req_met
0   1  2018-01     True

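Please note that groupby is used twice: once to compute the rolling sums within each user and month, and once to reduce the per-window flags to a single value per user and month. You can inspect the calculations by running the code step by step to fully understand the logic.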
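The step-by-step inspection could look something like the sketch below. It rebuilds the sample data from the question, and the intermediate names (`rolling_sum`, `window_ok`, `result`) are only illustrative. The final cast to int is an addition here, to give the 1/0 output the question asked for:

```python
import pandas as pd

req_met = ['TRUE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'TRUE',
           'TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'FALSE']
date = ['2018-01-01', '2018-01-03', '2018-01-04', '2018-01-05', '2018-01-06', '2018-01-07',
        '2018-01-08', '2018-01-09', '2018-01-10', '2018-01-11', '2018-01-12', '2018-01-13',
        '2018-01-14', '2018-01-15', '2018-01-16', '2018-01-17', '2018-01-18', '2018-01-19',
        '2018-01-21', '2018-01-23']

df = pd.DataFrame({'id': 1, 'Date': date, 'req_met': req_met})
df['req_met'] = df['req_met'].map({'TRUE': True, 'FALSE': False}).astype(int)
df['Month'] = pd.to_datetime(df['Date']).dt.strftime('%Y-%m')

# Step 1: number of TRUE rows in each 8-row window, per user and month
rolling_sum = df.groupby(['id', 'Month'])['req_met'].rolling(8).sum()

# Step 2: flag the windows that reach at least 7
window_ok = rolling_sum.ge(7)

# Step 3: one value per user and month, 1 if any window qualified
result = window_ok.groupby(level=['id', 'Month']).max().astype(int).reset_index()
print(result)
```

Printing `rolling_sum` and `window_ok` in turn shows where the count first reaches 7, which for the sample data is the window ending on 2018-01-13.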

CodePudding user response：

EDIT: My bad, I misread your response to my question and thought you were confirming that the 8-day windows needed to be consecutive. Since that's not the case, @ipj's answer works well.

Old answer, from when I mistakenly thought every day in the 8-day windows needed to be consecutive:

First, I'll concatenate a couple copies with different id values:

df1 = pd.DataFrame({'id':ids, 'Date': date, 'req_met': req_met})

df2 = df1.copy()
df2.id = 2

df3 = df1.copy()
df3.id = 3

df = pd.concat([df1, df2, df3]).reset_index(drop=True)
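df.Date = pd.to_datetime(df.Date)
df.req_met = df.req_met == 'TRUE'  # convert the 'TRUE'/'FALSE' strings to booleans so they can be summed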

Now, build a mask that drops the rows whose date is not one day away from the previous or the next row:

>>> mask = ((df.Date - df.Date.shift(-1)).dt.days.abs() == 1) | (df.Date.diff().dt.days == 1)
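To see what this mask does, here is a small sketch on the dates from the question for a single user. It splits the mask into two named parts (`next_gap` and `prev_gap`, names chosen here for illustration) and lists the dates that would be dropped:

```python
import pandas as pd

date = ['2018-01-01', '2018-01-03', '2018-01-04', '2018-01-05', '2018-01-06', '2018-01-07',
        '2018-01-08', '2018-01-09', '2018-01-10', '2018-01-11', '2018-01-12', '2018-01-13',
        '2018-01-14', '2018-01-15', '2018-01-16', '2018-01-17', '2018-01-18', '2018-01-19',
        '2018-01-21', '2018-01-23']
df = pd.DataFrame({'Date': pd.to_datetime(date)})

# True where the next row is exactly one day later
next_gap = (df.Date - df.Date.shift(-1)).dt.days.abs() == 1
# True where the previous row is exactly one day earlier
prev_gap = df.Date.diff().dt.days == 1
mask = next_gap | prev_gap

# Dates with no neighbouring row one day away are the ones the mask removes
dropped = df.loc[~mask, 'Date'].dt.strftime('%Y-%m-%d').tolist()
print(dropped)  # ['2018-01-01', '2018-01-21', '2018-01-23']
```

Only the isolated dates are removed; the remaining 17 rows form one unbroken run from 2018-01-03 to 2018-01-19.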

Now you can do a rolling sum:

>>> (df[mask].groupby("id").req_met.rolling(8).sum() >= 7).groupby("id").sum()
id
1    1
2    1
3    1
Name: req_met, dtype: int64

All three are 1 only because I copied the original dataframe three times, but this should work for whatever dataframe you actually have. If your data cover more than one month, you'll also need to add the month to the groupby.
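If the windows did have to span 8 calendar days, a different technique from the mask above would be a time-based rolling window, which pandas supports through offset strings such as '8D' on a datetime index. The following is only a sketch under that assumption; it also adds a `Month` column to the groupby, and the names `rolled` and `result` are illustrative:

```python
import pandas as pd

req_met = ['TRUE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'TRUE',
           'TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'FALSE']
date = ['2018-01-01', '2018-01-03', '2018-01-04', '2018-01-05', '2018-01-06', '2018-01-07',
        '2018-01-08', '2018-01-09', '2018-01-10', '2018-01-11', '2018-01-12', '2018-01-13',
        '2018-01-14', '2018-01-15', '2018-01-16', '2018-01-17', '2018-01-18', '2018-01-19',
        '2018-01-21', '2018-01-23']

df = pd.DataFrame({'id': 1, 'Date': pd.to_datetime(date), 'req_met': req_met})
df['req_met'] = (df['req_met'] == 'TRUE').astype(int)
df['Month'] = df['Date'].dt.strftime('%Y-%m')

# Time-based windows need a sorted datetime index within each group
df = df.sort_values(['id', 'Date']).set_index('Date')

# '8D' covers the 8 calendar days ending on each row's date; missing days simply add nothing
rolled = df.groupby(['id', 'Month'])['req_met'].rolling('8D').sum()

# 1 if any 8-calendar-day window in the month holds at least 7 TRUE records
result = rolled.ge(7).groupby(level=['id', 'Month']).max().astype(int)
print(result)
```

With this approach a gap in the dates shortens the number of records inside a window instead of stretching the window over more calendar days, so a user with many missing days is less likely to qualify.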