OK, so this is a tricky question to describe.
I am putting together a compliance report for a study. The team needs to know whether, at least once per month, each user performed a particular task on at least 7 days out of an 8-day period.
So I need a way to:
- For a given user and a given month, search the month for an 8-day period where at least 7 of those 8 days were marked "TRUE"
- Return a 1 if the requirement was met, 0 if it was not.
Here is an example of the data structure:
import pandas as pd
ids = 1
req_met = ['TRUE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'FALSE']
date = ['2018-01-01', '2018-01-03', '2018-01-04', '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08', '2018-01-09', '2018-01-10', '2018-01-11',
'2018-01-12', '2018-01-13', '2018-01-14', '2018-01-15', '2018-01-16', '2018-01-17', '2018-01-18', '2018-01-19', '2018-01-21', '2018-01-23']
df = pd.DataFrame({'id':ids,
'Date':date,
'req_met':req_met})
print(df)
id Date req_met
0 1 2018-01-01 TRUE
1 1 2018-01-03 TRUE
2 1 2018-01-04 FALSE
3 1 2018-01-05 FALSE
4 1 2018-01-06 TRUE
5 1 2018-01-07 TRUE
6 1 2018-01-08 TRUE
7 1 2018-01-09 FALSE
8 1 2018-01-10 TRUE
9 1 2018-01-11 TRUE
10 1 2018-01-12 TRUE
11 1 2018-01-13 TRUE
12 1 2018-01-14 FALSE
13 1 2018-01-15 FALSE
14 1 2018-01-16 FALSE
15 1 2018-01-17 FALSE
16 1 2018-01-18 TRUE
17 1 2018-01-19 TRUE
18 1 2018-01-21 TRUE
19 1 2018-01-23 FALSE
For this user, the answer returned would be '1' because they do have an 8-day period in which 7 of those days were 'TRUE' (2018-01-06 to 2018-01-13). You can see that the dates are not always consecutive, which is an added complication.
My desired output would be the most efficient function that could take this data and return either a '1' (requirement met) or a '0' (requirement not met)
Thanks in advance for any assistance.
CodePudding user response:
First convert the types to bool and datetime, and calculate a Month column:
# map() yields a real bool column; replace() may leave object dtype in newer pandas
df['req_met'] = df['req_met'].map({'TRUE': True, 'FALSE': False})
df['Month'] = pd.to_datetime(df.Date).dt.strftime("%Y-%m")
Then use the rolling method:
df_result = (df.groupby(['id', 'Month'])
               .rolling(8)['req_met'].sum().ge(7)   # True where an 8-row window has >= 7 TRUE
               .groupby(['id', 'Month'])
               .max()                               # True if any window in the month qualifies
               .reset_index()
            )
The result is:
id Month req_met
0 1 2018-01 True
Please note that groupby is used twice. You can run the code step by step to inspect the intermediate results and fully understand the logic.
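One thing worth keeping in mind: rolling(8) counts 8 rows, not 8 calendar days, so a gap in the dates stretches the window over a longer period. If the requirement is meant as 8 calendar days, a time-based window is one possible variation. The sketch below is only an illustration under that assumption; the function name requirement_met and its parameters are made up for this example and are not part of the answer above.

```python
import pandas as pd

def requirement_met(df, window_days=8, min_true=7):
    """Hypothetical helper: one row per (id, Month), with req_met = 1 if some
    window of `window_days` calendar days holds at least `min_true` TRUE rows."""
    out = df.copy()
    out["Date"] = pd.to_datetime(out["Date"])
    # Accept both 'TRUE'/'FALSE' strings and real booleans.
    out["met"] = out["req_met"].astype(str).str.upper().eq("TRUE").astype(int)
    out["Month"] = out["Date"].dt.strftime("%Y-%m")
    out = out.sort_values(["id", "Date"])

    rows = []
    for (uid, month), group in out.groupby(["id", "Month"]):
        s = group.set_index("Date")["met"]
        # An offset window looks back over calendar time, not a fixed row count.
        hit = (s.rolling(f"{window_days}D").sum() >= min_true).any()
        rows.append({"id": uid, "Month": month, "req_met": int(hit)})
    return pd.DataFrame(rows, columns=["id", "Month", "req_met"])
```

Calling requirement_met(df) on the sample frame from the question should give a single row for 2018-01 with req_met equal to 1, since 2018-01-06 to 2018-01-13 contains 7 TRUE days. Missing dates simply contribute nothing to the window sum, which amounts to treating them as not met.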
CodePudding user response:
EDIT: My bad, I misread your response to my question and thought you were confirming that the days in the 8-day windows needed to be consecutive. Since that's not the case, @ipj's answer works well.
Old answer, from when I mistakenly thought every day in the 8-day windows needed to be consecutive:
First, I'll concatenate a couple of copies with different id values:
df1 = pd.DataFrame({'id':ids, 'Date': date, 'req_met': req_met})
df2 = df1.copy()
df2.id = 2
df3 = df1.copy()
df3.id = 3
df = pd.concat([df1, df2, df3]).reset_index(drop=True)
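df.Date = pd.to_datetime(df.Date)
# the rolling sum below needs booleans, not the strings 'TRUE'/'FALSE'
df.req_met = df.req_met.eq('TRUE')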
Now, build a mask that keeps only the rows whose date is exactly one day away from the previous or next row's date:
>>> mask = ((df.Date - df.Date.shift(-1)).dt.days.abs() == 1) | (df.Date.diff().dt.days == 1)
Now you can do a rolling sum:
>>> (df[mask].groupby("id").req_met.rolling(8).sum() >= 7).groupby("id").sum()
id
1 1
2 1
3 1
Name: req_met, dtype: int64
All three are 1 just because I copied the original dataframe three times, but this should work for whatever actual dataframe you have. If your data aren't already grouped by date, then you'll need to add that to the groupby.
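For the strictly consecutive reading, one way to add the month to the groupby is sketched below. Note that it swaps the mask for a different technique: each unbroken stretch of daily dates gets its own run label, so an 8-row window can never span a gap. The function name met_on_consecutive_days and the helper columns met, Month and run are invented for this illustration.

```python
import pandas as pd

def met_on_consecutive_days(df, window=8, min_true=7):
    """Hypothetical helper: 1 per (id, Month) if some run of `window`
    back-to-back calendar days has at least `min_true` TRUE rows."""
    out = df.copy()
    out["Date"] = pd.to_datetime(out["Date"])
    out["met"] = out["req_met"].astype(str).str.upper().eq("TRUE").astype(int)
    out["Month"] = out["Date"].dt.strftime("%Y-%m")
    out = out.sort_values(["id", "Date"]).reset_index(drop=True)

    # Start a new run whenever the gap to the same user's previous row is not one day.
    gap = out.groupby("id")["Date"].diff().dt.days
    out["run"] = gap.ne(1).cumsum()

    # Row-based rolling is safe here because every run has one row per day.
    rolled = out.groupby(["id", "Month", "run"])["met"].rolling(window).sum()
    hit = rolled.ge(min_true)  # incomplete windows are NaN and compare as False
    return (hit.groupby(level=["id", "Month"]).max()
               .astype(int)
               .reset_index(name="req_met"))
```

Because the month is part of the grouping, a run that crosses a month boundary is split in two, so a qualifying window has to fall entirely within one month under this sketch.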