We have a multiindex dataframe that looks like:
                              date  condition_1  condition_2
item1 0  2021-06-10 06:30:00+00:00         True        False
      1  2021-06-10 07:00:00+00:00        False         True
      2  2021-06-10 07:30:00+00:00         True         True
item2 3  2021-06-10 06:30:00+00:00         True        False
      4  2021-06-10 07:00:00+00:00         True         True
      5  2021-06-10 07:30:00+00:00         True         True
item3 6  2021-06-10 06:30:00+00:00         True         True
      7  2021-06-10 07:00:00+00:00        False         True
      8  2021-06-10 07:30:00+00:00         True         True
The value of date repeats between items (because the df is the result of a default concat on a dictionary of dataframes).
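For anyone who wants to reproduce this, here is a minimal sketch of how such a frame could arise. The helper `make_frame` and the per-item inner index ranges are assumptions made to mirror the table above; the real per-item frames may have been built differently.

```python
import pandas as pd

dates = pd.to_datetime(
    ["2021-06-10 06:30:00", "2021-06-10 07:00:00", "2021-06-10 07:30:00"],
    utc=True,
)

def make_frame(cond_1, cond_2, start):
    # Hypothetical helper: one per-item frame sharing the same three dates.
    return pd.DataFrame(
        {"date": dates, "condition_1": cond_1, "condition_2": cond_2},
        index=range(start, start + 3),
    )

frames = {
    "item1": make_frame([True, False, True], [False, True, True], 0),
    "item2": make_frame([True, True, True], [False, True, True], 3),
    "item3": make_frame([True, False, True], [True, True, True], 6),
}

# Default concat on a dict: the keys become the outer index level,
# and each frame's own index becomes the inner level.
df = pd.concat(frames)
print(df)
```

Note that the outer level is unnamed after a default concat, which matters for any later step that refers to it by name.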
The logic we want to vectorize is: for every date where condition_1 is True for all items, count the rows where condition_2 is True and write that count to a new result column on every row for that date; for any other date, write 0.
Based on the example above, the result would look like this (the comments next to the result column explain how each value is derived):
                              date  condition_1  condition_2  result
item1 0  2021-06-10 06:30:00+00:00         True        False       1  [because condition_1 is True for all items and condition_2 is True once]
      1  2021-06-10 07:00:00+00:00        False         True       0  [condition_1 is not True for all items so condition_2 is irrelevant]
      2  2021-06-10 07:30:00+00:00         True         True       3  [both conditions are True for all 3 items]
item2 3  2021-06-10 06:30:00+00:00         True        False       1  [a repeat for the same reasons]
      4  2021-06-10 07:00:00+00:00         True         True       0  [a repeat for the same reasons]
      5  2021-06-10 07:30:00+00:00         True         True       3  [a repeat for the same reasons]
item3 6  2021-06-10 06:30:00+00:00         True         True       1  [a repeat for the same reasons]
      7  2021-06-10 07:00:00+00:00        False         True       0  [a repeat for the same reasons]
      8  2021-06-10 07:30:00+00:00         True         True       3  [a repeat for the same reasons]
CodePudding user response:
Here is what I came up with.
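def cond_sum(g):
    # Count of condition_2 for this date, or 0 unless condition_1 holds for every item.
    return g.condition_1.all() * g.condition_2.sum()
# Name the outer index level 'item' and move it into a column.
df = df.rename_axis(['item', None]).reset_index(level=0)
# apply() returns one value per date, so map it back onto the rows by date.
per_date = df.groupby('date').apply(cond_sum)
df['result'] = df['date'].map(per_date)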
Then if you want the original index, you can add it back.
df = df.set_index('item', append=True).swaplevel()
Note, you mentioned vectorized, so you could swap that out for:
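dfg = df.groupby('date').agg({'condition_1': 'all', 'condition_2': 'sum'})
# dfg is indexed by date, so map the per-date product back onto the rows.
df['result'] = df['date'].map(dfg.condition_1 * dfg.condition_2)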
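If you would rather not touch the index at all, a `groupby(...).transform` variant may be worth trying. This is only a sketch: the frame below is a hand-built reconstruction of the example from the question (the index level name `item` is my assumption), and `transform` broadcasts each per-date aggregate back to the original rows, so no `map` or index reset is needed.

```python
import pandas as pd

dates = pd.to_datetime(
    ["2021-06-10 06:30:00", "2021-06-10 07:00:00", "2021-06-10 07:30:00"],
    utc=True,
)

# Reconstruction of the question's example; the level name "item" is assumed.
index = pd.MultiIndex.from_arrays(
    [["item1"] * 3 + ["item2"] * 3 + ["item3"] * 3, list(range(9))],
    names=["item", None],
)
df = pd.DataFrame(
    {
        "date": list(dates) * 3,
        "condition_1": [True, False, True, True, True, True, True, False, True],
        "condition_2": [False, True, True, False, True, True, True, True, True],
    },
    index=index,
)

by_date = df.groupby("date")
# Row-aligned: is condition_1 True for every item on this row's date?
all_c1 = by_date["condition_1"].transform("all")
# Row-aligned: how many rows on this row's date have condition_2 True?
sum_c2 = by_date["condition_2"].transform("sum")

# Keep the count where condition_1 holds for all items, otherwise 0.
df["result"] = all_c1.astype(int) * sum_c2.astype(int)
print(df)
```

Because both intermediate Series share the frame's index, the MultiIndex stays intact and the assignment lines up row by row.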