Working with a multiindex dataframe, to get summation results over a boolean column, based on a cond

Time:05-15

We have a multiindex dataframe that looks like:

                                  date   condition_1    condition_2
item1   0    2021-06-10 06:30:00+00:00          True          False
        1    2021-06-10 07:00:00+00:00         False           True
        2    2021-06-10 07:30:00+00:00          True           True
item2   3    2021-06-10 06:30:00+00:00          True          False
        4    2021-06-10 07:00:00+00:00          True           True
        5    2021-06-10 07:30:00+00:00          True           True
item3   6    2021-06-10 06:30:00+00:00          True           True
        7    2021-06-10 07:00:00+00:00         False           True
        8    2021-06-10 07:30:00+00:00          True           True

The value of date repeats between items (because the df is a result of a default concat on a dictionary of dataframes).
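For anyone who wants to reproduce this, here is a minimal sketch of how such a frame could be built. The UTC offset, the inner index labels (0-2, 3-5, 6-8) and the name `frames` are assumptions read off the table, not the real data source.

```python
import pandas as pd

# Three timestamps shared by every item (UTC offset assumed from the table).
dates = pd.to_datetime(
    [
        "2021-06-10 06:30:00+00:00",
        "2021-06-10 07:00:00+00:00",
        "2021-06-10 07:30:00+00:00",
    ]
)

# (condition_1, condition_2) values per item, copied from the table.
flags = {
    "item1": ([True, False, True], [False, True, True]),
    "item2": ([True, True, True], [False, True, True]),
    "item3": ([True, False, True], [True, True, True]),
}

# One small frame per item; inner labels chosen to match the table.
frames = {
    name: pd.DataFrame(
        {"date": dates, "condition_1": c1, "condition_2": c2},
        index=range(3 * i, 3 * i + 3),
    )
    for i, (name, (c1, c2)) in enumerate(flags.items())
}

# A default concat on a dict uses the keys as an unnamed outer index level.
df = pd.concat(frames)
print(df)
```

Note that neither index level has a name after the concat, so the outer level has to be named explicitly before it can be referred to as a column.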

The logic we want to vectorize is: "for every date where condition_1 is True for all items, count the rows where condition_2 is True and write that count to a new result column on every row with that date; for any other date, write 0".

Based on the example above, the result would look like this (the bracketed comments next to the result column explain how each value is derived):

                                  date   condition_1    condition_2    result
item1   0    2021-06-10 06:30:00+00:00          True          False         1 [because condition_1 is True for all items and condition_2 is True once]
        1    2021-06-10 07:00:00+00:00         False           True         0 [condition_1 is not True for all items so condition_2 is irrelevant]
        2    2021-06-10 07:30:00+00:00          True           True         3 [both conditions are True for all 3 items]
item2   3    2021-06-10 06:30:00+00:00          True          False         1 [a repeat for the same reasons]
        4    2021-06-10 07:00:00+00:00          True           True         0 [a repeat for the same reasons]
        5    2021-06-10 07:30:00+00:00          True           True         3 [a repeat for the same reasons]
item3   6    2021-06-10 06:30:00+00:00          True           True         1 [a repeat for the same reasons]
        7    2021-06-10 07:00:00+00:00         False           True         0 [a repeat for the same reasons]
        8    2021-06-10 07:30:00+00:00          True           True         3 [a repeat for the same reasons]

CodePudding user response:

Here is what I came up with.

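def cond_sum(g):
    # condition_2 count for this date if condition_1 holds for every item, else 0
    return g['condition_1'].all() * g['condition_2'].sum()

# The outer level is unnamed after a default concat, so name it before resetting
df = df.rename_axis(['item', None]).reset_index(level=0)
# apply returns one value per date; map it back onto the rows
df['result'] = df['date'].map(df.groupby('date').apply(cond_sum))

Then if you want the original index, you can add it back.

df = df.set_index('item', append=True).swaplevel()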

Note, you mentioned vectorized, so you could swap the apply out for transform, which keeps the original row index and so needs no mapping back:

g = df.groupby('date')
df['result'] = g['condition_1'].transform('all') * g['condition_2'].transform('sum')
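
As a sanity check, here is a self-contained sketch that runs the per-date logic on the sample data from the question using `groupby(...).transform`. The frame is rebuilt by hand here (UTC offset assumed), and the names `by_date`, `all_c1` and `n_c2` are just illustrative.

```python
import pandas as pd

times = [
    "2021-06-10 06:30:00+00:00",
    "2021-06-10 07:00:00+00:00",
    "2021-06-10 07:30:00+00:00",
]

# Sample data from the question, with the same two-level index.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(times * 3),
        "condition_1": [True, False, True, True, True, True, True, False, True],
        "condition_2": [False, True, True, False, True, True, True, True, True],
    },
    index=pd.MultiIndex.from_arrays(
        [["item1"] * 3 + ["item2"] * 3 + ["item3"] * 3, list(range(9))]
    ),
)

by_date = df.groupby("date")
# True on every row of a date only if condition_1 holds for all items on it.
all_c1 = by_date["condition_1"].transform("all")
# Number of rows on the same date where condition_2 is True.
n_c2 = by_date["condition_2"].transform("sum")

# Multiplying by the boolean mask zeroes out dates that fail condition_1.
df["result"] = (all_c1 * n_c2).astype(int)

print(df["result"].tolist())  # [1, 0, 3, 1, 0, 3, 1, 0, 3]
```

Because transform returns a Series aligned to the original rows, this works directly on the MultiIndex frame without resetting or restoring the index.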