Home > Enterprise >  Sum values between two dates based on row conditions
Sum values between two dates based on row conditions

Time:12-14

I've been trying to sum values of two dataframes based on dates and also on others rows conditions, but have no idea on how to begin it with Python.

What I need to do is similar to a SUMIF in Excel/Google Sheets, it'll use columns 'loja' and 'item' and also consider if the dates on df1 and df2 are >= than the open date or <= on than the invent date on df3.

Look at the small versions of my dataframe.

I have df1 that looks like this:

    loja     date         item  sale
0   9288    2022-12-01  97845   10
1   9288    2022-12-03  97845   5
2   9288    2022-12-13  97845   6
3   9925    2022-12-02  33515   534
4   9288    2022-12-24  33515   839
5   9925    2022-12-12  33515   534
6   9925    2022-12-10  33515   234

And df2 that looks like this:

    loja    date       item    buy
0   9288    2022-12-01  97845   269
1   9288    2022-12-03  97845   415
2   9288    2022-12-30  97845   324
3   9925    2022-12-02  33515   116
4   9288    2022-12-24  33515   270
5   9925    2022-12-02  33515   248
6   9925    2022-12-15  33515   354

And df3 that look like this:

    loja    item    open    invent
0   9288    97845   2022-12-01  2022-12-20
1   9925    33515   2022-12-02  2022-12-12
2   9288    33515   2022-12-01  2022-12-20

So, in the end, I'm expecting to have a dataframe like this:

    loja    item    open    invent  buy sale
0   9288    97845   2022-12-01  2022-12-20  684 21
1   9925    33515   2022-12-02  2022-12-12  364 1302
2   9288    33515   2022-12-01  2022-12-20  0   0

Edit: thoses are small versions of my dataframes!

CodePudding user response:

You need to use SUMIFS formula with evaluating loja, item and 2 dates. This is output: enter image description here

Where formula in F20 (dragged down to F21 and F22) is =SUMIFS($E$11:$E$17,$C$11:$C$17,">="&D20,$C$11:$C$17,"<="&E20,$B$11:$B$17,B20,$D$11:$D$17,C20)

And formula in G20 (dragged down to G21 and G22) is =SUMIFS($E$2:$E$8,$C$2:$C$8,">="&D20,$C$2:$C$8,"<="&E20,$B$2:$B$8,B20,$D$2:$D$8,C20)

CodePudding user response:

Run a merge between df1 and df3, df2 and df3, and run a sum on the groupby:

buy_df = (df3
          .merge(df2, on = ['loja', 'item'])
            # loc might be faster
            # ensure numexpr is installed 
            # for more performance if using query
          .query("open<=date<=invent")
         )

sale_df = (df3
          .merge(df1, on = ['loja', 'item'])
          .query("open<=date<=invent")
         )

(pd
.concat([buy_df,sale_df])
.groupby([*df3.columns])
.sum(numeric_only=True)
.reindex(df3, fill_value=0)
)

                                    buy    sale
loja item  open       invent
9288 97845 2022-12-01 2022-12-20  684.0    21.0
9925 33515 2022-12-02 2022-12-12  364.0  1302.0
9288 33515 2022-12-01 2022-12-20    0.0     0.0

Another option is with conditional_join from pyjanitor - depending on the data size, this option might be faster:

# pip install pyjanitor
import pandas as pd
import janitor

sales = (df3
         .conditional_join(
             df1, 
             ('open', 'date', '<='), 
             ('invent', 'date', '>='), 
             ('loja', 'loja', '=='), 
             ('item', 'item', '=='), 
             how = 'left', 
             right_columns='sale')
         )

buy  =  (df3
         .conditional_join(
             df2, 
             ('open', 'date', '<='), 
             ('invent', 'date', '>='), 
             ('loja', 'loja', '=='), 
             ('item', 'item', '=='), 
             how = 'left', 
             right_columns='buy')
         )

pd.concat([buy, sales]).groupby([*df3.columns]).sum()

                                    buy    sale
loja item  open       invent
9288 33515 2022-12-01 2022-12-20    0.0     0.0
     97845 2022-12-01 2022-12-20  684.0    21.0
9925 33515 2022-12-02 2022-12-12  364.0  1302.0
  • Related