I have two DataFrames:
df1:
item_code store_code start_1 end_1
0 11185 01 2022-03-06 2022-03-08
1 11185 02 2022-03-26 2022-03-28
2 118113 01 2022-02-02 2022-02-04
3 118113 02 2022-04-29 2022-04-30
4 118113 02 2022-05-01 2022-05-02
5 000544 01 2022-04-15 2022-04-17
df2:
item_code store_code date_code sales_sum
0 11185 01 2022-03-06 1.5
1 11185 01 2022-03-07 1.0
2 11185 01 2022-03-08 1.0
4 11185 02 2022-03-25 1.5
5 11185 02 2022-03-26 1.5
6 11185 02 2022-03-27 2.0
7 11185 02 2022-03-28 2.0
8 118113 01 2022-02-02 2.0
9 118113 01 2022-02-03 2.0
10 118113 01 2022-02-04 2.0
11 118113 02 2022-02-05 3.0
12 118113 02 2022-04-29 3.0
13 118113 02 2022-04-30 3.0
14 118113 02 2022-05-01 5.0
15 118113 02 2022-05-02 5.0
16 000544 01 2022-04-16 7.0
What I'm trying to do is get the .sum() of sales_sum where item_code and store_code match those in df1, within a range of dates (start = start_1, end = end_1).
Expected output should be:
item_code store_code start_1 end_1 sales_sum
0 11185 01 2022-03-06 2022-03-08 3.5
1 11185 02 2022-03-26 2022-03-28 5.5
2 118113 01 2022-02-02 2022-02-04 6.0
3 118113 02 2022-04-29 2022-04-30 6.0
4 118113 02 2022-05-01 2022-05-02 10.0
5 000544 01 2022-04-15 2022-04-17 7.0
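For reference, the sample frames above can be reconstructed like this (assuming store_code is kept as a string so leading zeros such as "01" survive, and the date columns are parsed as datetime):

```python
import pandas as pd

# Sample frames from the question.
df1 = pd.DataFrame({
    'item_code': ['11185', '11185', '118113', '118113', '118113', '000544'],
    'store_code': ['01', '02', '01', '02', '02', '01'],
    'start_1': pd.to_datetime(['2022-03-06', '2022-03-26', '2022-02-02',
                               '2022-04-29', '2022-05-01', '2022-04-15']),
    'end_1': pd.to_datetime(['2022-03-08', '2022-03-28', '2022-02-04',
                             '2022-04-30', '2022-05-02', '2022-04-17']),
})

df2 = pd.DataFrame({
    'item_code': ['11185'] * 7 + ['118113'] * 8 + ['000544'],
    'store_code': ['01'] * 3 + ['02'] * 4 + ['01'] * 3 + ['02'] * 5 + ['01'],
    'date_code': pd.to_datetime([
        '2022-03-06', '2022-03-07', '2022-03-08',
        '2022-03-25', '2022-03-26', '2022-03-27', '2022-03-28',
        '2022-02-02', '2022-02-03', '2022-02-04',
        '2022-02-05', '2022-04-29', '2022-04-30', '2022-05-01', '2022-05-02',
        '2022-04-16']),
    'sales_sum': [1.5, 1.0, 1.0, 1.5, 1.5, 2.0, 2.0,
                  2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 5.0, 5.0, 7.0],
})
```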
Nothing I've tried has worked, such as .sum() with & conditions, or building a dictionary of DataFrames so I could apply a date range to each chunk of item_code and store_code.
Size of df1: 33,667 rows; size of df2: 4,408,248 rows.
CodePudding user response:
You can use merge and query:
out = (df1.merge(df2, how='outer', suffixes=('', '_'))
.query('(start_1 <= date_code) & (date_code <= end_1)')
.groupby(df1.columns.tolist(), as_index=False)['sales_sum'].sum())
print(out)
# Output
item_code store_code start_1 end_1 sales_sum
0 11185 01 2022-03-06 2022-03-08 3.5
1 11185 02 2022-03-26 2022-03-28 5.5
2 118113 01 2022-02-02 2022-02-04 6.0
3 118113 02 2022-04-29 2022-04-30 6.0
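Given the stated sizes (33,667 × 4,408,248 rows), the outer merge builds a sizeable intermediate frame. A sketch of an equivalent approach that inner-joins only on the keys and filters with Series.between (same result; it still materializes the per-group product of intervals × dates, so memory may remain a concern, but unmatched rows are dropped early). The small df1/df2 here are a subset of the question's data, included only to make the sketch runnable:

```python
import pandas as pd

# Minimal sample (subset of the question's data) to illustrate.
df1 = pd.DataFrame({'item_code': ['11185', '11185'],
                    'store_code': ['01', '02'],
                    'start_1': pd.to_datetime(['2022-03-06', '2022-03-26']),
                    'end_1': pd.to_datetime(['2022-03-08', '2022-03-28'])})
df2 = pd.DataFrame({'item_code': ['11185'] * 5,
                    'store_code': ['01', '01', '01', '02', '02'],
                    'date_code': pd.to_datetime(['2022-03-06', '2022-03-07',
                                                 '2022-03-08', '2022-03-25',
                                                 '2022-03-26']),
                    'sales_sum': [1.5, 1.0, 1.0, 1.5, 1.5]})

# Inner-merge on the keys only, then keep rows whose date falls in the window.
merged = df1.merge(df2, on=['item_code', 'store_code'], how='inner')
in_range = merged['date_code'].between(merged['start_1'], merged['end_1'])
out = (merged[in_range]
       .groupby(['item_code', 'store_code', 'start_1', 'end_1'],
                as_index=False)['sales_sum'].sum())
```

This assumes date_code, start_1 and end_1 are all datetime (or at least consistently ISO-formatted strings), otherwise the between comparison will not order correctly.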