Home > Enterprise >  If values in multiple columns match another dataframe, get sum based on range of dates pandas
If values in multiple columns match another dataframe, get sum based on range of dates pandas

Time:05-26

I have a two dfs:

df1:

    item_code   store_code  start_1      end_1
0   11185      01           2022-03-06   2022-03-08
1   11185      02           2022-03-26   2022-03-28
2   118113     01           2022-02-02   2022-02-04
3   118113     02           2022-04-29   2022-04-30
4   118113     02           2022-05-01   2022-05-02
5   000544     01           2022-04-15   2022-04-17

df2:

    item_code       store_code  date_code   sales_sum
0   11185            01         2022-03-06  1.5
1   11185            01         2022-03-07  1.0
2   11185            01         2022-03-08  1.0
4   11185            02         2022-03-25  1.5
5   11185            02         2022-03-26  1.5
6   11185            02         2022-03-27  2.0
7   11185            02         2022-03-28  2.0
8   118113           01         2022-02-02  2.0
9   118113           01         2022-02-03  2.0
10  118113           01         2022-02-04  2.0
11  118113           02         2022-02-05  3.0
12  118113           02         2022-04-29  3.0
13  118113           02         2022-04-30  3.0
14  118113           02         2022-05-01  5.0
15  118113           02         2022-05-02  5.0
16  000544           01         2022-04-16  7.0

What I'm trying to do, is to get .sum() of sales_sum if item_code and store_code is the same as in df1, within a range of dates (start = start_1, end = end_1).

Expected output should be:

    item_code   store_code  start_1      end_1        sales_sum
0   11185      01           2022-03-06   2022-03-08   3.5
1   11185      02           2022-03-26   2022-03-28   5.5
2   118113     01           2022-02-02   2022-02-04   6.0
3   118113     02           2022-04-29   2022-04-30   6.0
4   118113     02           2022-05-01   2022-05-02   10.0
5   000544     01           2022-04-15   2022-04-17   7.0

Nothing work from what I tryed to do, like .sum() with conditions & or creating multiple dfs in dictionary in order to apply for each chunk of item_code and store_code a range of dates.

Size of df1: 33667 rows, and size of df2: 4408248 rows

CodePudding user response:

You can use merge and query:

out = (df1.merge(df2, how='outer', suffixes=('', '_'))
          .query('(start_1 <= date_code) & (date_code <= end_1)')
          .groupby(df1.columns.tolist(), as_index=False)['sales_sum'].sum())
print(out)

# Output
   item_code store_code     start_1       end_1  sales_sum
0      11185         01  2022-03-06  2022-03-08        3.5
1      11185         02  2022-03-26  2022-03-28        5.5
2     118113         01  2022-02-02  2022-02-04        6.0
3     118113         02  2022-04-29  2022-04-30        6.0
  • Related