I have two DataFrames:
df1:
item_code store_code start_1 end_1
0 11185 01 2022-03-06 2022-03-08
1 11185 02 2022-03-26 2022-03-28
2 118113 01 2022-02-02 2022-02-04
3 118113 02 2022-04-29 2022-04-30
4 118113 02 2022-05-01 2022-05-02
5 000544 01 2022-04-15 2022-04-17
df2:
item_code store_code date_code sales_sum
0 11185 01 2022-03-06 1.5
1 11185 01 2022-03-07 1.0
2 11185 01 2022-03-08 1.0
4 11185 02 2022-03-25 1.5
5 11185 02 2022-03-26 1.5
6 11185 02 2022-03-27 2.0
7 11185 02 2022-03-28 2.0
8 118113 01 2022-02-02 2.0
9 118113 01 2022-02-03 2.0
10 118113 01 2022-02-04 2.0
11 118113 02 2022-02-05 3.0
12 118113 02 2022-04-29 3.0
13 118113 02 2022-04-30 3.0
14 118113 02 2022-05-01 5.0
15 118113 02 2022-05-02 5.0
16 000544 01 2022-04-16 7.0
What I'm trying to do is get the .sum() of sales_sum where item_code and store_code match those in df1, within a range of dates (start = start_1, end = end_1).
Expected output should be:
item_code store_code start_1 end_1 sales_sum
0 11185 01 2022-03-06 2022-03-08 3.5
1 11185 02 2022-03-26 2022-03-28 5.5
2 118113 01 2022-02-02 2022-02-04 6.0
3 118113 02 2022-04-29 2022-04-30 6.0
4 118113 02 2022-05-01 2022-05-02 10.0
5 000544 01 2022-04-15 2022-04-17 7.0
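For reference, the sample frames above can be reconstructed like this (assuming store_code is kept as a string so leading zeros such as "01" survive, and the date columns are parsed as datetime):

```python
import pandas as pd

# Sample frames from the question.
df1 = pd.DataFrame({
    'item_code': ['11185', '11185', '118113', '118113', '118113', '000544'],
    'store_code': ['01', '02', '01', '02', '02', '01'],
    'start_1': pd.to_datetime(['2022-03-06', '2022-03-26', '2022-02-02',
                               '2022-04-29', '2022-05-01', '2022-04-15']),
    'end_1': pd.to_datetime(['2022-03-08', '2022-03-28', '2022-02-04',
                             '2022-04-30', '2022-05-02', '2022-04-17']),
})

df2 = pd.DataFrame({
    'item_code': ['11185'] * 7 + ['118113'] * 8 + ['000544'],
    'store_code': ['01'] * 3 + ['02'] * 4 + ['01'] * 3 + ['02'] * 5 + ['01'],
    'date_code': pd.to_datetime([
        '2022-03-06', '2022-03-07', '2022-03-08',
        '2022-03-25', '2022-03-26', '2022-03-27', '2022-03-28',
        '2022-02-02', '2022-02-03', '2022-02-04',
        '2022-02-05', '2022-04-29', '2022-04-30', '2022-05-01', '2022-05-02',
        '2022-04-16']),
    'sales_sum': [1.5, 1.0, 1.0, 1.5, 1.5, 2.0, 2.0,
                  2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 5.0, 5.0, 7.0],
})
```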
Nothing I've tried has worked, such as .sum() with & conditions, or building a dictionary of DataFrames so I could apply a date range to each chunk of item_code and store_code.
Size of df1: 33,667 rows; size of df2: 4,408,248 rows.
CodePudding user response:
You can use merge and query:
out = (df1.merge(df2, how='outer', suffixes=('', '_'))
.query('(start_1 <= date_code) & (date_code <= end_1)')
.groupby(df1.columns.tolist(), as_index=False)['sales_sum'].sum())
print(out)
# Output
item_code store_code start_1 end_1 sales_sum
0 11185 01 2022-03-06 2022-03-08 3.5
1 11185 02 2022-03-26 2022-03-28 5.5
2 118113 01 2022-02-02 2022-02-04 6.0
3 118113 02 2022-04-29 2022-04-30 6.0
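Given the stated sizes (33,667 × 4,408,248 rows), the outer merge builds a sizeable intermediate frame. A sketch of an equivalent approach that inner-joins only on the keys and filters with Series.between (same result; it still materializes the per-group product of intervals × dates, so memory may remain a concern, but unmatched rows are dropped early). The small df1/df2 here are a subset of the question's data, included only to make the sketch runnable:

```python
import pandas as pd

# Minimal sample (subset of the question's data) to illustrate.
df1 = pd.DataFrame({'item_code': ['11185', '11185'],
                    'store_code': ['01', '02'],
                    'start_1': pd.to_datetime(['2022-03-06', '2022-03-26']),
                    'end_1': pd.to_datetime(['2022-03-08', '2022-03-28'])})
df2 = pd.DataFrame({'item_code': ['11185'] * 5,
                    'store_code': ['01', '01', '01', '02', '02'],
                    'date_code': pd.to_datetime(['2022-03-06', '2022-03-07',
                                                 '2022-03-08', '2022-03-25',
                                                 '2022-03-26']),
                    'sales_sum': [1.5, 1.0, 1.0, 1.5, 1.5]})

# Inner-merge on the keys only, then keep rows whose date falls in the window.
merged = df1.merge(df2, on=['item_code', 'store_code'], how='inner')
in_range = merged['date_code'].between(merged['start_1'], merged['end_1'])
out = (merged[in_range]
       .groupby(['item_code', 'store_code', 'start_1', 'end_1'],
                as_index=False)['sales_sum'].sum())
```

This assumes date_code, start_1 and end_1 are all datetime (or at least consistently ISO-formatted strings), otherwise the between comparison will not order correctly.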