I've been trying to sum values of two dataframes based on dates and also on others rows conditions, but have no idea on how to begin it with Python.
What I need to do is similar to a SUMIF in Excel/Google Sheets, it'll use columns 'loja' and 'item' and also consider if the dates on df1 and df2 are >= than the open date or <= on than the invent date on df3.
Look at the small versions of my dataframe.
I have df1 that looks like this:
loja date item sale
0 9288 2022-12-01 97845 10
1 9288 2022-12-03 97845 5
2 9288 2022-12-13 97845 6
3 9925 2022-12-02 33515 534
4 9288 2022-12-24 33515 839
5 9925 2022-12-12 33515 534
6 9925 2022-12-10 33515 234
And df2 that looks like this:
loja date item buy
0 9288 2022-12-01 97845 269
1 9288 2022-12-03 97845 415
2 9288 2022-12-30 97845 324
3 9925 2022-12-02 33515 116
4 9288 2022-12-24 33515 270
5 9925 2022-12-02 33515 248
6 9925 2022-12-15 33515 354
And df3 that look like this:
loja item open invent
0 9288 97845 2022-12-01 2022-12-20
1 9925 33515 2022-12-02 2022-12-12
2 9288 33515 2022-12-01 2022-12-20
So, in the end, I'm expecting to have a dataframe like this:
loja item open invent buy sale
0 9288 97845 2022-12-01 2022-12-20 684 21
1 9925 33515 2022-12-02 2022-12-12 364 1302
2 9288 33515 2022-12-01 2022-12-20 0 0
Edit: thoses are small versions of my dataframes!
CodePudding user response:
You need to use SUMIFS
formula with evaluating loja, item and 2 dates.
This is output:
Where formula in F20
(dragged down to F21
and F22
) is =SUMIFS($E$11:$E$17,$C$11:$C$17,">="&D20,$C$11:$C$17,"<="&E20,$B$11:$B$17,B20,$D$11:$D$17,C20)
And formula in G20
(dragged down to G21
and G22
) is =SUMIFS($E$2:$E$8,$C$2:$C$8,">="&D20,$C$2:$C$8,"<="&E20,$B$2:$B$8,B20,$D$2:$D$8,C20)
CodePudding user response:
Run a merge between df1
and df3
, df2
and df3
, and run a sum on the groupby:
buy_df = (df3
.merge(df2, on = ['loja', 'item'])
# loc might be faster
# ensure numexpr is installed
# for more performance if using query
.query("open<=date<=invent")
)
sale_df = (df3
.merge(df1, on = ['loja', 'item'])
.query("open<=date<=invent")
)
(pd
.concat([buy_df,sale_df])
.groupby([*df3.columns])
.sum(numeric_only=True)
.reindex(df3, fill_value=0)
)
buy sale
loja item open invent
9288 97845 2022-12-01 2022-12-20 684.0 21.0
9925 33515 2022-12-02 2022-12-12 364.0 1302.0
9288 33515 2022-12-01 2022-12-20 0.0 0.0
Another option is with conditional_join
from pyjanitor
- depending on the data size, this option might be faster:
# pip install pyjanitor
import pandas as pd
import janitor
sales = (df3
.conditional_join(
df1,
('open', 'date', '<='),
('invent', 'date', '>='),
('loja', 'loja', '=='),
('item', 'item', '=='),
how = 'left',
right_columns='sale')
)
buy = (df3
.conditional_join(
df2,
('open', 'date', '<='),
('invent', 'date', '>='),
('loja', 'loja', '=='),
('item', 'item', '=='),
how = 'left',
right_columns='buy')
)
pd.concat([buy, sales]).groupby([*df3.columns]).sum()
buy sale
loja item open invent
9288 33515 2022-12-01 2022-12-20 0.0 0.0
97845 2022-12-01 2022-12-20 684.0 21.0
9925 33515 2022-12-02 2022-12-12 364.0 1302.0