Combine dataframes based on multiple conditions in python-CodePudding

Table A

Item    amount  year
A   100 2011
A   110 2011
A   125 2012
A   25  2012
B   54  2013
C   55  2020
D   68  2022

Table B

item    year    desc
A   2011    xxx
A   2011    xxx
A   2012    xxx
B   2011    xxx
B   2012    xxx
B   2013    xxx
B   2014    xxx
C   2020    xxx
D   2022    xxx

Table C

item    year    desc    Total
A   2011    xxx 220
A   2011    xxx 0
A   2012    xxx 150
B   2011    xxx 0
B   2012    xxx 0
B   2013    xxx 54
B   2014    xxx 0
C   2020    xxx 55
D   2022    xxx 68

These are example tables that represent dataframes that I'm going to create from seperate excel sheets.

Basically there's a many to many relationship going on and I want to be able to create a combined sheet that will roll up the "amount" total (from Table A) for each year of the item (Table B). I don't want to duplicate the amounts for multiple items with the same year as you see in table A, item A has two 2011 entries.

What I'd like to get is a result like Table C below

I'm currently trying to use pandas merge but not having much luck. Can anyone lend a hand?

CodePudding user response：

IIUC, you can use a merge and post-process to remove the duplicates per year:

out = (dfB
    .merge(dfA.rename(columns={'Item': 'item'})
              .groupby(['item', 'year'], as_index=False).sum(), how='left')
    .assign(amount=lambda d: d['amount']
                             .mask(d.groupby('year').cumcount().gt(0), 0)
                             .fillna(0)
           )
)

output:

  item  year desc  amount
0    A  2011  xxx   210.0
1    A  2011  xxx     0.0
2    A  2012  xxx   150.0
3    B  2011  xxx     0.0
4    B  2012  xxx     0.0
5    B  2013  xxx    54.0
6    B  2014  xxx     0.0
7    C  2020  xxx    55.0
8    D  2022  xxx    68.0