I want to find:
- data not found in provider, but found in accounting,
- data found in accounting, but not in provider
- data found in both, accounting and provider, but the amounts do not match.
To accomplish this I have to use three provided tables Accounting, Helper, Provider First I join Accounting with Helper on invoice. Then I join it with Provider on toID. After merge inner join I have a table which in this case creates two unnecessary records.
Accounting table looks like:
doc_nr | amount | invoice |
---|---|---|
a1 | -9.17 | i1 |
a2 | 9.17 | i1 |
Helper table:
toID | invoice |
---|---|
toid1 | i1 |
Provider table:
toID | tID | DATE | tAmount |
---|---|---|---|
toid1 | t1 | 2021-01-18 | 9.17 |
toid1 | t2 | 2021-01-19 | -9.17 |
After joining three tables using merge inner join I got a table:
doc_nr | amount | invoice | toID | tID | tAmount | DATE |
---|---|---|---|---|---|---|
a1 | -9.17 | i1 | toid1 | t1 | 9.17 | 2021-01-18 |
a1 | -9.17 | i1 | toid1 | t2 | -9.17 | 2021-01-19 |
a2 | 9.17 | i1 | toid1 | t1 | 9.17 | 2021-01-18 |
a2 | 9.17 | i1 | toid1 | t2 | -9.17 | 2021-01-19 |
Expected result should be only 2 records instead of 4:
doc_nr | amount | invoice | toID | tID | tAmount | DATE |
---|---|---|---|---|---|---|
a1 | -9.17 | i1 | toid1 | t2 | -9.17 | 2021-01-19 |
a2 | 9.17 | i1 | toid1 | t1 | 9.17 | 2021-01-18 |
Can't think of a way how to filter the table to get expected results.
Tried dropping duplicates based on doc_nr, but results gives:
doc_nr | amount | invoice | toID | tID | tAmount | DATE |
---|---|---|---|---|---|---|
a1 | -9.17 | i1 | toid1 | t2 | -9.17 | 2021-01-19 |
a2 | 9.17 | i1 | toid1 | t2 | -9.17 | 2021-01-18 |
It should have different tID; doc_nr and amount match tAmount. After join it seems it duplicates doc_nr creating it with other`s doc_nr and tID.
Any ideas on how to clean these duplicating rows and reach the expected result?
CodePudding user response:
Try this:
>>> df.assign(x=df['amount'].eq(df['tAmount']).cumsum()).replace(0, np.nan).dropna().drop_duplicates(subset=['doc_nr', 'x']).drop('x', axis=1)
doc_nr amount invoice toID tID tAmount DATE
1 a1 -9.17 i1 toid1 t2 -9.17 2021-01-19
2 a2 9.17 i1 toid1 t1 9.17 2021-01-18
>>>
CodePudding user response:
As you mentioned, you want amount
to match tAmount
:
It should have different tID; doc_nr and amount match tAmount.
then, you should include also these 2 fields in the second merge statement, as follows:
(Accounting.merge(Helper)
.merge(Provider, left_on=['toID', 'amount'], right_on=['toID', 'tAmount'])
)
use the left_on=
and right_on=
parameters to include the related 2 merge fields.
Result:
doc_nr amount invoice toID tID DATE tAmount
0 a1 -9.17 i1 toid1 t2 2021-01-19 -9.17
1 a2 9.17 i1 toid1 t1 2021-01-18 9.17