mimicking a complex SQL statement with DataFrames


I'm getting back into Python after a couple of years away. I am rewriting an old MS Access budget program in Python, specifically using pandas DataFrames.

There is a table called 'A' (which can be very large) that contains a history of expenditure. Table B contains a snapshot of recent expenditure that has to be added to table A. Note, however, that table B can contain duplicates that are already in A. What I need to do is find the entries in table B that are not in A and then append them to A.

In the Access program, the following MS Access SQL query was used:

SELECT A.Date, A.Item, A.Debit, A.Credit, A.Balance, A.Description
FROM A LEFT JOIN B
ON (A.Date=B.Date) AND (A.Item=B.Item) AND (A.Balance=B.Balance)
WHERE (((B.Date) Is Null)) Or (((B.Item) Is Null)) Or (((B.Balance) Is Null))

The resulting rows were then appended to A, one at a time.

How can I do this using DataFrames instead? I have tried different merge techniques without success.

Here are two sample DataFrames (the data has been modified and is not authentic):

import pandas as pd

dfA = pd.DataFrame({
    'Date': ['2023-01-25', '2023-01-24', '2023-01-24', '2023-01-23'],
    'Item': ['Visa Purchase 21Jan Hanaro Northlakes    Mango Hil    ',
             'Visa Purchase 21Jan Event Cinemas North  North Lak',
             'Visa Purchase 21Jan Event Cinemas North  North Lak',
             'Mcare Benefits 4880000027 Eywq'],
    'Debit': [67.0, 10.2, 7.65, 39.75],
    'Credit': [0.0, 0, 0, 0],
    'Balance': [1830.0, 1897.99, 1908.019, 1915.84],
    'Description': ['a', 'b', 'c', 'd']
})

dfB = pd.DataFrame({
    'Date': ['2023-01-23', '2023-01-23', '2023-01-23', '2023-01-23'],
    'Item': ['Csc R555558Df Nett',
             'Tfr Wdl BPAY Internet 25Jan05:31 208655638973732700058Deft Payments',
             'Eftpos Debit 25Jan15:19 Sq *Becs Cafe Kippa-Ring   Qldau',
             'Mcare Benefits 4880000027 Eywq'],
    'Debit': [0, 168.0, 9.0, 39.75],
    'Credit': [907.92, 0, 0, 0],
    'Balance': [2053.09, 1885.07, 1876.09, 1915.84],
    'Description': ['z', 'x', 's', 'f']
})


print(f"A\n {dfA.head()}")
print(f"\nB\n {dfB.head()}")

(The Date values here are strings, but in my program both DataFrames hold proper date types; see the sketch below for reproducing that with the samples.)
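A minimal way to put the sample frames into date form, just to reproduce my setup (assuming the strings have no time component that matters):

dfA['Date'] = pd.to_datetime(dfA['Date'])
dfB['Date'] = pd.to_datetime(dfB['Date'])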

Any help would be much appreciated.

CodePudding user response:

IIUC, you just want to add the rows from dfB that aren't already in dfA. I think this is straightforward with concat plus drop_duplicates. In your example it drops one duplicate row from dfB, so newdf ends up with 7 rows.

# Stack A and B, then drop rows whose (Date, Item, Balance) combination has
# already been seen -- keep='first' keeps the dfA copy of any duplicate.
newdf = (
    pd.concat([dfA, dfB])
    .drop_duplicates(subset=['Date', 'Item', 'Balance'], keep='first')
    .reset_index(drop=True)
)
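If you want something that mirrors the LEFT JOIN ... IS NULL anti-join of the original Access query more literally, a merge with indicator=True is an alternative. This is just a sketch against the sample frames above (the name only_in_B is mine, and it assumes the Date/Item/Balance values compare exactly; trailing whitespace in Item or float rounding in Balance would break the match here just as it would break drop_duplicates):

# Left-join B against A's key columns; rows that found no match in A are
# exactly the entries of dfB that are not already in dfA.
only_in_B = dfB.merge(
    dfA[['Date', 'Item', 'Balance']],
    on=['Date', 'Item', 'Balance'],
    how='left',
    indicator=True,
)
only_in_B = only_in_B[only_in_B['_merge'] == 'left_only'].drop(columns='_merge')

# Append only the genuinely new rows to A.
newdf = pd.concat([dfA, only_in_B], ignore_index=True)

Either way, assign the result back (e.g. dfA = newdf) if you want A to carry the new rows forward.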