mimicking a complex SQL statement with DataFrames


I'm getting back into Python after a couple of years away. I am rewriting an old MS Access budget program in Python, specifically using pandas DataFrames.

There is a table called 'A' (which can be very large) that contains a history of expenditure. Table B contains a snapshot of recent expenditure that has to be added to table A. Note, however, that table B can contain duplicates that are already in A. What I need to do is find the entries in table B that are not in A and then append them to A.

In the Access program, the following MS Access SQL query was used:

SELECT A.Date, A.Item, A.Debit, A.Credit, A.Balance, A.Description
FROM A LEFT JOIN B
ON (A.Date=B.Date) AND (A.Item=B.Item) AND (A.Balance=B.Balance)
WHERE (((B.Date) Is Null)) Or (((B.Item) Is Null)) Or (((B.Balance) Is Null))

The resulting rows were then appended to A, one at a time.

How can I do this using DataFrames instead? I have tried different merge techniques without success.

Here are two sample DataFrames (the data has been modified and is not authentic):

import pandas as pd

dfA = pd.DataFrame({
    'Date': ['2023-01-25', '2023-01-24', '2023-01-24', '2023-01-23'],
    'Item': ['Visa Purchase 21Jan Hanaro Northlakes    Mango Hil    ',
             'Visa Purchase 21Jan Event Cinemas North  North Lak',
             'Visa Purchase 21Jan Event Cinemas North  North Lak',
             'Mcare Benefits 4880000027 Eywq'],
    'Debit': [67.0, 10.2, 7.65, 39.75],
    'Credit': [0.0, 0, 0, 0],
    'Balance': [1830.0, 1897.99, 1908.019, 1915.84],
    'Description': ['a', 'b', 'c', 'd']
})

dfB = pd.DataFrame({
    'Date': ['2023-01-23', '2023-01-23', '2023-01-23', '2023-01-23'],
    'Item': ['Csc R555558Df Nett',
             'Tfr Wdl BPAY Internet 25Jan05:31 208655638973732700058Deft Payments',
             'Eftpos Debit 25Jan15:19 Sq *Becs Cafe Kippa-Ring   Qldau',
             'Mcare Benefits 4880000027 Eywq'],
    'Debit': [0, 168.0, 9.0, 39.75],
    'Credit': [907.92, 0, 0, 0],
    'Balance': [2053.09, 1885.07, 1876.09, 1915.84],
    'Description': ['z', 'x', 's', 'f']
})


print(f"A\n {dfA.head()}")
print(f"\nB\n {dfB.head()}")

(The Date values here are strings, but in my program both DataFrames hold proper date types; see the sketch below for reproducing that with the samples.)
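A minimal way to put the sample frames into date form, just to reproduce my setup (assuming the strings have no time component that matters):

dfA['Date'] = pd.to_datetime(dfA['Date'])
dfB['Date'] = pd.to_datetime(dfB['Date'])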

Any help would be much appreciated.

CodePudding user response:

IIUC, you just want to add the rows from dfB that aren't already in dfA. I think this is straightforward with concat plus drop_duplicates. In your example it drops one duplicate row from dfB, so newdf ends up with 7 rows.

# Stack A and B, then drop rows whose (Date, Item, Balance) combination has
# already been seen -- keep='first' keeps the dfA copy of any duplicate.
newdf = (
    pd.concat([dfA, dfB])
    .drop_duplicates(subset=['Date', 'Item', 'Balance'], keep='first')
    .reset_index(drop=True)
)
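If you want something that mirrors the LEFT JOIN ... IS NULL anti-join of the original Access query more literally, a merge with indicator=True is an alternative. This is just a sketch against the sample frames above (the name only_in_B is mine, and it assumes the Date/Item/Balance values compare exactly; trailing whitespace in Item or float rounding in Balance would break the match here just as it would break drop_duplicates):

# Left-join B against A's key columns; rows that found no match in A are
# exactly the entries of dfB that are not already in dfA.
only_in_B = dfB.merge(
    dfA[['Date', 'Item', 'Balance']],
    on=['Date', 'Item', 'Balance'],
    how='left',
    indicator=True,
)
only_in_B = only_in_B[only_in_B['_merge'] == 'left_only'].drop(columns='_merge')

# Append only the genuinely new rows to A.
newdf = pd.concat([dfA, only_in_B], ignore_index=True)

Either way, assign the result back (e.g. dfA = newdf) if you want A to carry the new rows forward.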