I have a huge data file that lists expense entries from our financial system. The fields I have are matter (a unique identifier for a specific job), date, and amount. Often, mistakes are made and the expenses are 'cancelled' by entering an offsetting cost to the same matter on the same date. Complicating the issue is that, occasionally, one of those cancelled entries is re-entered. This results in 3 rows (the original, the cancellation, and the re-entry) for one transaction. I'm trying to flag which rows are cancelling other rows and which rows have been cancelled.
In the code below, I wrote a simple function that goes through each row of my dataframe and compares the amount to the other amounts for the same matter and date. It flags the row if there is another amount that offsets the amount being checked.
import pandas as pd
data = [
[1,'1/2/2022',10],
[1,'1/2/2022',15],
[1,'1/2/2022',-10],
[2,'1/4/2022',12],
[2,'1/5/2022', 5],
[2,'1/5/2022',-5],
[2,'1/5/2022', 5]
]
df = pd.DataFrame(data, columns=['matter','date','amount'])
def rev_check(matter, date, WorkAmt, df):
    # All rows for the same matter and date
    funcDF = df.loc[(df['matter'] == matter) & (df['date'] == date)]
    listCheck = funcDF['amount'].tolist()
    # Flag this row if any amount on that matter/date offsets it
    if WorkAmt * -1 in listCheck:
        return 'yes'

df['reversal'] = df.apply(lambda row: rev_check(row.matter, row.date, row.amount, df), axis=1)
print(df)
resulting in the following output:
matter date amount reversal
0 1 1/2/2022 10 yes
1 1 1/2/2022 15 None
2 1 1/2/2022 -10 yes
3 2 1/4/2022 12 None
4 2 1/5/2022 5 yes
5 2 1/5/2022 -5 yes
6 2 1/5/2022 5 yes
Where I'm stuck is on the last row (index 6). That row should not be flagged, since the entry in row 4 was already reversed by the entry in row 5.
I would love any thoughts on how to better approach this so that I'm not flagging those re-entered amounts.
CodePudding user response:
I confess I have a bit of trouble following the @mozway solution. Here is an alternative I find a bit easier to wrap my head around.
# Steal clever technique of uniquely labelling
# repetitions of ['matter','date','amount']
df['occurrence'] = df.groupby(['matter','date','amount']).cumcount()
# Take absolute values so an amount and its cancelling
# negative compare as equal in the next step
df['abs_amount'] = df['amount'].abs()
# Flag rows whose (matter, date, occurrence, abs_amount)
# appears more than once, i.e. a matched +x / -x pair
df['reversal'] = df.duplicated(['matter', 'date', 'occurrence', 'abs_amount'],
                               keep=False)
df is now...
matter date amount occurrence abs_amount reversal
0 1 1/2/2022 10 0 10 True
1 1 1/2/2022 15 0 15 False
2 1 1/2/2022 -10 0 10 True
3 2 1/4/2022 12 0 12 False
4 2 1/5/2022 5 0 5 True
5 2 1/5/2022 -5 0 5 True
6 2 1/5/2022 5 1 5 False
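If the cumcount step feels like magic, here is a tiny standalone sketch (reusing the question's tricky matter-2 rows and the column names above) showing how repeats within each (matter, date, amount) group get numbered 0, 1, 2, ...:

```python
import pandas as pd

# Just the three matter-2 / 1/5/2022 rows from the question
df = pd.DataFrame(
    [[2, '1/5/2022', 5], [2, '1/5/2022', -5], [2, '1/5/2022', 5]],
    columns=['matter', 'date', 'amount'],
)
df['occurrence'] = df.groupby(['matter', 'date', 'amount']).cumcount()
# The first +5 and the -5 both get occurrence 0, so they pair up;
# the re-entered +5 gets occurrence 1 and finds no partner.
print(df['occurrence'].tolist())  # → [0, 0, 1]
```

That numbering is exactly what keeps the re-entered +5 from being flagged by `duplicated`.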
CodePudding user response:
Here is an approach by reshaping the dataframe:
import numpy as np

df2 = (df
       .assign(sign=np.sign(df['amount']),  # sign of the amount
               abs=df['amount'].abs(),      # absolute value
               # occurrence of the value
               n=df.groupby(['matter', 'date', 'amount']).cumcount(),
               )
       .reset_index()
       .pivot(index=['matter', 'date', 'n', 'abs'], columns='sign', values='index')
      )
reversal = df2.where(df2.notna().all(axis=1)).stack()
df['reversal'] = df.index.isin(reversal)
Output:
matter date amount reversal
0 1 1/2/2022 10 True
1 1 1/2/2022 15 False
2 1 1/2/2022 -10 True
3 2 1/4/2022 12 False
4 2 1/5/2022 5 True
5 2 1/5/2022 -5 True
6 2 1/5/2022 5 False
Intermediate df2:
sign -1 1
matter date n abs
1 1/2/2022 0 10 2.0 0.0 # both values, this is a reversal
15 NaN 1.0
2 1/4/2022 0 12 NaN 3.0
1/5/2022 0 5 5.0 4.0 # both values, this is a reversal
1 5 NaN 6.0 # this one is alone, not a reversal
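In case the masking/stack step is opaque: only rows of df2 where both sign columns are filled survive, and stacking turns the remaining cells back into the original row indices. A minimal sketch of that idea on a made-up two-row frame mirroring df2's shape (boolean row selection stands in for `where`, and an explicit `dropna()` is added since newer pandas versions of `stack()` keep NaNs):

```python
import numpy as np
import pandas as pd

# Row 0: both signs present (original indices 2 and 0 -> a reversal pair)
# Row 1: only the positive side (index 6 -> unpaired re-entry)
d = pd.DataFrame({-1: [2.0, np.nan], 1: [0.0, 6.0]})
mask = d.notna().all(axis=1)        # True only where both signs exist
kept = d[mask].stack().dropna()     # surviving cells are the paired row indices
print(sorted(kept.tolist()))        # → [0.0, 2.0]
```

Indices 0 and 2 are exactly the rows flagged True, while index 6 is dropped.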
variant
Same logic but using a double groupby:
df['reversal'] = (df
                  .assign(sign=np.sign(df['amount']),
                          n=df.groupby(['matter', 'date', 'amount']).cumcount(),
                          abs=df['amount'].abs(),
                          )
                  # if we have 2 values per group, this is a reversal
                  .groupby(['matter', 'date', 'abs', 'n']).transform('size').eq(2)
                 )
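As a quick standalone sanity check of the double-groupby idea, here is my sketch on just the tricky matter-2 rows from the question (I drop the unused `sign` column, since the (matter, date, abs, n) keys alone separate matched pairs from re-entries):

```python
import pandas as pd

# +5 and -5 share group key (2, date, 5, n=0) -> size 2 -> reversal;
# the re-entered +5 gets n=1 and sits alone in its group -> size 1
df = pd.DataFrame(
    [[2, '1/5/2022', 5], [2, '1/5/2022', -5], [2, '1/5/2022', 5]],
    columns=['matter', 'date', 'amount'],
)
flag = (df
        .assign(n=df.groupby(['matter', 'date', 'amount']).cumcount(),
                abs=df['amount'].abs())
        .groupby(['matter', 'date', 'abs', 'n']).transform('size').eq(2)
       )
print(flag.tolist())  # → [True, True, False]
```

Only the first +5/-5 pair is marked as a reversal, which is the behavior the question asked for.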