I have a huge data file that lists expense entries from our financial system. The fields I have are matter (a unique identifier for a specific job), date, and amount. Often, mistakes are made and the expenses are 'cancelled' by entering an offsetting cost to the same matter on the same date. Complicating the issue is that, occasionally, one of those cancelled entries is re-entered. This results in 3 rows (the original, the cancellation, and the re-entry) for one transaction. I'm trying to flag which rows are cancelling other rows and which rows have been cancelled.
In the code below, I wrote a simple function that goes through each row of my dataframe and compares the amount to the other amounts for the same matter and date. It flags the row if there is another amount that offsets the amount being checked.
import pandas as pd
data = [
[1,'1/2/2022',10],
[1,'1/2/2022',15],
[1,'1/2/2022',-10],
[2,'1/4/2022',12],
[2,'1/5/2022', 5],
[2,'1/5/2022',-5],
[2,'1/5/2022', 5]
]
df = pd.DataFrame(data, columns=['matter','date','amount'])
def rev_check(matter, date, WorkAmt, df):
    # All rows for the same matter and date
    funcDF = df.loc[(df['matter'] == matter) & (df['date'] == date)]
    listCheck = funcDF['amount'].tolist()
    # Flag this row if any amount on that matter/date offsets it
    if WorkAmt * -1 in listCheck:
        return 'yes'

df['reversal'] = df.apply(lambda row: rev_check(row.matter, row.date, row.amount, df), axis=1)
print(df)
resulting in the following output:
matter date amount reversal
0 1 1/2/2022 10 yes
1 1 1/2/2022 15 None
2 1 1/2/2022 -10 yes
3 2 1/4/2022 12 None
4 2 1/5/2022 5 yes
5 2 1/5/2022 -5 yes
6 2 1/5/2022 5 yes
Where I'm stuck is on the last row (index 6). That row should not be flagged, since the entry in row 4 was already reversed by the entry in row 5.
I would love any thoughts on how to better approach this so that I'm not flagging those re-entered amounts.
CodePudding user response:
I confess I have a bit of trouble following the @mozway solution. Here is an alternative I find a bit easier to wrap my head around.
# Steal clever technique of uniquely labelling
# repetitions of ['matter','date','amount']
df['occurrence'] = df.groupby(['matter','date','amount']).cumcount()
# Take absolute values so an amount and its cancelling
# negative compare as equal in the next step
df['abs_amount'] = df['amount'].abs()
# Flag rows whose (matter, date, occurrence, abs_amount)
# appears more than once, i.e. a matched +x / -x pair
df['reversal'] = df.duplicated(['matter', 'date', 'occurrence', 'abs_amount'],
                               keep=False)
df is now...
matter date amount occurrence abs_amount reversal
0 1 1/2/2022 10 0 10 True
1 1 1/2/2022 15 0 15 False
2 1 1/2/2022 -10 0 10 True
3 2 1/4/2022 12 0 12 False
4 2 1/5/2022 5 0 5 True
5 2 1/5/2022 -5 0 5 True
6 2 1/5/2022 5 1 5 False
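If the cumcount step feels like magic, here is a tiny standalone sketch (reusing the question's tricky matter-2 rows and the column names above) showing how repeats within each (matter, date, amount) group get numbered 0, 1, 2, ...:

```python
import pandas as pd

# Just the three matter-2 / 1/5/2022 rows from the question
df = pd.DataFrame(
    [[2, '1/5/2022', 5], [2, '1/5/2022', -5], [2, '1/5/2022', 5]],
    columns=['matter', 'date', 'amount'],
)
df['occurrence'] = df.groupby(['matter', 'date', 'amount']).cumcount()
# The first +5 and the -5 both get occurrence 0, so they pair up;
# the re-entered +5 gets occurrence 1 and finds no partner.
print(df['occurrence'].tolist())  # → [0, 0, 1]
```

That numbering is exactly what keeps the re-entered +5 from being flagged by `duplicated`.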
CodePudding user response:
Here is an approach by reshaping the dataframe:
import numpy as np

df2 = (df
       .assign(sign=np.sign(df['amount']),  # sign of the amount
               abs=df['amount'].abs(),      # absolute value
               # occurrence of the value
               n=df.groupby(['matter', 'date', 'amount']).cumcount(),
               )
       .reset_index()
       .pivot(index=['matter', 'date', 'n', 'abs'], columns='sign', values='index')
      )
reversal = df2.where(df2.notna().all(axis=1)).stack()
df['reversal'] = df.index.isin(reversal)
Output:
matter date amount reversal
0 1 1/2/2022 10 True
1 1 1/2/2022 15 False
2 1 1/2/2022 -10 True
3 2 1/4/2022 12 False
4 2 1/5/2022 5 True
5 2 1/5/2022 -5 True
6 2 1/5/2022 5 False
Intermediate df2:
sign -1 1
matter date n abs
1 1/2/2022 0 10 2.0 0.0 # both values, this is a reversal
15 NaN 1.0
2 1/4/2022 0 12 NaN 3.0
1/5/2022 0 5 5.0 4.0 # both values, this is a reversal
1 5 NaN 6.0 # this one is alone, not a reversal
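In case the masking/stack step is opaque: only rows of df2 where both sign columns are filled survive, and stacking turns the remaining cells back into the original row indices. A minimal sketch of that idea on a made-up two-row frame mirroring df2's shape (boolean row selection stands in for `where`, and an explicit `dropna()` is added since newer pandas versions of `stack()` keep NaNs):

```python
import numpy as np
import pandas as pd

# Row 0: both signs present (original indices 2 and 0 -> a reversal pair)
# Row 1: only the positive side (index 6 -> unpaired re-entry)
d = pd.DataFrame({-1: [2.0, np.nan], 1: [0.0, 6.0]})
mask = d.notna().all(axis=1)        # True only where both signs exist
kept = d[mask].stack().dropna()     # surviving cells are the paired row indices
print(sorted(kept.tolist()))        # → [0.0, 2.0]
```

Indices 0 and 2 are exactly the rows flagged True, while index 6 is dropped.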
variant
Same logic but using a double groupby:
df['reversal'] = (df
                  .assign(sign=np.sign(df['amount']),
                          n=df.groupby(['matter', 'date', 'amount']).cumcount(),
                          abs=df['amount'].abs(),
                          )
                  # if we have 2 values per group, this is a reversal
                  .groupby(['matter', 'date', 'abs', 'n']).transform('size').eq(2)
                 )
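As a quick standalone sanity check of the double-groupby idea, here is my sketch on just the tricky matter-2 rows from the question (I drop the unused `sign` column, since the (matter, date, abs, n) keys alone separate matched pairs from re-entries):

```python
import pandas as pd

# +5 and -5 share group key (2, date, 5, n=0) -> size 2 -> reversal;
# the re-entered +5 gets n=1 and sits alone in its group -> size 1
df = pd.DataFrame(
    [[2, '1/5/2022', 5], [2, '1/5/2022', -5], [2, '1/5/2022', 5]],
    columns=['matter', 'date', 'amount'],
)
flag = (df
        .assign(n=df.groupby(['matter', 'date', 'amount']).cumcount(),
                abs=df['amount'].abs())
        .groupby(['matter', 'date', 'abs', 'n']).transform('size').eq(2)
       )
print(flag.tolist())  # → [True, True, False]
```

Only the first +5/-5 pair is marked as a reversal, which is the behavior the question asked for.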