I'm trying to create a script which removes all rows which have a matching negative and positive unique ID totalling to nil.
Is there are function/ module that could easily and efficiently achieve this goal without having to go about the df.iterows route?
Original Dataset extract (full dataset has lots of different references and amounts)
reference | amount |
---|---|
5231 | 505 |
5231 | -505 |
5231 | 505 |
5231 | -505 |
5231 | -505 |
5231 | 505 |
5231 | 505 |
5231 | 505 |
I have created the ID column, however need to create a Unique_ID_Count column to identify the number of duplicates.
reference | amount | ID | Unique_ID_Count |
---|---|---|---|
5231 | 505 | 5231_505 | 5231_505_0 |
5231 | -505 | 5231_-505 | 5231_-505_0 |
5231 | 505 | 5231_505 | 5231_505_1 |
5231 | -505 | 5231_-505 | 5231_-505_1 |
5231 | -505 | 5231_-505 | 5231_-505_2 |
5231 | 505 | 5231_505 | 5231_505_2 |
5231 | 505 | 5231_505 | 5231_505_3 |
5231 | 505 | 5231_505 | 5231_505_4 |
Once I've identified my duplicates I need to remove all instances where there is a corresponding positive and negative duplicate with the same count. This removes all rows that net off to nil.
reference | amount | ID | Unique_ID_Count |
---|---|---|---|
5231 | 505 | 5231_505 | 5231_505_3 |
5231 | 505 | 5231_505 | 5231_505_4 |
Any help would be much appreciated as I feel like I'm going in circles thinking about how to achieve this.
CodePudding user response:
Something this will help you out
cols = ['reference','amount']
df["ID"] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
df["Unique_ID_Count"] = df.groupby(["ID"]).cumcount() 1
df["Unique_ID_Count"] = df["key"] '_' df["Unique_ID_Count"]
CodePudding user response:
What if you groupby
your 'reference' column and sum
your 'amount' column. This will filter out naturally the rows that are nil:
res = df.groupby('reference',as_index=False)['amount'].sum()
print(res)
reference amount
0 5231 505
Then you can drop from your main data frame res
's index rows:
other_rows = df.loc[~df.index.isin(res.index.tolist())]
print(other_rows)
reference amount
1 5231 -505
2 5231 505
3 5231 -505
4 5231 -505
5 5231 505
6 5231 505