Home > Blockchain >  Python dataframe iterate and assign unique value for each duplicate
Python dataframe iterate and assign unique value for each duplicate

Time:02-17

I'm trying to create a script which removes all rows which have a matching negative and positive unique ID totalling to nil.

Is there are function/ module that could easily and efficiently achieve this goal without having to go about the df.iterows route?

Original Dataset extract (full dataset has lots of different references and amounts)

reference amount
5231 505
5231 -505
5231 505
5231 -505
5231 -505
5231 505
5231 505
5231 505

I have created the ID column, however need to create a Unique_ID_Count column to identify the number of duplicates.

reference amount ID Unique_ID_Count
5231 505 5231_505 5231_505_0
5231 -505 5231_-505 5231_-505_0
5231 505 5231_505 5231_505_1
5231 -505 5231_-505 5231_-505_1
5231 -505 5231_-505 5231_-505_2
5231 505 5231_505 5231_505_2
5231 505 5231_505 5231_505_3
5231 505 5231_505 5231_505_4

Once I've identified my duplicates I need to remove all instances where there is a corresponding positive and negative duplicate with the same count. This removes all rows that net off to nil.

reference amount ID Unique_ID_Count
5231 505 5231_505 5231_505_3
5231 505 5231_505 5231_505_4

Any help would be much appreciated as I feel like I'm going in circles thinking about how to achieve this.

CodePudding user response:

Something this will help you out

cols = ['reference','amount']
df["ID"] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)   
df["Unique_ID_Count"] = df.groupby(["ID"]).cumcount() 1
df["Unique_ID_Count"] = df["key"]   '_'   df["Unique_ID_Count"]

CodePudding user response:

What if you groupby your 'reference' column and sum your 'amount' column. This will filter out naturally the rows that are nil:

res = df.groupby('reference',as_index=False)['amount'].sum()

print(res)
   reference  amount
0       5231     505

Then you can drop from your main data frame res's index rows:

other_rows = df.loc[~df.index.isin(res.index.tolist())]

print(other_rows)

  reference  amount
1       5231    -505
2       5231     505
3       5231    -505
4       5231    -505
5       5231     505
6       5231     505
  • Related