Remove first values repeated in an array... Python, Numpy, Pandas, Arrays

Time:11-09

So I have this NumPy array `final`, and I want to reduce it: whenever a value is repeated, delete its first occurrence and keep the second, third, and later occurrences.

import hmac
import base64

import pandas as pd
import numpy as np


key = bytes("800070FF00FF08012", "utf-8")
collision = []
for x in range(1, 1000001):
    msg = bytes(f"{x}", "utf-8")
    digest = hmac.new(key, msg, "sha256").digest()
    code = base64.b64encode(digest).decode("utf-8")[:6]
    key = digest  # key.replace(key, digest) is just a reassignment
    collision.append(code)

df = pd.DataFrame(collision)
df = df[df.duplicated(keep=False)]        # keep only codes that occur more than once
df_index = df.index.to_numpy()
df = df.values.flatten()
final = np.stack((df_index, df), axis=1)  # [index, code] pairs

Results of the variable "final":

I HAVE:
[[14093 'JRp1kX']
 [43985 'KGlW7X']
 [59212 'pU97Tr']
 [90668 'ecTjTB']
 [140615 'JRp1kX']
 [218480 '25gtjT']
 [344174 'dtXg6E']
 [380467 'DdHQ3M']
 [395699 'vnFw/c']
 [503504 'dtXg6E']
 [531073 'KGlW7X']
 [633091 'ecTjTB']
 [671091 'vnFw/c']
 [672111 '25gtjT']
 [785568 'pU97Tr']
 [991540 'DdHQ3M']
 [991548 'JRp1kX']]


And I WANT TO HAVE:
 [[140615 'JRp1kX']
 [503504 'dtXg6E']
 [531073 'KGlW7X']
 [633091 'ecTjTB']
 [671091 'vnFw/c']
 [672111 '25gtjT']
 [785568 'pU97Tr']
 [991540 'DdHQ3M']
 [991548 'JRp1kX']]

That is, eliminating the first occurrence of each repeated value in the array. Does anyone have code that would work for my case?

In simpler terms: given the list [1, 2, 3, 4, 5, 1, 3, 5, 5], I would like to get [2, 4, 1, 3, 5, 5].
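For reference, the rule can be sketched on a plain list with `collections.Counter` (a sketch, not taken from any of the answers; the function name is made up):

```python
from collections import Counter

def drop_first_duplicates(values):
    """Remove the first occurrence of every value that appears
    more than once, keeping all later occurrences."""
    counts = Counter(values)
    seen = set()
    result = []
    for v in values:
        if counts[v] > 1 and v not in seen:
            seen.add(v)  # skip only the first occurrence of a repeated value
            continue
        result.append(v)
    return result

print(drop_first_duplicates([1, 2, 3, 4, 5, 1, 3, 5, 5]))  # [2, 4, 1, 3, 5, 5]
```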

CodePudding user response:

df = pd.DataFrame([1, 2, 3, 4, 5, 1, 3, 5, 5])

# keep the unique rows
unique_mask = ~df.duplicated(keep=False)

# keep the repeated rows (skipping the first for each non-unique)
repeated_mask = df.duplicated()

df.loc[unique_mask | repeated_mask]

   0
1  2
3  4
5  1
6  3
7  5
8  5
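The two masks can also be combined into a single expression on the same toy frame (a sketch):

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5, 1, 3, 5, 5])

# unique rows OR repeated rows past their first occurrence, in one mask
kept = df[~df.duplicated(keep=False) | df.duplicated()]

print(kept[0].tolist())  # [2, 4, 1, 3, 5, 5]
```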

CodePudding user response:

Maybe you could use a for loop.

to_remove = list()

for i in range(len(your_list)):
    # mark index i only if it is the FIRST occurrence of a value
    # that appears again later
    if your_list[i] in your_list[i + 1:] and your_list[i] not in your_list[:i]:
        to_remove.append(i)

removed_count = 0
for i in to_remove:
    del your_list[i - removed_count]
    removed_count += 1

You cannot delete inside the first loop, because `i` would move on to the next position and the shifted elements would be skipped every time you delete one.

`[i - removed_count]` is needed because every time you delete at a lower index, all higher indexes shift down by one.

I think it could be written more efficiently, but this should work, perhaps with small changes.
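A self-contained version of this deferred-deletion idea (a sketch; it marks only the first occurrence of each repeated value and deletes from the end so earlier indices stay valid):

```python
your_list = [1, 2, 3, 4, 5, 1, 3, 5, 5]

# Mark the first occurrence of every value that also appears later.
to_remove = [i for i in range(len(your_list))
             if your_list[i] in your_list[i + 1:]
             and your_list[i] not in your_list[:i]]

# Delete from the end so earlier indices stay valid.
for i in reversed(to_remove):
    del your_list[i]

print(your_list)  # [2, 4, 1, 3, 5, 5]
```

Deleting in reverse order avoids the `removed_count` bookkeeping entirely.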

CodePudding user response:

After you generate df, add the following lines:

df=pd.DataFrame(collision)
# ... your code ends here
removed_already = []
for idx in df[df.duplicated(keep=False)].index:
    if df.loc[idx, 0] not in removed_already:
        removed_already.append(df.loc[idx, 0])
        df.drop(index=idx, inplace=True)
# your code continues
df_index=df.index.to_numpy()
df=df.values.flatten()
final=np.stack((df_index,df),axis=1)
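A quick check of this loop on the 1-d toy example (a sketch, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5, 1, 3, 5, 5])

removed_already = []
for idx in df[df.duplicated(keep=False)].index:
    if df.loc[idx, 0] not in removed_already:
        removed_already.append(df.loc[idx, 0])  # remember values already dropped once
        df.drop(index=idx, inplace=True)

print(df[0].tolist())  # [2, 4, 1, 3, 5, 5]
```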

CodePudding user response:

`final` is a NumPy array, so you can use `np.unique` on the second column to get the index of each value's first occurrence along with its count; the count lets you avoid deleting values that appear only once.

_, idx, counts = np.unique(final[:, 1], return_index=True, return_counts=True)
idx = idx[counts > 1]
final = np.delete(final, idx, axis=0)

This works on the 2-d ndarray; for your second, 1-d example, use

_, idx, counts = np.unique(final, return_index=True, return_counts=True)
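followed by the same `idx`/`np.delete` steps. Put together for the 1-d example (a runnable sketch):

```python
import numpy as np

final = np.array([1, 2, 3, 4, 5, 1, 3, 5, 5])

# First-occurrence index and count of every distinct value
_, idx, counts = np.unique(final, return_index=True, return_counts=True)
idx = idx[counts > 1]        # first occurrences of repeated values only
final = np.delete(final, idx)

print(final.tolist())  # [2, 4, 1, 3, 5, 5]
```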