I am stuck on a bit of code for my dissertation, which is due in a few days, so help would be really appreciated.
I have a NumPy array that looks like this:
[['2017-01-30T06:00:00.000000000', 48.67, 55.04],
['2017-01-30T06:00:00.000000000', 49.55249735, 55.04],
['2017-01-30T07:00:00.000000000', 48.67262295, 55.04],
['2017-01-30T08:00:00.000000000', 48.67262295, 55.04],
['2017-01-30T08:00:00.000000000', 48.55544345, 55.04],
['2017-01-30T09:00:00.000000000', 48.67262295, 55.04]...
I am trying to check the timestamp of each row, and if any timestamp appears more than once, I want to remove all rows which have that timestamp. So the resulting array would look like this:
[['2017-01-30T07:00:00.000000000', 48.67262295, 55.04],
['2017-01-30T09:00:00.000000000', 48.67262295, 55.04]...
i.e. the rows at 6am and 8am are deleted because those timestamps appear more than once.
I have tried using np.unique but cannot get it to work. I have also tried looping through the array, checking whether each timestamp equals the previous one, and deleting both rows; however, that does not work if there are three or more instances of the same timestamp.
I am really short on time, so any help would be greatly appreciated.
The code I have tried so far is this:
def del_duplicate_rows(data):
    date_times = []
    for d in data:
        date_times.append(d)
        if len(date_times) > 1:
            if date_times[d] == date_times[d-1]:
                data = np.delete(data, d, axis=0)
                data = np.delete(data, d-1, axis=0)
    return data
CodePudding user response:
No need for pandas or numpy. All you need is Python's built-in itertools:
import itertools

# Group consecutive rows that share the same timestamp (the first element).
groups = [list(group) for key, group in itertools.groupby(data, lambda x: x[0])]
# Keep only the single-row groups, then flatten them back into one list of rows.
result = list(itertools.chain.from_iterable(group for group in groups if len(group) == 1))
print(result)
With the given data, this outputs:
[['2017-01-30T07:00:00.000000000', 48.67262295, 55.04],
['2017-01-30T09:00:00.000000000', 48.67262295, 55.04]]
Explanation:
Since the timestamps are sorted, we can use .groupby() to group entries with the same timestamp together into one list. Then, we iterate over these groups, retaining those with only one entry. Finally, we flatten the remaining one-entry group lists using chain.from_iterable() to obtain our desired result.
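One caveat: groupby() only merges consecutive equal keys, so this relies on the rows already being in timestamp order. If they might not be, here is a minimal sketch that sorts first (reusing the data list from the question):

import itertools

# groupby() only groups *adjacent* equal keys, so sort by timestamp first.
data_sorted = sorted(data, key=lambda row: row[0])
groups = [list(group) for _, group in itertools.groupby(data_sorted, lambda row: row[0])]
result = [row for group in groups if len(group) == 1 for row in group]
print(result)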
CodePudding user response:
If you happen to have the data in a pandas DataFrame already, you can do:
df.drop_duplicates(subset='date',  # or whatever the first column is called
                   keep=False)
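For a runnable end-to-end sketch (assuming the array from the question; the column names here are just placeholders):

import numpy as np
import pandas as pd

a = np.array([['2017-01-30T06:00:00.000000000', 48.67, 55.04],
              ['2017-01-30T06:00:00.000000000', 49.55249735, 55.04],
              ['2017-01-30T07:00:00.000000000', 48.67262295, 55.04],
              ['2017-01-30T08:00:00.000000000', 48.67262295, 55.04],
              ['2017-01-30T08:00:00.000000000', 48.55544345, 55.04],
              ['2017-01-30T09:00:00.000000000', 48.67262295, 55.04]], dtype='object')

# 'date', 'x', 'y' are placeholder column names.
df = pd.DataFrame(a, columns=['date', 'x', 'y'])
# keep=False drops every row whose 'date' value is duplicated.
deduped = df.drop_duplicates(subset='date', keep=False)
print(deduped.to_numpy())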
CodePudding user response:
To find the unique values and count them, you can use np.unique(..., return_index=True, return_counts=True). You can then find the values whose count == 1, take their indices, and select those rows from the original array, like below:
import numpy as np

a = np.array([['2017-01-30T06:00:00.000000000', 48.67, 55.04],
              ['2017-01-30T06:00:00.000000000', 49.55249735, 55.04],
              ['2017-01-30T07:00:00.000000000', 48.67262295, 55.04],
              ['2017-01-30T08:00:00.000000000', 48.67262295, 55.04],
              ['2017-01-30T08:00:00.000000000', 48.55544345, 55.04],
              ['2017-01-30T09:00:00.000000000', 48.67262295, 55.04]], dtype='object')

# Unique timestamps, the index of each one's first occurrence, and its count.
unq, idx, cnt = np.unique(a[:, 0], return_index=True, return_counts=True)
# Keep only the rows whose timestamp occurs exactly once.
out = a[idx[cnt == 1]]
print(out)
Output:
[['2017-01-30T07:00:00.000000000' 48.67262295 55.04]
['2017-01-30T09:00:00.000000000' 48.67262295 55.04]]
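Note that np.unique returns its results in sorted order, so a[idx[cnt == 1]] yields the surviving rows in timestamp-sorted order (which happens to match the original order here, since the data is chronological). If you need to preserve the original row order in general, one option is a boolean mask; a small sketch reusing unq and cnt from above:

# Keep rows whose timestamp occurs exactly once, in their original order.
mask = np.isin(a[:, 0], unq[cnt == 1])
out = a[mask]
print(out)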
CodePudding user response:
Here is another option using numpy.unique. Note that with return_index=True it keeps the first row for each timestamp, so duplicated timestamps are reduced to one row rather than removed entirely:
import numpy

data = numpy.array([
    ['2017-01-30T06:00:00.000000000', 48.67, 55.04],
    ['2017-01-30T06:00:00.000000000', 49.55249735, 55.04],
    ['2017-01-30T07:00:00.000000000', 48.67262295, 55.04],
    ['2017-01-30T08:00:00.000000000', 48.67262295, 55.04],
    ['2017-01-30T08:00:00.000000000', 48.55544345, 55.04],
    ['2017-01-30T09:00:00.000000000', 48.67262295, 55.04]
], dtype='object')
# Keep the first row for each unique timestamp (column 0).
first_idx = numpy.unique(data[:, 0], return_index=True)[1]
result = data[first_idx].tolist()
Result:
[
    ['2017-01-30T06:00:00.000000000', 48.67, 55.04],
    ['2017-01-30T07:00:00.000000000', 48.67262295, 55.04],
    ['2017-01-30T08:00:00.000000000', 48.67262295, 55.04],
    ['2017-01-30T09:00:00.000000000', 48.67262295, 55.04]
]