I have a data.table
indicating the position of an animal which looks something like this:
Date TimeStamp Transponder Units
1: 2021-08-15 2021-08-15 14:11:13 DA2C614E M2
2: 2021-08-15 2021-08-15 14:11:14 DA2C614E M2
3: 2021-08-15 2021-08-15 14:11:14 DA2C614E M2
4: 2021-08-15 2021-08-15 14:11:15 DA2C614E M2
5: 2021-08-15 2021-08-15 14:11:15 DA2C614E M2
6: 2021-08-15 2021-08-15 14:11:16 DA2C614E M2
7: 2021-08-15 2021-08-15 14:12:40 DA2C614E HM2
8: 2021-08-15 2021-08-15 14:12:40 DA2C614E HM2
9: 2021-08-15 2021-08-15 14:12:49 DA2C614E H2
10: 2021-08-15 2021-08-15 14:18:02 DA2C614E H1
11: 2021-08-15 2021-08-15 14:18:04 DA2C614E H1
12: 2021-08-15 2021-08-15 14:19:19 DA2C614E H1
13: 2021-08-15 2021-08-15 14:25:29 DA2C614E HM2
The Transponder indicates the individual and the Units column its position. I would like to create a sparser version of this data set by deleting the duplicated rows, but only those that occur in succession, since successive duplicates simply mean there was no movement. So my desired end result would be:
Date TimeStamp Transponder Units
1: 2021-08-15 2021-08-15 14:11:13 DA2C614E M2
2: 2021-08-15 2021-08-15 14:12:40 DA2C614E HM2
3: 2021-08-15 2021-08-15 14:12:49 DA2C614E H2
4: 2021-08-15 2021-08-15 14:18:02 DA2C614E H1
5: 2021-08-15 2021-08-15 14:25:29 DA2C614E HM2
I have tried working with duplicated(), but the problem is that this function deletes all duplicates, like for example the later repeated HM2 entry. (I ran duplicated() on a subset of Date, Transponder and Units):
> sample[!duplicated(sample[, c(1,3,4)]),]
Date TimeStamp Transponder Units
1: 2021-08-15 2021-08-15 14:11:13 DA2C614E M2
2: 2021-08-15 2021-08-15 14:12:40 DA2C614E HM2
3: 2021-08-15 2021-08-15 14:12:49 DA2C614E H2
4: 2021-08-15 2021-08-15 14:18:02 DA2C614E H1
Any ideas on how to solve this "elegantly", i.e. without having to loop through this?
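In case it helps, the table above can be rebuilt roughly like this (the column types are my assumption, and it is named sample to match the call above):
library(data.table)

sample <- data.table(
  Date        = as.IDate("2021-08-15"),
  TimeStamp   = as.POSIXct(c("2021-08-15 14:11:13", "2021-08-15 14:11:14",
                             "2021-08-15 14:11:14", "2021-08-15 14:11:15",
                             "2021-08-15 14:11:15", "2021-08-15 14:11:16",
                             "2021-08-15 14:12:40", "2021-08-15 14:12:40",
                             "2021-08-15 14:12:49", "2021-08-15 14:18:02",
                             "2021-08-15 14:18:04", "2021-08-15 14:19:19",
                             "2021-08-15 14:25:29")),
  Transponder = "DA2C614E",
  Units       = c("M2", "M2", "M2", "M2", "M2", "M2", "HM2", "HM2",
                  "H2", "H1", "H1", "H1", "HM2")
)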
CodePudding user response:
Using rleid from data.table we create a dummy grouping variable, and with distinct from dplyr we remove the duplicates. In your data you may want to include Transponder in the rleid call, if it does vary in your real data (see the sketch at the end of this answer).
library(tidyverse)
library(data.table)

df %>%
  mutate(dummy = rleid(Units)) %>%
  distinct(dummy, .keep_all = TRUE) %>%
  select(-dummy)
Date TimeStamp Transponder Units
1 2021-08-15 2021-08-15 14:11:13 DA2C614E M2
2 2021-08-15 2021-08-15 14:12:40 DA2C614E HM2
3 2021-08-15 2021-08-15 14:12:49 DA2C614E H2
4 2021-08-15 2021-08-15 14:18:02 DA2C614E H1
5 2021-08-15 2021-08-15 14:25:29 DA2C614E HM2
Using just data.table and no temporary variable, you could (based on the comments) do the following: dt[!duplicated(rleid(Units)), ]
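If Transponder does vary in the real data, a sketch of the same one-liner run per animal would simply pass both columns to rleid, so that a new run starts whenever either the animal or the unit changes:
# Sketch (assumes dt is the full data.table): a run is "same animal, same unit"
dt[!duplicated(rleid(Transponder, Units)), ]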
CodePudding user response:
See Filtering out duplicated/non-unique rows in data.table. I think that if your data is time-indexed, this will work for your case:
unique(sample, by = "Date")
Example:
dt <- data.table(V1 = LETTERS[c(1,1,1,2,2,2)], V2 = c(1,1,1,2,2,2))
V1 V2
1: A 1
2: A 1
3: A 1
4: B 2
5: B 2
6: B 2
unique(dt, by = "V1")
V1 V2
1: A 1
2: B 2
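Applied to the columns from the question, the equivalent call would be the sketch below; note that, as with duplicated() in the question, this keeps only the first occurrence of each combination overall rather than per run:
# Sketch: de-duplicate on the question's three key columns
unique(sample, by = c("Date", "Transponder", "Units"))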
CodePudding user response:
I wanted to have a go at this using data.table, as you stated you are using it, presumably because your data is large, so this should be faster than the dplyr approach suggested above. I do not have much experience with data.table, but this seems to work:
# Cols to subset
cols <- c("Date", "Transponder", "Units")
lagcols <- paste0(cols, "_lag")
# Create lag
sample[, (lagcols) := shift(.SD, n = 1, fill = NA, type = "lag"), .SDcols = cols]
# Create boolean mask if row == previous row in selected columns
sample[, equals_previous := Date == Date_lag & Transponder == Transponder_lag & Units == Units_lag]
# Delete lag columns
sample[, (lagcols) := NULL]
# Subset only rows where they are not equal to the previous row
sample[(!equals_previous)]
There may be more elegant ways to do this in data.table, but if your data is of a significant size this should at least be faster than converting to a data.frame and using dplyr.
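For instance, one more compact variant of the same shift()-based idea (a sketch, not tested on real data) compares each row to the previous one inline instead of creating and deleting lag columns:
# Sketch: keep a row whenever any key column differs from the previous row;
# the first row compares against NA, so it is kept explicitly via is.na()
cols <- c("Date", "Transponder", "Units")
changed <- sample[, Reduce(`|`, lapply(.SD, function(x) x != shift(x))), .SDcols = cols]
sample[is.na(changed) | changed]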