I have a data.table
indicating the position of an animal which looks something like this:
Date TimeStamp Transponder Units
1: 2021-08-15 2021-08-15 14:11:13 DA2C614E M2
2: 2021-08-15 2021-08-15 14:11:14 DA2C614E M2
3: 2021-08-15 2021-08-15 14:11:14 DA2C614E M2
4: 2021-08-15 2021-08-15 14:11:15 DA2C614E M2
5: 2021-08-15 2021-08-15 14:11:15 DA2C614E M2
6: 2021-08-15 2021-08-15 14:11:16 DA2C614E M2
7: 2021-08-15 2021-08-15 14:12:40 DA2C614E HM2
8: 2021-08-15 2021-08-15 14:12:40 DA2C614E HM2
9: 2021-08-15 2021-08-15 14:12:49 DA2C614E H2
10: 2021-08-15 2021-08-15 14:18:02 DA2C614E H1
11: 2021-08-15 2021-08-15 14:18:04 DA2C614E H1
12: 2021-08-15 2021-08-15 14:19:19 DA2C614E H1
13: 2021-08-15 2021-08-15 14:25:29 DA2C614E HM2
The Transponder indicates the individual and the Units column its position. I would like to create a sparser version of this data set by deleting the duplicated rows, but only those that occur in succession, since successive duplicates simply mean there was no movement. So my desired end result would be:
Date TimeStamp Transponder Units
1: 2021-08-15 2021-08-15 14:11:13 DA2C614E M2
2: 2021-08-15 2021-08-15 14:12:40 DA2C614E HM2
3: 2021-08-15 2021-08-15 14:12:49 DA2C614E H2
4: 2021-08-15 2021-08-15 14:18:02 DA2C614E H1
5: 2021-08-15 2021-08-15 14:25:29 DA2C614E HM2
I have tried working with duplicated(), but the problem is that this function deletes all duplicates, like for example the later repeated HM2 entry. (I ran duplicated() on a subset of Date, Transponder and Units):
> sample[!duplicated(sample[, c(1,3,4)]),]
Date TimeStamp Transponder Units
1: 2021-08-15 2021-08-15 14:11:13 DA2C614E M2
2: 2021-08-15 2021-08-15 14:12:40 DA2C614E HM2
3: 2021-08-15 2021-08-15 14:12:49 DA2C614E H2
4: 2021-08-15 2021-08-15 14:18:02 DA2C614E H1
Any ideas on how to solve this "elegantly", i.e. without having to loop through this?
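In case it helps, the table above can be rebuilt roughly like this (the column types are my assumption, and it is named sample to match the call above):
library(data.table)

sample <- data.table(
  Date        = as.IDate("2021-08-15"),
  TimeStamp   = as.POSIXct(c("2021-08-15 14:11:13", "2021-08-15 14:11:14",
                             "2021-08-15 14:11:14", "2021-08-15 14:11:15",
                             "2021-08-15 14:11:15", "2021-08-15 14:11:16",
                             "2021-08-15 14:12:40", "2021-08-15 14:12:40",
                             "2021-08-15 14:12:49", "2021-08-15 14:18:02",
                             "2021-08-15 14:18:04", "2021-08-15 14:19:19",
                             "2021-08-15 14:25:29")),
  Transponder = "DA2C614E",
  Units       = c("M2", "M2", "M2", "M2", "M2", "M2", "HM2", "HM2",
                  "H2", "H1", "H1", "H1", "HM2")
)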
CodePudding user response:
Using rleid from data.table we create a dummy grouping variable, and with distinct from dplyr we remove the duplicates. In your data you may want to include Transponder in the rleid call, if it does vary in your real data (see the sketch at the end of this answer).
library(tidyverse)
library(data.table)

df %>%
  mutate(dummy = rleid(Units)) %>%
  distinct(dummy, .keep_all = TRUE) %>%
  select(-dummy)
Date TimeStamp Transponder Units
1 2021-08-15 2021-08-15 14:11:13 DA2C614E M2
2 2021-08-15 2021-08-15 14:12:40 DA2C614E HM2
3 2021-08-15 2021-08-15 14:12:49 DA2C614E H2
4 2021-08-15 2021-08-15 14:18:02 DA2C614E H1
5 2021-08-15 2021-08-15 14:25:29 DA2C614E HM2
Using just data.table and no temporary variable, you could (based on the comments) do the following: dt[!duplicated(rleid(Units)), ]
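If Transponder does vary in the real data, a sketch of the same one-liner run per animal would simply pass both columns to rleid, so that a new run starts whenever either the animal or the unit changes:
# Sketch (assumes dt is the full data.table): a run is "same animal, same unit"
dt[!duplicated(rleid(Transponder, Units)), ]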
CodePudding user response:
See Filtering out duplicated/non-unique rows in data.table. I think that if your data is time-indexed, this will work for your case:
unique(sample, by = "Date")
Example:
dt <- data.table(V1 = LETTERS[c(1,1,1,2,2,2)], V2 = c(1,1,1,2,2,2))
V1 V2
1: A 1
2: A 1
3: A 1
4: B 2
5: B 2
6: B 2
unique(dt, by = "V1")
V1 V2
1: A 1
2: B 2
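Applied to the columns from the question, the equivalent call would be the sketch below; note that, as with duplicated() in the question, this keeps only the first occurrence of each combination overall rather than per run:
# Sketch: de-duplicate on the question's three key columns
unique(sample, by = c("Date", "Transponder", "Units"))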
CodePudding user response:
I wanted to have a go at this using data.table, as you stated you are using it, presumably because your data is large, so this should be faster than the dplyr approach suggested above. I do not have much experience with data.table, but this seems to work:
# Cols to subset
cols <- c("Date", "Transponder", "Units")
lagcols <- paste0(cols, "_lag")
# Create lag
sample[, (lagcols) := shift(.SD, n = 1, fill = NA, type = "lag"), .SDcols = cols]
# Create boolean mask if row == previous row in selected columns
sample[, equals_previous := Date == Date_lag & Transponder == Transponder_lag & Units == Units_lag]
# Delete lag columns
sample[, (lagcols) := NULL]
# Subset only rows where they are not equal to the previous row
sample[(!equals_previous)]
There may be more elegant ways to do this in data.table, but if your data is of a significant size this should at least be faster than converting to a data.frame and using dplyr.
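For instance, one more compact variant of the same shift()-based idea (a sketch, not tested on real data) compares each row to the previous one inline instead of creating and deleting lag columns:
# Sketch: keep a row whenever any key column differs from the previous row;
# the first row compares against NA, so it is kept explicitly via is.na()
cols <- c("Date", "Transponder", "Units")
changed <- sample[, Reduce(`|`, lapply(.SD, function(x) x != shift(x))), .SDcols = cols]
sample[is.na(changed) | changed]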