Filtering (temporally similar) duplicates while keeping max value based on another column in R


I'm hoping to clean up a time series dataset so that only the maximum value of each event is retained. To start, I filtered the data so that only values above a certain threshold are kept, but there are still values that, while separated by only a millisecond or two, act as duplicates and will throw off later analysis.

My initial dataset has >100,000 rows and a few more columns but here is the top of a smaller version.

head(shortfilter)
  Time (Sec) ECG (Channel 6)
1   5534.023        1.371761
2   5534.024        1.232424
3   5534.152        1.414432
4   5534.153        1.359914
5   5534.272        1.639033
6   5534.396        1.476161

To explain: I don't have a concrete time window that two readings need to fall within to count as duplicates, but the rest of the data looks similar to this, with duplicates generally within 0.003 s of each other.

 Time (Sec) ECG (Channel 6)
1   5534.023        1.371761 #<-- Higher value (keep)
2   5534.024        1.232424
3   5534.152        1.414432 #<-- Higher value (keep)
4   5534.153        1.359914
5   5534.272        1.639033 #<-- Only value (keep)
6   5534.396        1.476161 #<-- Only value (keep)

Ideal:

 Time (Sec) ECG (Channel 6)
1   5534.023        1.371761
2   5534.152        1.414432
3   5534.272        1.639033
4   5534.396        1.476161
5   ____.___        _.______
6   ____.___        _.______

I'll add my initial attempt at some conditionals to do what I was hoping for, but keep in mind that I'm new to coding in general, so I know it isn't remotely correct; I just wanted to get some ideas out there. Hopefully it gives some additional info on what I'm trying to do. I'm positive the formatting and syntax are complete gibberish, but I'm sure many of you will understand what I was going for lol...

for (i in shortfilter$`Time (Sec)`){
  for (j in shortfilter$`ECG (Channel 6)`){
    if ((i+1)-i > 0.01 && j > j+1){
      remove(j+1)
    } else if ((i+1)-i > 0.01 && j < j+1){
      remove(j)
    }
  }
}

CodePudding user response:

Welcome to StackOverflow! My solution compares each time value to the next one, finds the difference, and adjusts the predicted grouping number based on those differences. As written it can handle up to five consecutive duplicated readings, but you can easily handle longer runs by repeating the last mutate() step.

library(tidyverse)

tibble::tribble(
   ~`Time`, ~`ECG`,
  5534.023, 1.371761,
  5534.024, 1.232424,
  5534.025, 1.27,
  5534.026, 1.28,
  5534.152, 1.414432,
  5534.153, 1.359914,
  5534.272, 1.639033,
  5534.396, 1.476161
  ) %>%
  arrange(Time) %>%
  mutate(sim_val = if_else(!is.na(lead(Time)), lead(Time) - Time, 5), # gap to the next reading; 5 is a sentinel for the last row
         Num = if_else(sim_val <= 0.03, row_number() + 1, as.numeric(row_number())), # tentative group number
         Num = if_else(sim_val <= 0.03 & Num < lead(Num), Num + 1, Num), # push the group number forward through a run of duplicates
         Num = if_else(sim_val <= 0.03 & Num < lead(Num), Num + 1, Num),
         Num = if_else(sim_val <= 0.03 & Num < lead(Num), Num + 1, Num)) %>%
  arrange(Num, desc(ECG)) %>%
  group_by(Num) %>%
  slice_head(n = 1) %>% # within each group, keep the row with the highest ECG
  ungroup() %>%
  select(Time, ECG)

Also, feel free to fine-tune the 0.03 threshold to suit your data. Let me know if this works!
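
For longer runs of near-duplicates, a cumulative-sum grouping is one possible alternative that avoids repeating the mutate() step. This is just a minimal sketch, not the exact method above: it assumes the column names `Time (Sec)` and `ECG (Channel 6)` from head(shortfilter) and reuses the 0.03 s window, so adjust both to your data.

library(dplyr)

threshold <- 0.03  # same window as above; tune to your data

shortfilter %>%
  arrange(`Time (Sec)`) %>%
  # start a new group whenever the gap to the previous reading exceeds the threshold
  mutate(grp = cumsum(c(TRUE, diff(`Time (Sec)`) > threshold))) %>%
  # within each group keep only the row with the highest ECG value
  group_by(grp) %>%
  slice_max(`ECG (Channel 6)`, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-grp)

Because the group id only increments when a gap exceeds the threshold, this handles runs of any length without extra steps.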
