Home > Software design >  How to compare within a group the first value to each subsequent value until a condition is met
How to compare within a group the first value to each subsequent value until a condition is met

Time:11-09

So I have a data frame in the general structure below:

dataframe:

rownum group date
1 a 2021-05-01
2 a 2021-05-02
3 a 2021-05-03
4 b 2021-05-15
5 b 2021-05-17
6 b 2021-05-30
7 b 2021-05-31
8 b 2021-05-31
9 c 2021-05-01
10 c 2021-05-05

What I would like to do is, WITHIN GROUP, compare the first row to the next row, until the difference between the dates meets some threshold, say 10 days. Then, once that row meets the threshold, I'd like to then test the next row against the subsequent row. It would look like this:

Result, using threshold of 10:

|rownum|group |date       |date diff|
|------|------|-----------|---|        
|1     | a    |2021-05-01 |NA|
|2     | a    |2021-05-02 |1|
|3     | a    |2021-05-03 |2|
|4     | b    |2021-05-15 |NA|
|5     | b    |2021-05-17 |2|
|6     | b    |2021-05-30 |15 (meets criteria, start from row 7 now)|
|7     | b    |2021-05-31 | NA|
|8     | b    |2021-05-31 | 0|
|9     | c    |2021-05-01 | NA|
|10    | c    |2021-05-05 |  4|

So to reiterate, its comparing the FIRST row of a group to subsequent rows until some threshold is met. Then the count starts over at the first rep after that within the group to subsequent rows within the group. The difference is recorded as datediff.

I've tried this but I dont know if sapply is the way to go:

dataframe %>% 
    group_by(group) %>%
    mutate(
        datediff = sapply(date, function(x) {
            all(difftime(dataframe$date,dplyr::lag(dataframe, n = 1, default = NA)))
                }
        )
    )

Also tried this, which I think is closer to what I want:

for (m in 1:length(dataframe)) {
    dataframe <- dataframe %>% 
        group_by(group) %>% 
        rowwise() %>% 
        mutate(datediff = difftime(dataframe$date,dplyr::lag(date, n = m, default = NA), units="days"))
    }

So far I havent been able to get the right rowwise comparison to even implement the thresholding bit.

CodePudding user response:

Here's a roundabout way of getting what you're looking for, where some of your NA are set to 0 using this solution:

library(tidyverse)

df %>% 
  group_by(group) %>% 
  mutate(date = as.Date(date),
         date_diff = date - first(date),
         flag = date_diff > 10) %>% 
  group_by(group, flag) %>% 
  mutate(temp_group = cur_group_id()) %>% 
  group_by(temp_group) %>% 
  mutate(date_diff = case_when(date_diff == first(date_diff) ~ date_diff,
                               date_diff != first(date_diff) & date_diff < 10 ~ date - first(date),
                               date_diff != first(date_diff) & date_diff > 10 ~ date - nth(date, 2))) %>% 
  ungroup() %>% 
  select(group, date, date_diff) 


# A tibble: 10 x 3
   group date       date_diff
   <chr> <date>     <drtn>   
 1 a     2021-05-01  0 days  
 2 a     2021-05-02  1 days  
 3 a     2021-05-03  2 days  
 4 b     2021-05-15  0 days  
 5 b     2021-05-17  2 days  
 6 b     2021-05-30 15 days  
 7 b     2021-05-31  0 days  
 8 b     2021-05-31  0 days  
 9 c     2021-05-01  0 days  
10 c     2021-05-05  4 days 

CodePudding user response:

base R

func <- function(x, threshold = 10) {
  r <- rle(c(0, diff(x)) > threshold)
  if ((len <- length(r$values)) > 1) {
    r$lengths[len] <- r$lengths[len] - 1L
    r$lengths[1] <- r$lengths[1]   1L
  }
  cumsum(inverse.rle(r))
}
dat$group2 <- ave(as.numeric(dat$date), dat$group, FUN = func)
dat$datediff <- ave(as.numeric(dat$date), dat[,c("group", "group2")], FUN = function(x) c(NA, (x - x[1])[-1]))
dat$group2 <- NULL
dat
#    rownum group       date datediff
# 1       1     a 2021-05-01       NA
# 2       2     a 2021-05-02        1
# 3       3     a 2021-05-03        2
# 4       4     b 2021-05-15       NA
# 5       5     b 2021-05-17        2
# 6       6     b 2021-05-30       15
# 7       7     b 2021-05-31       NA
# 8       8     b 2021-05-31        0
# 9       9     c 2021-05-01       NA
# 10     10     c 2021-05-05        4

dplyr

library(dplyr)
dat %>%
  group_by(group) %>%
  mutate(group2 = func(date)) %>%
  group_by(group, group2) %>%
  mutate(datediff = c(NA, (date - date[1])[-1])) %>%
  ungroup() %>%
  select(-group2)
# # A tibble: 10 x 4
#    rownum group date       datediff
#     <int> <chr> <date>        <dbl>
#  1      1 a     2021-05-01       NA
#  2      2 a     2021-05-02        1
#  3      3 a     2021-05-03        2
#  4      4 b     2021-05-15       NA
#  5      5 b     2021-05-17        2
#  6      6 b     2021-05-30       15
#  7      7 b     2021-05-31       NA
#  8      8 b     2021-05-31        0
#  9      9 c     2021-05-01       NA
# 10     10 c     2021-05-05        4

Data

dat <- structure(list(rownum = 1:10, group = c("a", "a", "a", "b", "b", "b", "b", "b", "c", "c"), date = structure(c(18748, 18749, 18750, 18762, 18764, 18777, 18778, 18778, 18748, 18752), class = "Date")), row.names = c(NA, -10L), class = "data.frame")

(I already converted dat$date to Date-class.)

  • Related