How to assign group numbers based on idx and diff in R?

Time:08-24

Here is my dataset. I want to assign group numbers based on idx and diff. Specifically, the group number should stay the same until diff goes over 14 days. That is, rows with the same idx whose diff is under 14 days should share a group, and rows with the same idx whose diff is over 14 days should start a different group.

library(dplyr)

idx = c("a","a","a","a","b","b","b","c","c","c","c")
date = c(20201115, 20201116, 20201117, 20201105, 20201107, 20201110, 20210113, 20160930, 20160504, 20160913, 20160927)
group = c("1","1","1","1","2","2","3","4","5","6","6")
df = data.frame(idx,date,group)
# Sort by idx, then date, so that lag() compares consecutive dates
df <- df %>% arrange(idx,date)
df$date <- as.Date(as.character(df$date), format='%Y%m%d')
# diff: days since the previous row within the same idx (NA for the first row)
df <- df %>% group_by(idx) %>%
  mutate(diff = date - lag(date))

This is the result I want.

CodePudding user response:

Use cumsum() to build a second grouping variable that increases each time diff reaches 14 days, then use cur_group_id() to number each (idx, cu) combination.

library(dplyr)
df %>% 
  group_by(idx) %>% 
  mutate(diff = difftime(date, lag(date, default = first(date)), units = "days"),
         cu = cumsum(diff >= 14)) %>% 
  group_by(idx, cu) %>% 
  mutate(group = cur_group_id()) %>% 
  ungroup() %>% 
  select(-cu)
# A tibble: 11 × 4
   idx   date       group diff    
   <chr> <date>     <int> <drtn>  
 1 a     2020-11-05     1   0 days
 2 a     2020-11-15     1  10 days
 3 a     2020-11-16     1   1 days
 4 a     2020-11-17     1   1 days
 5 b     2020-11-07     2   0 days
 6 b     2020-11-10     2   3 days
 7 b     2021-01-13     3  64 days
 8 c     2016-05-04     4   0 days
 9 c     2016-09-13     5 132 days
10 c     2016-09-27     6  14 days
11 c     2016-09-30     6   3 days
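
To see what the cumsum() step above is doing, here is a minimal base R sketch on a toy vector. The names gaps, starts_new, cu and cu_strict are made up for illustration; the values mirror the diff column for idx "c" in the output above.

```r
# Gaps in days for a single idx (illustrative values taken from idx "c")
gaps <- c(0, 132, 14, 3)

# TRUE marks a row that starts a new run; TRUE counts as 1 in cumsum()
starts_new <- gaps >= 14      # FALSE  TRUE  TRUE FALSE
cu <- cumsum(starts_new)      # 0 1 2 2

# Assumption: if "over 14 days" is meant strictly, use > instead of >=
cu_strict <- cumsum(gaps > 14)  # 0 1 1 1

print(cu)
print(cu_strict)
```

The two variants differ only on the row whose gap is exactly 14 days. The answer above uses >= 14, whereas the question says "over 14 days", so it is worth checking which boundary you actually want.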

CodePudding user response:

Because lag() has no previous value for the first row of each idx, the first diff in each idx group is NA. You can therefore use cumsum(diff >= 14 | is.na(diff)) on the ungrouped data to create the new group:

library(dplyr)

df %>%
  group_by(idx) %>% 
  mutate(diff = date - lag(date)) %>% 
  ungroup() %>%
  mutate(group = cumsum(diff >= 14 | is.na(diff)))

# # A tibble: 11 × 4
#    idx   date       diff     group
#    <chr> <date>     <drtn>   <int>
#  1 a     2020-11-05  NA days     1
#  2 a     2020-11-15  10 days     1
#  3 a     2020-11-16   1 days     1
#  4 a     2020-11-17   1 days     1
#  5 b     2020-11-07  NA days     2
#  6 b     2020-11-10   3 days     2
#  7 b     2021-01-13  64 days     3
#  8 c     2016-05-04  NA days     4
#  9 c     2016-09-13 132 days     5
# 10 c     2016-09-27  14 days     6
# 11 c     2016-09-30   3 days     6
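
If dplyr is not available, the same idea can probably be expressed in base R. The following is only a sketch under some assumptions: the data frame name df2 and the helper vector gap are invented here, and the dates are entered already in ISO format rather than converted from numbers.

```r
idx  <- c("a","a","a","a","b","b","b","c","c","c","c")
date <- as.Date(c("2020-11-15","2020-11-16","2020-11-17","2020-11-05",
                  "2020-11-07","2020-11-10","2021-01-13",
                  "2016-09-30","2016-05-04","2016-09-13","2016-09-27"))
df2 <- data.frame(idx, date)

# Sort by idx, then date, so consecutive rows are consecutive dates
df2 <- df2[order(df2$idx, df2$date), ]

# Gap in days to the previous row within each idx; NA for the first row
gap <- ave(as.numeric(df2$date), df2$idx,
           FUN = function(x) c(NA, diff(x)))

# A new group starts at the first row of each idx or when the gap is >= 14
df2$group <- cumsum(is.na(gap) | gap >= 14)

print(df2)
```

Here ave() plays the role of group_by() plus mutate(): it applies the function within each idx and returns a vector aligned with the rows of df2. The NA produced for the first row of each idx is what makes the running count increase when idx changes.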