So I have a data frame in the general structure below:
dataframe:
rownum | group | date |
---|---|---|
1 | a | 2021-05-01 |
2 | a | 2021-05-02 |
3 | a | 2021-05-03 |
4 | b | 2021-05-15 |
5 | b | 2021-05-17 |
6 | b | 2021-05-30 |
7 | b | 2021-05-31 |
8 | b | 2021-05-31 |
9 | c | 2021-05-01 |
10 | c | 2021-05-05 |
What I would like to do is, WITHIN GROUP, compare the first row to the next row, until the difference between the dates meets some threshold, say 10 days. Then, once that row meets the threshold, I'd like to then test the next row against the subsequent row. It would look like this:
Result, using threshold of 10:
|rownum|group |date |date diff|
|------|------|-----------|---|
|1 | a |2021-05-01 |NA|
|2 | a |2021-05-02 |1|
|3 | a |2021-05-03 |2|
|4 | b |2021-05-15 |NA|
|5 | b |2021-05-17 |2|
|6 | b |2021-05-30 |15 (meets criteria, start from row 7 now)|
|7 | b |2021-05-31 | NA|
|8 | b |2021-05-31 | 0|
|9 | c |2021-05-01 | NA|
|10 | c |2021-05-05 | 4|
So to reiterate, its comparing the FIRST row of a group to subsequent rows until some threshold is met. Then the count starts over at the first rep after that within the group to subsequent rows within the group. The difference is recorded as datediff.
I've tried this but I dont know if sapply is the way to go:
dataframe %>%
group_by(group) %>%
mutate(
datediff = sapply(date, function(x) {
all(difftime(dataframe$date,dplyr::lag(dataframe, n = 1, default = NA)))
}
)
)
Also tried this, which I think is closer to what I want:
for (m in 1:length(dataframe)) {
dataframe <- dataframe %>%
group_by(group) %>%
rowwise() %>%
mutate(datediff = difftime(dataframe$date,dplyr::lag(date, n = m, default = NA), units="days"))
}
So far I havent been able to get the right rowwise comparison to even implement the thresholding bit.
CodePudding user response:
Here's a roundabout way of getting what you're looking for, where some of your NA
are set to 0
using this solution:
library(tidyverse)
df %>%
group_by(group) %>%
mutate(date = as.Date(date),
date_diff = date - first(date),
flag = date_diff > 10) %>%
group_by(group, flag) %>%
mutate(temp_group = cur_group_id()) %>%
group_by(temp_group) %>%
mutate(date_diff = case_when(date_diff == first(date_diff) ~ date_diff,
date_diff != first(date_diff) & date_diff < 10 ~ date - first(date),
date_diff != first(date_diff) & date_diff > 10 ~ date - nth(date, 2))) %>%
ungroup() %>%
select(group, date, date_diff)
# A tibble: 10 x 3
group date date_diff
<chr> <date> <drtn>
1 a 2021-05-01 0 days
2 a 2021-05-02 1 days
3 a 2021-05-03 2 days
4 b 2021-05-15 0 days
5 b 2021-05-17 2 days
6 b 2021-05-30 15 days
7 b 2021-05-31 0 days
8 b 2021-05-31 0 days
9 c 2021-05-01 0 days
10 c 2021-05-05 4 days
CodePudding user response:
base R
func <- function(x, threshold = 10) {
r <- rle(c(0, diff(x)) > threshold)
if ((len <- length(r$values)) > 1) {
r$lengths[len] <- r$lengths[len] - 1L
r$lengths[1] <- r$lengths[1] 1L
}
cumsum(inverse.rle(r))
}
dat$group2 <- ave(as.numeric(dat$date), dat$group, FUN = func)
dat$datediff <- ave(as.numeric(dat$date), dat[,c("group", "group2")], FUN = function(x) c(NA, (x - x[1])[-1]))
dat$group2 <- NULL
dat
# rownum group date datediff
# 1 1 a 2021-05-01 NA
# 2 2 a 2021-05-02 1
# 3 3 a 2021-05-03 2
# 4 4 b 2021-05-15 NA
# 5 5 b 2021-05-17 2
# 6 6 b 2021-05-30 15
# 7 7 b 2021-05-31 NA
# 8 8 b 2021-05-31 0
# 9 9 c 2021-05-01 NA
# 10 10 c 2021-05-05 4
dplyr
library(dplyr)
dat %>%
group_by(group) %>%
mutate(group2 = func(date)) %>%
group_by(group, group2) %>%
mutate(datediff = c(NA, (date - date[1])[-1])) %>%
ungroup() %>%
select(-group2)
# # A tibble: 10 x 4
# rownum group date datediff
# <int> <chr> <date> <dbl>
# 1 1 a 2021-05-01 NA
# 2 2 a 2021-05-02 1
# 3 3 a 2021-05-03 2
# 4 4 b 2021-05-15 NA
# 5 5 b 2021-05-17 2
# 6 6 b 2021-05-30 15
# 7 7 b 2021-05-31 NA
# 8 8 b 2021-05-31 0
# 9 9 c 2021-05-01 NA
# 10 10 c 2021-05-05 4
Data
dat <- structure(list(rownum = 1:10, group = c("a", "a", "a", "b", "b", "b", "b", "b", "c", "c"), date = structure(c(18748, 18749, 18750, 18762, 18764, 18777, 18778, 18778, 18748, 18752), class = "Date")), row.names = c(NA, -10L), class = "data.frame")
(I already converted dat$date
to Date
-class.)