I have a data.frame (df) with two columns (Date & Count) which looks something like shown below:
Date Count
1/1/2022 5
1/2/2022 13
1/3/2022 21
1/4/2022 29
1/5/2022 37
1/6/2022 45
1/7/2022 53
1/8/2022 61
1/9/2022 69
1/10/2022 77
1/11/2022 85
1/12/2022 93
1/13/2022 101
1/14/2022 109
1/15/2022 117
Since I have single variable (count), the idea is to identify if there's been a change in mean in every three days, therefore I want to apply rolling t.test with a window of 3 days and save the resulting p-value next to Count column which I can plot later. Since I have seen people doing these sorts of tests with two variables usually, I can't figure out how to do it with a single variable.
For example, I saw this relevant answer here:
ttestFun <- function(dat) {
myTtest = t.test(x = dat[, 1], y = dat[, 2])
return(myTtest$p.value)
}
rollapply(df_ts, 7, FUN = ttestFun, fill = NA, by.column = FALSE)
But again, this is with two columns. Any guidance please?
CodePudding user response:
Irrespective of any discussion about the usefulness of the approach, given a fixed number of measurements of 3, you could just shift the counts by 3 and perform t-test between two columns as in your example, such as:
library(data.table)
set.seed(123)
dates <- seq(as.POSIXct("2022-01-01"), as.POSIXct("2022-02-01"), by = "1 day")
dt <- data.table(Date=dates, count = sample(1:200, length(dates), replace=TRUE), key="Date")
dt[, nxt:=shift(count, 3, type = "lead")]
dt[, group:=rep(1:ceiling(length(dates)/3), each=3)[seq_along(dates)]]
dt[, p:= tryCatch(t.test(count, nxt)$p.value, error=function(e) NA), by="group"][]
#> Date count nxt group p
#> 1: 2022-01-01 159 195 1 0.7750944
#> 2: 2022-01-02 179 170 1 0.7750944
#> 3: 2022-01-03 14 50 1 0.7750944
#> 4: 2022-01-04 195 118 2 0.2240362
#> 5: 2022-01-05 170 43 2 0.2240362
#> 6: 2022-01-06 50 14 2 0.2240362
#> 7: 2022-01-07 118 118 3 0.1763296
#> 8: 2022-01-08 43 153 3 0.1763296
#> 9: 2022-01-09 14 90 3 0.1763296
#> 10: 2022-01-10 118 91 4 0.8896343
#> 11: 2022-01-11 153 197 4 0.8896343
#> 12: 2022-01-12 90 91 4 0.8896343
#> 13: 2022-01-13 91 185 5 0.8065021
#> 14: 2022-01-14 197 92 5 0.8065021
#> 15: 2022-01-15 91 137 5 0.8065021
#> 16: 2022-01-16 185 99 6 0.1060465
#> 17: 2022-01-17 92 72 6 0.1060465
#> 18: 2022-01-18 137 26 6 0.1060465
#> 19: 2022-01-19 99 7 7 0.5283156
#> 20: 2022-01-20 72 170 7 0.5283156
#> 21: 2022-01-21 26 137 7 0.5283156
#> 22: 2022-01-22 7 164 8 0.9612965
#> 23: 2022-01-23 170 78 8 0.9612965
#> 24: 2022-01-24 137 81 8 0.9612965
#> 25: 2022-01-25 164 43 9 0.6111337
#> 26: 2022-01-26 78 103 9 0.6111337
#> 27: 2022-01-27 81 117 9 0.6111337
#> 28: 2022-01-28 43 76 10 0.6453494
#> 29: 2022-01-29 103 143 10 0.6453494
#> 30: 2022-01-30 117 NA 10 0.6453494
#> 31: 2022-01-31 76 NA 11 NA
#> 32: 2022-02-01 143 NA 11 NA
#> Date count nxt group p
Created on 2022-04-07 by the reprex package (v2.0.1)
You could further clean that up, e.g. by taking the first date per group:
dt[, .(Date=Date[1], count=round(mean(count), 2), p=p[1]), by="group"]
#> group Date count p
#> 1: 1 2022-01-01 117.33 0.7750944
#> 2: 2 2022-01-04 138.33 0.2240362
#> 3: 3 2022-01-07 58.33 0.1763296
#> 4: 4 2022-01-10 120.33 0.8896343
#> 5: 5 2022-01-13 126.33 0.8065021
#> 6: 6 2022-01-16 138.00 0.1060465
#> 7: 7 2022-01-19 65.67 0.5283156
#> 8: 8 2022-01-22 104.67 0.9612965
#> 9: 9 2022-01-25 107.67 0.6111337
#> 10: 10 2022-01-28 87.67 0.6453494
#> 11: 11 2022-01-31 109.50 NA
CodePudding user response:
You can create a grp, and then simply apply a t.test to each consecutive pair of groups:
d <- d %>% mutate(grp=rep(1:(n()/3), each=3))
d %>% left_join(
tibble(grp = 2:max(d$grp),
pval = sapply(2:max(d$grp), function(x) {
t.test(d %>% filter(grp==x) %>% pull(Count),
d %>% filter(grp==x-1) %>% pull(Count))$p.value
})
)) %>% group_by(grp) %>% slice_min(Date)
Output: (p-value is constant only because of the example data you provided)
Date Count grp pval
<date> <dbl> <int> <dbl>
1 2022-01-01 5 1 NA
2 2022-01-04 29 2 0.0213
3 2022-01-07 53 3 0.0213
4 2022-01-10 77 4 0.0213
5 2022-01-13 101 5 0.0213
Or a data.table approach:
setDT(d)[, `:=`(grp=rep(1:(nrow(d)/3), each=3),cy=shift(Count,3))] %>%
.[!is.na(cy), pval:=t.test(Count,cy)$p.value, by=grp] %>%
.[,.SD[1], by=grp, .SDcols=!c("cy")]
Output:
grp Date Count pval
<int> <Date> <num> <num>
1: 1 2022-01-01 5 NA
2: 2 2022-01-04 29 0.02131164
3: 3 2022-01-07 53 0.02131164
4: 4 2022-01-10 77 0.02131164
5: 5 2022-01-13 101 0.02131164