How to apply rolling t.test in R on a single variable?-CodePudding

I have a data.frame (df) with two columns (Date & Count) which looks something like shown below:

Date       Count
1/1/2022    5   
1/2/2022    13  
1/3/2022    21  
1/4/2022    29  
1/5/2022    37  
1/6/2022    45  
1/7/2022    53  
1/8/2022    61  
1/9/2022    69  
1/10/2022   77  
1/11/2022   85  
1/12/2022   93  
1/13/2022   101 
1/14/2022   109 
1/15/2022   117

Since I have single variable (count), the idea is to identify if there's been a change in mean in every three days, therefore I want to apply rolling t.test with a window of 3 days and save the resulting p-value next to Count column which I can plot later. Since I have seen people doing these sorts of tests with two variables usually, I can't figure out how to do it with a single variable.

For example, I saw this relevant answer here:

ttestFun <- function(dat) {
   myTtest = t.test(x = dat[, 1], y = dat[, 2])
   return(myTtest$p.value)
 }

rollapply(df_ts, 7, FUN = ttestFun, fill = NA, by.column = FALSE)

But again, this is with two columns. Any guidance please?

CodePudding user response：

Irrespective of any discussion about the usefulness of the approach, given a fixed number of measurements of 3, you could just shift the counts by 3 and perform t-test between two columns as in your example, such as:

library(data.table)
set.seed(123)
dates <- seq(as.POSIXct("2022-01-01"), as.POSIXct("2022-02-01"), by = "1 day")
dt <- data.table(Date=dates, count = sample(1:200, length(dates), replace=TRUE), key="Date")
dt[, nxt:=shift(count, 3, type = "lead")]
dt[, group:=rep(1:ceiling(length(dates)/3), each=3)[seq_along(dates)]]
dt[, p:= tryCatch(t.test(count, nxt)$p.value, error=function(e) NA), by="group"][]
#>           Date count nxt group         p
#>  1: 2022-01-01   159 195     1 0.7750944
#>  2: 2022-01-02   179 170     1 0.7750944
#>  3: 2022-01-03    14  50     1 0.7750944
#>  4: 2022-01-04   195 118     2 0.2240362
#>  5: 2022-01-05   170  43     2 0.2240362
#>  6: 2022-01-06    50  14     2 0.2240362
#>  7: 2022-01-07   118 118     3 0.1763296
#>  8: 2022-01-08    43 153     3 0.1763296
#>  9: 2022-01-09    14  90     3 0.1763296
#> 10: 2022-01-10   118  91     4 0.8896343
#> 11: 2022-01-11   153 197     4 0.8896343
#> 12: 2022-01-12    90  91     4 0.8896343
#> 13: 2022-01-13    91 185     5 0.8065021
#> 14: 2022-01-14   197  92     5 0.8065021
#> 15: 2022-01-15    91 137     5 0.8065021
#> 16: 2022-01-16   185  99     6 0.1060465
#> 17: 2022-01-17    92  72     6 0.1060465
#> 18: 2022-01-18   137  26     6 0.1060465
#> 19: 2022-01-19    99   7     7 0.5283156
#> 20: 2022-01-20    72 170     7 0.5283156
#> 21: 2022-01-21    26 137     7 0.5283156
#> 22: 2022-01-22     7 164     8 0.9612965
#> 23: 2022-01-23   170  78     8 0.9612965
#> 24: 2022-01-24   137  81     8 0.9612965
#> 25: 2022-01-25   164  43     9 0.6111337
#> 26: 2022-01-26    78 103     9 0.6111337
#> 27: 2022-01-27    81 117     9 0.6111337
#> 28: 2022-01-28    43  76    10 0.6453494
#> 29: 2022-01-29   103 143    10 0.6453494
#> 30: 2022-01-30   117  NA    10 0.6453494
#> 31: 2022-01-31    76  NA    11        NA
#> 32: 2022-02-01   143  NA    11        NA
#>           Date count nxt group         p

^{Created on 2022-04-07 by the reprex package (v2.0.1)}

You could further clean that up, e.g. by taking the first date per group:

dt[, .(Date=Date[1], count=round(mean(count), 2), p=p[1]), by="group"]
#>     group       Date  count         p
#>  1:     1 2022-01-01 117.33 0.7750944
#>  2:     2 2022-01-04 138.33 0.2240362
#>  3:     3 2022-01-07  58.33 0.1763296
#>  4:     4 2022-01-10 120.33 0.8896343
#>  5:     5 2022-01-13 126.33 0.8065021
#>  6:     6 2022-01-16 138.00 0.1060465
#>  7:     7 2022-01-19  65.67 0.5283156
#>  8:     8 2022-01-22 104.67 0.9612965
#>  9:     9 2022-01-25 107.67 0.6111337
#> 10:    10 2022-01-28  87.67 0.6453494
#> 11:    11 2022-01-31 109.50        NA

CodePudding user response：

You can create a grp, and then simply apply a t.test to each consecutive pair of groups:

d <- d %>% mutate(grp=rep(1:(n()/3), each=3))
d %>% left_join(
  tibble(grp = 2:max(d$grp),
         pval = sapply(2:max(d$grp), function(x) {
           t.test(d %>% filter(grp==x) %>% pull(Count),
                  d %>% filter(grp==x-1) %>% pull(Count))$p.value
           })
  )) %>% group_by(grp) %>% slice_min(Date)

Output: (p-value is constant only because of the example data you provided)

  Date       Count   grp    pval
  <date>     <dbl> <int>   <dbl>
1 2022-01-01     5     1 NA     
2 2022-01-04    29     2  0.0213
3 2022-01-07    53     3  0.0213
4 2022-01-10    77     4  0.0213
5 2022-01-13   101     5  0.0213

Or a data.table approach:

setDT(d)[, `:=`(grp=rep(1:(nrow(d)/3), each=3),cy=shift(Count,3))] %>% 
  .[!is.na(cy), pval:=t.test(Count,cy)$p.value, by=grp] %>% 
  .[,.SD[1], by=grp, .SDcols=!c("cy")]

Output:

     grp       Date Count       pval
   <int>     <Date> <num>      <num>
1:     1 2022-01-01     5         NA
2:     2 2022-01-04    29 0.02131164
3:     3 2022-01-07    53 0.02131164
4:     4 2022-01-10    77 0.02131164
5:     5 2022-01-13   101 0.02131164