Summarizing within Conditions Repeated over Time-CodePudding

I am trying to summarize data within time intervals using a data set with conditions repeated over time at varying intervals. I would like to get means and standard deviations within time intervals for each of the conditions.

Some data.However, in my real data I don't know how many intervals of each condition there will be. I thought perhaps I could indicate the end of an interval by a change in Condition from one row to the next row. But I don't know how to code that.

library(tidyverse)

df <- data.frame(Condition = c(rep("A", 50), 
                               rep("B", 60), 
                               rep("C", 50),
                               rep("A", 60), 
                               rep("B", 50), 
                               rep("C", 50)),
                 Time = c(seq(160, 190, length.out = 50), 
                          seq(190.05, 230, length.out = 60), 
                          seq(230.05, 260, length.out = 50),
                          seq(260.05, 293, length.out = 60), 
                          seq(293.05, 321, length.out = 50), 
                          seq(321.05, 352, length.out = 50))
) %>%
        rowwise() %>%
        mutate(X = rnorm(1.4, 0.3))

I'm trying to calculate mean(X) and sd(X) for each interval of Condition (made up numbers):

Condition   interval        mean(X)   sd(X)
A            [160,190]       1.4      0.32
B            [190.05,230]    1.46     0.36
C            [230.05,260]    1.32     0.26
A            [260.05,293]    1.5      0.40
B            [293.05,321]    1.25     0.34
C            [321.05,352]    1.43     0.41

I've tried this, but it doesn't do what I need:

df %>%  
        group_by(Condition) %>%
        mutate(interval = cut(Time,
                              breaks = c(floor(min(Time)), ceiling(max(Time))),
                              include.lowest = F, 
                              right = F)) %>%
        group_by(Condition, interval) %>% 
        summarise( mean.X = mean(X),
                   sd.X = sd(X))

This doesn't give me the second intervals for each Condition:

  Condition interval  mean.X   sd.X
  <chr>     <fct>      <dbl>  <dbl>
1 A         [160,293)  0.231  0.991
2 A         NA         1.61  NA    
3 B         [190,321)  0.421  0.893
4 B         NA         0.249 NA    
5 C         [230,352)  0.193  0.898
6 C         NA         0.427 NA

Any suggestions?

CodePudding user response：

We can use rle to define "groups" of your Condition.

library(dplyr)

df %>% 
  ungroup() %>% 
  mutate(group = rep(1:length(rle(Condition)$lengths), rle(Condition)$lengths)) %>% 
  group_by(group) %>% 
  summarize(Condition = unique(Condition),
            interval = paste0("[", range(Time)[1], ",", range(Time)[2], "]"), 
            mean_X = mean(X), 
            sd_X = sd(X))

# A tibble: 6 × 5
  group Condition interval     mean_X  sd_X
  <int> <chr>     <chr>         <dbl> <dbl>
1     1 A         [160,190]    0.160  0.926
2     2 B         [190.05,230] 0.0258 0.990
3     3 C         [230.05,260] 0.296  1.03 
4     4 A         [260.05,293] 0.472  1.08 
5     5 B         [293.05,321] 0.0363 1.08 
6     6 C         [321.05,352] 0.361  1.10

CodePudding user response：

The reasons that you have the 2nd interval group with NA values is due to your input to cut function where right = F which result records with Time == max(Time) would be excluded from the interval output.

df %>%  
  group_by(Condition) %>%
  mutate(interval = cut(Time,
                        breaks = c(floor(min(Time)), ceiling(max(Time))),
                        include.lowest = F, right = F)) %>%
  filter(is.na(interval))
#> # A tibble: 3 x 4
#> # Groups:   Condition [3]
#>   Condition  Time      X interval
#>   <chr>     <dbl>  <dbl> <fct>   
#> 1 A           293 -1.52  <NA>    
#> 2 B           321  1.35  <NA>    
#> 3 C           352  0.758 <NA>

As you can se aboe about there are one record having NA interval for each group. If you change cut param to right = T and include.lowest = T then you would included all of them.

df %>%  
  group_by(Condition) %>%
  mutate(interval = cut(Time,
                        breaks = c(floor(min(Time)), ceiling(max(Time))),
                        include.lowest = T, right = T)) %>%
  group_by(Condition, interval) %>% 
  summarise( mean.X = mean(X),
             sd.X = sd(X))

#> # A tibble: 3 x 4
#> # Groups:   Condition [3]
#>   Condition interval  mean.X  sd.X
#>   <chr>     <fct>      <dbl> <dbl>
#> 1 A         [160,293]  0.230 0.963
#> 2 B         [190,321]  0.124 0.961
#> 3 C         [230,352]  0.146 0.961

If this is not what you expected, please clarify more on how you would like the interval to be.,

^{Created on 2022-05-16 by the reprex package (v2.0.1)}

CodePudding user response：

I definitely think there should be a less messy way to do it, but kmeans() gives the following possible solution:

library(tidyverse)

set.seed(100)
df <- data.frame(Condition = c(rep("A", 50), 
                               rep("B", 60), 
                               rep("C", 50),
                               rep("A", 60), 
                               rep("B", 50), 
                               rep("C", 50)),
                 Time = c(seq(160, 190, length.out = 50), 
                          seq(190.05, 230, length.out = 60), 
                          seq(230.05, 260, length.out = 50),
                          seq(260.05, 293, length.out = 60), 
                          seq(293.05, 321, length.out = 50), 
                          seq(321.05, 352, length.out = 50))
) %>%
  rowwise() %>%
  mutate(X = rnorm(1.4, 0.3))

df %>% 
  group_by(Condition) %>% 
  mutate(Block = kmeans(Time, 2)$cluster) %>% 
  group_by(Condition, Block) %>% 
  mutate(interval = as.character(cut(Time,
                        breaks = c(floor(min(Time)), ceiling(max(Time))),
                        include.lowest = T, 
                        right = T))) %>%
  group_by(Condition, interval) %>% 
  summarise(mean.X = mean(X),
            sd.X = sd(X)) %>% 
  arrange(Condition, interval)
#> `summarise()` has grouped output by 'Condition'. You can override using the
#> `.groups` argument.
#> # A tibble: 6 × 4
#> # Groups:   Condition [3]
#>   Condition interval  mean.X  sd.X
#>   <chr>     <chr>      <dbl> <dbl>
#> 1 A         [160,190]  0.382 0.819
#> 2 A         [260,293]  0.277 0.940
#> 3 B         [190,230]  0.229 1.14 
#> 4 B         [293,321]  0.303 1.08 
#> 5 C         [230,260]  0.265 0.755
#> 6 C         [321,352]  0.301 0.900

It's up to you how the NAs are dealt with.

Edit 1:

Added @Sinh Nguyen's cut improvements.

Edit 2: In response to updated question:

We can use the rleid() function from data.table

library(tidyverse)
library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
#> The following object is masked from 'package:purrr':
#> 
#>     transpose

set.seed(100)
df <- data.frame(Condition = c(rep("A", 50), 
                               rep("B", 60), 
                               rep("C", 50),
                               rep("A", 60), 
                               rep("B", 50), 
                               rep("C", 50)),
                 Time = c(seq(160, 190, length.out = 50), 
                          seq(190.05, 230, length.out = 60), 
                          seq(230.05, 260, length.out = 50),
                          seq(260.05, 293, length.out = 60), 
                          seq(293.05, 321, length.out = 50), 
                          seq(321.05, 352, length.out = 50))
) %>%
  rowwise() %>%
  mutate(X = rnorm(1.4, 0.3))

Block <- rleid(df$Condition)
df %>% 
  add_column(Block) %>% 
  group_by(Condition, Block) %>% 
  mutate(interval = paste0("[", min(Time), ",", max(Time), "]")) %>%
  group_by(Condition, interval) %>% 
  summarise(mean.X = mean(X), sd.X = sd(X))
#> `summarise()` has grouped output by 'Condition'. You can override using the
#> `.groups` argument.
#> # A tibble: 6 × 4
#> # Groups:   Condition [3]
#>   Condition interval     mean.X  sd.X
#>   <chr>     <chr>         <dbl> <dbl>
#> 1 A         [160,190]     0.382 0.819
#> 2 A         [260.05,293]  0.277 0.940
#> 3 B         [190.05,230]  0.229 1.14 
#> 4 B         [293.05,321]  0.303 1.08 
#> 5 C         [230.05,260]  0.265 0.755
#> 6 C         [321.05,352]  0.301 0.900