I am trying to summarize data within time intervals using a data set with conditions repeated over time at varying intervals. I would like to get means and standard deviations within time intervals for each of the conditions.
Some data.However, in my real data I don't know how many intervals of each condition there will be. I thought perhaps I could indicate the end of an interval by a change in Condition from one row to the next row. But I don't know how to code that.
library(tidyverse)
df <- data.frame(Condition = c(rep("A", 50),
rep("B", 60),
rep("C", 50),
rep("A", 60),
rep("B", 50),
rep("C", 50)),
Time = c(seq(160, 190, length.out = 50),
seq(190.05, 230, length.out = 60),
seq(230.05, 260, length.out = 50),
seq(260.05, 293, length.out = 60),
seq(293.05, 321, length.out = 50),
seq(321.05, 352, length.out = 50))
) %>%
rowwise() %>%
mutate(X = rnorm(1.4, 0.3))
I'm trying to calculate mean(X) and sd(X) for each interval of Condition (made up numbers):
Condition interval mean(X) sd(X)
A [160,190] 1.4 0.32
B [190.05,230] 1.46 0.36
C [230.05,260] 1.32 0.26
A [260.05,293] 1.5 0.40
B [293.05,321] 1.25 0.34
C [321.05,352] 1.43 0.41
I've tried this, but it doesn't do what I need:
df %>%
group_by(Condition) %>%
mutate(interval = cut(Time,
breaks = c(floor(min(Time)), ceiling(max(Time))),
include.lowest = F,
right = F)) %>%
group_by(Condition, interval) %>%
summarise( mean.X = mean(X),
sd.X = sd(X))
This doesn't give me the second intervals for each Condition:
Condition interval mean.X sd.X
<chr> <fct> <dbl> <dbl>
1 A [160,293) 0.231 0.991
2 A NA 1.61 NA
3 B [190,321) 0.421 0.893
4 B NA 0.249 NA
5 C [230,352) 0.193 0.898
6 C NA 0.427 NA
Any suggestions?
CodePudding user response:
We can use rle
to define "groups" of your Condition.
library(dplyr)
df %>%
ungroup() %>%
mutate(group = rep(1:length(rle(Condition)$lengths), rle(Condition)$lengths)) %>%
group_by(group) %>%
summarize(Condition = unique(Condition),
interval = paste0("[", range(Time)[1], ",", range(Time)[2], "]"),
mean_X = mean(X),
sd_X = sd(X))
# A tibble: 6 × 5
group Condition interval mean_X sd_X
<int> <chr> <chr> <dbl> <dbl>
1 1 A [160,190] 0.160 0.926
2 2 B [190.05,230] 0.0258 0.990
3 3 C [230.05,260] 0.296 1.03
4 4 A [260.05,293] 0.472 1.08
5 5 B [293.05,321] 0.0363 1.08
6 6 C [321.05,352] 0.361 1.10
CodePudding user response:
The reasons that you have the 2nd interval group with NA values is due to your input to cut
function where right = F
which result records with Time == max(Time)
would be excluded from the interval output.
df %>%
group_by(Condition) %>%
mutate(interval = cut(Time,
breaks = c(floor(min(Time)), ceiling(max(Time))),
include.lowest = F, right = F)) %>%
filter(is.na(interval))
#> # A tibble: 3 x 4
#> # Groups: Condition [3]
#> Condition Time X interval
#> <chr> <dbl> <dbl> <fct>
#> 1 A 293 -1.52 <NA>
#> 2 B 321 1.35 <NA>
#> 3 C 352 0.758 <NA>
As you can se aboe about there are one record having NA
interval for each group.
If you change cut
param to right = T
and include.lowest = T
then you would included all of them.
df %>%
group_by(Condition) %>%
mutate(interval = cut(Time,
breaks = c(floor(min(Time)), ceiling(max(Time))),
include.lowest = T, right = T)) %>%
group_by(Condition, interval) %>%
summarise( mean.X = mean(X),
sd.X = sd(X))
#> # A tibble: 3 x 4
#> # Groups: Condition [3]
#> Condition interval mean.X sd.X
#> <chr> <fct> <dbl> <dbl>
#> 1 A [160,293] 0.230 0.963
#> 2 B [190,321] 0.124 0.961
#> 3 C [230,352] 0.146 0.961
If this is not what you expected, please clarify more on how you would like the interval to be.,
Created on 2022-05-16 by the reprex package (v2.0.1)
CodePudding user response:
I definitely think there should be a less messy way to do it, but kmeans()
gives the following possible solution:
library(tidyverse)
set.seed(100)
df <- data.frame(Condition = c(rep("A", 50),
rep("B", 60),
rep("C", 50),
rep("A", 60),
rep("B", 50),
rep("C", 50)),
Time = c(seq(160, 190, length.out = 50),
seq(190.05, 230, length.out = 60),
seq(230.05, 260, length.out = 50),
seq(260.05, 293, length.out = 60),
seq(293.05, 321, length.out = 50),
seq(321.05, 352, length.out = 50))
) %>%
rowwise() %>%
mutate(X = rnorm(1.4, 0.3))
df %>%
group_by(Condition) %>%
mutate(Block = kmeans(Time, 2)$cluster) %>%
group_by(Condition, Block) %>%
mutate(interval = as.character(cut(Time,
breaks = c(floor(min(Time)), ceiling(max(Time))),
include.lowest = T,
right = T))) %>%
group_by(Condition, interval) %>%
summarise(mean.X = mean(X),
sd.X = sd(X)) %>%
arrange(Condition, interval)
#> `summarise()` has grouped output by 'Condition'. You can override using the
#> `.groups` argument.
#> # A tibble: 6 × 4
#> # Groups: Condition [3]
#> Condition interval mean.X sd.X
#> <chr> <chr> <dbl> <dbl>
#> 1 A [160,190] 0.382 0.819
#> 2 A [260,293] 0.277 0.940
#> 3 B [190,230] 0.229 1.14
#> 4 B [293,321] 0.303 1.08
#> 5 C [230,260] 0.265 0.755
#> 6 C [321,352] 0.301 0.900
It's up to you how the NA
s are dealt with.
Edit 1:
Added @Sinh Nguyen's cut
improvements.
Edit 2: In response to updated question:
We can use the rleid()
function from data.table
library(tidyverse)
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
#> The following object is masked from 'package:purrr':
#>
#> transpose
set.seed(100)
df <- data.frame(Condition = c(rep("A", 50),
rep("B", 60),
rep("C", 50),
rep("A", 60),
rep("B", 50),
rep("C", 50)),
Time = c(seq(160, 190, length.out = 50),
seq(190.05, 230, length.out = 60),
seq(230.05, 260, length.out = 50),
seq(260.05, 293, length.out = 60),
seq(293.05, 321, length.out = 50),
seq(321.05, 352, length.out = 50))
) %>%
rowwise() %>%
mutate(X = rnorm(1.4, 0.3))
Block <- rleid(df$Condition)
df %>%
add_column(Block) %>%
group_by(Condition, Block) %>%
mutate(interval = paste0("[", min(Time), ",", max(Time), "]")) %>%
group_by(Condition, interval) %>%
summarise(mean.X = mean(X), sd.X = sd(X))
#> `summarise()` has grouped output by 'Condition'. You can override using the
#> `.groups` argument.
#> # A tibble: 6 × 4
#> # Groups: Condition [3]
#> Condition interval mean.X sd.X
#> <chr> <chr> <dbl> <dbl>
#> 1 A [160,190] 0.382 0.819
#> 2 A [260.05,293] 0.277 0.940
#> 3 B [190.05,230] 0.229 1.14
#> 4 B [293.05,321] 0.303 1.08
#> 5 C [230.05,260] 0.265 0.755
#> 6 C [321.05,352] 0.301 0.900