I'm struggling to work out how to take the mean of every 4 consecutive rows, for multiple variables in my data frame, after sorting by date. Here is a toy example:
test = data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "A", "A","A","A","A","A" , "C"),
measure1 = c(10:24),
measure2 = c(1:15),
time= c("20-09-2020", "25-09-2020", "19-09-2020", "20-05-2020", "20-06-2021",
"11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021",
"20-03-2021", "20-10-2021", "29-06-2021", "20-07-2021", "13-06-2021"))
#    my_groups measure1 measure2       time
# 1          A       18        9 15-01-2021
# 2          A       19       10 15-01-2021
# 3          A       12        3 19-09-2020
# 4          A       20       11 20-03-2021
# 5          A       23       14 20-07-2021
# 6          A       10        1 20-09-2020
# 7          A       21       12 20-10-2021
# 8          A       11        2 25-09-2020
# 9          A       22       13 29-06-2021
# 10         B       13        4 20-05-2020
# 11         B       14        5 20-06-2021
# 12         C       15        6 11-01-2021
# 13         C       16        7 13-01-2021
# 14         C       17        8 13-01-2021
# 15         C       24       15 13-06-2021
As a result, I'd like something like:
#   my_groups measure1 measure2       time
# 1         A    17.25     8.25 20-03-2021  # mean of the first 4 elements of A and maximum date
# 2         A    16.25     7.25 25-09-2020  # mean of the next 4 elements of A and maximum date
# 3         C       18        9 13-06-2021  # mean of the first 4 elements of C and maximum date
To get this result I thought of using something like this:
test %>%
arrange( my_groups,time) %>%
group_by(my_groups) %>%
summarise(measure1 = mean(measure1),
measure2 = mean(measure2),
time = max(time))
But I can't find a way to compute these measures over each run of 4 consecutive rows.
Any hints on how I can do that?
CodePudding user response:
We can add another grouping variable, g4, using integer division to ensure that we get groups of 4 and, if desired, drop groups with fewer members. Then just run your summarise.
library(dplyr)
test %>%
arrange(my_groups, time) %>%
group_by(my_groups) %>%
mutate(g4 = (row_number() - 1) %/% 4) %>%
group_by(my_groups, g4) %>%
filter(n() == 4) %>%
summarise(measure1 = mean(measure1),
measure2 = mean(measure2),
time = max(time),
.groups = "drop")
#> # A tibble: 3 × 5
#> my_groups g4 measure1 measure2 time
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 A 0 17.2 8.25 20-03-2021
#> 2 A 1 16.2 7.25 25-09-2020
#> 3 C 0 18 9 13-06-2021
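To see why the integer division yields blocks of 4, it can help to look at the index on its own. The following is a minimal base R sketch (the names `idx` and `g4` are only for illustration), mimicking what row_number() would give for the 9 rows of group A:

```r
# Positions 1..9, as row_number() would give for the 9 rows of group A
idx <- seq_len(9)

# Subtract 1 so that positions 1-4 fall into block 0, 5-8 into block 1, and so on
g4 <- (idx - 1) %/% 4
print(g4)
#> [1] 0 0 0 0 1 1 1 1 2

# Count rows per block: the last block holds a single row
block_sizes <- table(g4)
print(block_sizes)
```

The trailing block (here block 2, with one row) is what filter(n() == 4) drops, which is also why group B, with only 2 rows, disappears from the result.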
CodePudding user response:
Function "f" creates a sliding window (if you use windowing functions a lot, I really recommend the slider package). Function "g" applies "f" to every numeric column. The final group_modify step applies these functions to each group, and the filter drops the rows that do not close a complete window of 4.
library(slider)
library(lubridate)
library(dplyr)
library(tidyr)
library(purrr)
test = data.frame(my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "A", "A","A","A","A","A" , "C"),
measure1 = c(10:24),
measure2 = c(1:15),
time= dmy(c("20-09-2020", "25-09-2020", "19-09-2020", "20-05-2020", "20-06-2021",
"11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021",
"20-03-2021", "20-10-2021", "29-06-2021", "20-07-2021", "13-06-2021"))) %>%
arrange(my_groups, time) %>%
group_by(my_groups)
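# Mean over a window of 4 rows (the current row plus the 3 before it), evaluated every 4th row
f <- function(x) slide_mean(x, before = 3, complete = TRUE, step = 4)
# Apply f to every numeric column; other columns (time) pass through unchanged
g <- function(y) y %>% map_if(is.numeric, f) %>% bind_cols()
x <- test %>% group_modify(~ g(.x)) %>% filter(!(is.na(measure1) | is.na(measure2)))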
my_groups measure1 measure2 time
<chr> <dbl> <dbl> <date>
1 A 12.8 3.75 2021-01-15
2 A 21 12 2021-07-20
3 C 18 9 2021-06-13
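To see the window that "f" builds in isolation, here is a small sketch on a plain vector (the input 1:9 and the names `v` and `res` are made up for illustration). With before = 3 each window holds the current element and the three before it, complete = TRUE skips windows with fewer than 4 elements, and step = 4 evaluates only every 4th position, leaving NA elsewhere:

```r
library(slider)

# Toy input: 9 values, like the 9 rows of group A
v <- 1:9

# Window of 4 (current + 3 before), only complete windows, evaluated every 4th position
res <- slide_mean(v, before = 3, complete = TRUE, step = 4)
print(res)
#> [1]  NA  NA  NA 2.5  NA  NA  NA 6.5  NA
```

Positions 4 and 8 carry the means of elements 1-4 and 5-8, and the NA rows are the ones the final filter removes. Because the data are already sorted by date, the time value on those rows is also the latest date in each block.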
My results differ from the answer above because I converted the dates from character to Date, so they sort chronologically rather than as text.
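To illustrate that remark, here is a hedged sketch of the integer-division approach from the first answer with the dates parsed before sorting. It assumes the toy data frame from the question and uses base R's as.Date with the format "%d-%m-%Y"; the name `res` is only illustrative.

```r
library(dplyr)

# Toy data from the question, with time still stored as "dd-mm-yyyy" text
test <- data.frame(
  my_groups = c("A", "A", "A", "B", "B", "C", "C", "C", "A", "A", "A", "A", "A", "A", "C"),
  measure1 = 10:24,
  measure2 = 1:15,
  time = c("20-09-2020", "25-09-2020", "19-09-2020", "20-05-2020", "20-06-2021",
           "11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021",
           "20-03-2021", "20-10-2021", "29-06-2021", "20-07-2021", "13-06-2021")
)

res <- test %>%
  # Parse the text into Date so that arrange() and max() work chronologically
  mutate(time = as.Date(time, format = "%d-%m-%Y")) %>%
  arrange(my_groups, time) %>%
  group_by(my_groups) %>%
  # Blocks of 4 consecutive rows within each group
  mutate(g4 = (row_number() - 1) %/% 4) %>%
  group_by(my_groups, g4) %>%
  # Keep only complete blocks
  filter(n() == 4) %>%
  summarise(measure1 = mean(measure1),
            measure2 = mean(measure2),
            time = max(time),
            .groups = "drop")
print(res)
```

With the dates parsed, the means come out as 12.75 and 21 for the two blocks of A and 18 for C, matching the slider result rather than the text-sorted one.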