I am working with the R programming language. I have the following data:
library(dplyr)
my_data = data.frame(id = c(1,1,1,1,2,2,2,3,4,4,5,5,5,5,5), var_1 = sample(c(0,1), 15, replace = TRUE) , var_2 =sample(c(0,1), 15 , replace = TRUE) )
my_data = data.frame(my_data %>% group_by(id) %>% mutate(index = row_number(id)))
my_data = my_data[,c(1,4,2,3)]
The data looks something like this:
id index var_1 var_2
1 1 1 0 1
2 1 2 0 0
3 1 3 1 1
4 1 4 0 1
5 2 1 1 0
6 2 2 1 1
7 2 3 0 1
8 3 1 1 0
9 4 1 0 0
10 4 2 0 0
11 5 1 0 0
12 5 2 1 0
13 5 3 0 1
14 5 4 0 0
15 5 5 0 1
I want to create two new variables (v_1, v_2). For each unique "id":
v_1: I want v_1 to be the average value of the current, previous and previous-to-previous values of var_1 (i.e. index = n, index = n-1 and index = n-2). When this is not possible (e.g. for index = 2 and index = 1), I want this average to be for as back as you can go.
v_2: I want v_2 to be the average value of the current, previous and previous-to-previous values of var_2 (i.e. index = n, index = n-1 and index = n-2). When this is not possible (e.g. for index = 2 and index = 1), I want this average to be for as back as you can go.
This would be something like this:
- row 1 (id = 1, index = 1) : v_1 = var_1 (index 1)
- row 2 (id = 1, index = 1 ; id = 1 index = 2) : v_1 = (var_1 (index 1) var_1 (index 2))/2
- row 3 (id = 1, index = 1 ; id = 1 index = 2; id = 1, index = 3) : v_1 = (var_1 (index 1) var_1 (index 2) var_1 (index 3)) /3
- row 4 (id = 1, index = 2 ; id = 1 index = 3; id = 1, index = 4) : v_1 = (var_1 (index 2) var_1 (index 3) var_1 (index 4)) /3
- etc.
I tried to do this with the following code:
average_data = my_data %>%
group_by(id) %>%
summarise(v_1 = mean(tail(var_1, 3)),
v_2 = mean(tail(var_2, 3)))
# final_result
final_data = merge(x = my_data, y = average_data, by = "id", all.x = TRUE)
But I am not sure if this is correct.
Can someone please show me how to do this?
Thanks!
CodePudding user response:
You could create a function that acomplishes this:
library(tidyverse)
fun <- function(x, k){
y <- cummean(first(x, k-1))
if(k > length(x)) y else c(y, zoo::rollmean(x, k))
}
df %>%
group_by(id) %>%
mutate(v_1 = fun(var_1, 3), v_2 = fun(var_2, 3))
# Groups: id [5]
id index var_1 var_2 v_1 v_2
<int> <int> <int> <int> <dbl> <dbl>
1 1 1 0 1 0 1
2 1 2 0 0 0 0.5
3 1 3 1 1 0.333 0.667
4 1 4 0 1 0.333 0.667
5 2 1 1 0 1 0
6 2 2 1 1 1 0.5
7 2 3 0 1 0.667 0.667
8 3 1 1 0 1 0
9 4 1 0 0 0 0
10 4 2 0 0 0 0
11 5 1 0 0 0 0
12 5 2 1 0 0.5 0
13 5 3 0 1 0.333 0.333
14 5 4 0 0 0.333 0.333
15 5 5 0 1 0 0.667
CodePudding user response:
data
df <- data.frame(
id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L, 5L, 5L, 5L, 5L, 5L),
index = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 1L, 1L, 2L, 1L, 2L, 3L, 4L, 5L),
var_1 = c(0L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L),
var_2 = c(1L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L)
)
tidyverse
library(tidyverse)
df %>%
group_by(id) %>%
mutate(across(starts_with("var_"),
.fns = ~zoo::rollapply(data = .x, width = 3, FUN = mean, partial = TRUE, align = "right"),
.names = "new_{.col}")) %>%
ungroup()
#> # A tibble: 15 × 6
#> id index var_1 var_2 new_var_1 new_var_2
#> <int> <int> <int> <int> <dbl> <dbl>
#> 1 1 1 0 1 0 1
#> 2 1 2 0 0 0 0.5
#> 3 1 3 1 1 0.333 0.667
#> 4 1 4 0 1 0.333 0.667
#> 5 2 1 1 0 1 0
#> 6 2 2 1 1 1 0.5
#> 7 2 3 0 1 0.667 0.667
#> 8 3 1 1 0 1 0
#> 9 4 1 0 0 0 0
#> 10 4 2 0 0 0 0
#> 11 5 1 0 0 0 0
#> 12 5 2 1 0 0.5 0
#> 13 5 3 0 1 0.333 0.333
#> 14 5 4 0 0 0.333 0.333
#> 15 5 5 0 1 0 0.667
Created on 2022-06-06 by the reprex package (v2.0.1)
data.table
library(data.table)
COLS <- gsub("ar", "", grep("var_", names(df), value = TRUE))
setDT(df)[,
(COLS) := lapply(.SD, function(x) zoo::rollapply(data = x, width = 3, FUN = mean, partial = TRUE, align = "right")),
by = id,
.SDcols = patterns("var_")][]
#> id index var_1 var_2 v_1 v_2
#> 1: 1 1 0 1 0.0000000 1.0000000
#> 2: 1 2 0 0 0.0000000 0.5000000
#> 3: 1 3 1 1 0.3333333 0.6666667
#> 4: 1 4 0 1 0.3333333 0.6666667
#> 5: 2 1 1 0 1.0000000 0.0000000
#> 6: 2 2 1 1 1.0000000 0.5000000
#> 7: 2 3 0 1 0.6666667 0.6666667
#> 8: 3 1 1 0 1.0000000 0.0000000
#> 9: 4 1 0 0 0.0000000 0.0000000
#> 10: 4 2 0 0 0.0000000 0.0000000
#> 11: 5 1 0 0 0.0000000 0.0000000
#> 12: 5 2 1 0 0.5000000 0.0000000
#> 13: 5 3 0 1 0.3333333 0.3333333
#> 14: 5 4 0 0 0.3333333 0.3333333
#> 15: 5 5 0 1 0.0000000 0.6666667
Created on 2022-06-06 by the reprex package (v2.0.1)