I have a dataset in long format. Every subject in the dataset was observed a number of times (from 1 to 3) during the week. The observations occurred once a day from Monday to Friday based on the subject's availability. If a case was observed less than 3 days, I need to generate empty rows for the missing observation(s). If a case was observed on Monday and Friday, the third observation should randomly occur on Tuesday, Wednesday, or Thursday. The code below works well except for one fundamental flaw: when days are randomly generated, they might overlap with days in which subjects were already observed. This should not happen. Feel free to suggest completely different code, should my solution be too complex to edit. Thanks!
#Toy dataset
dataset_long <- data.frame(
id = c(1, 1, 2, 2, 2, 3, 3, 4, 5, 5),
observation = c(1, 2, 1, 2, 3, 1, 2, 1, 1, 2),
day_name = c("Monday", "Tuesday", "Monday", "Wednesday", "Thursday", "Tuesday", "Thursday", "Wednesday", "Monday", "Friday"),
scores = sample(20:60, 10)
)
# case observation day scores
#1 1 1 Monday 32
#2 1 2 Tuesday 31
#3 2 1 Monday 29
#4 2 2 Wednesday 28
#5 2 3 Thursday 22
#6 3 1 Tuesday 45
#7 3 2 Thursday 30
#8 4 1 Wednesday 36
#9 5 1 Monday 58
#10 5 2 Friday 37
#Identify cases with fewer than 3 observations
cases_to_fill <- dataset_long %>%
group_by(case) %>%
summarize(days_observed = n()) %>%
filter(days_observed < 3) %>%
ungroup()
#Create vector with day names
all_days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
#Fill the missing observation day randomly
cases_to_fill_long <- cases_to_fill %>%
mutate(missing_day = map(days_observed, ~sample(all_days, 3 - .x))) %>%
unnest(missing_day) %>%
mutate(observation = 3)
#Join the filled cases with the original dataset
dataset_long_filled <- dataset_long %>%
full_join(cases_to_fill_long, by = c("case", "observation")) %>%
arrange(case, observation)
#Coalesce the two columns of day into one
dataset_long_filled |>
mutate(day = coalesce(day, missing_day)) |>
select(-days_observed, -missing_day)
# case observation day scores
#1 1 1 Monday 32
#2 1 2 Tuesday 31
#3 1 3 Friday NA
#4 2 1 Monday 29
#5 2 2 Wednesday 28
#6 2 3 Thursday 22
#7 3 1 Tuesday 45
#8 3 2 Thursday 30
#9 3 3 Tuesday NA Tuesday is repeated for this subject
#10 4 1 Wednesday 36
#11 4 3 Wednesday NA Wednesday is repeated for this subject
#12 4 3 Monday NA
#13 5 1 Monday 58
#14 5 2 Friday 37
#15 5 3 Friday NA Friday is repeated for this subject
CodePudding user response:
You can try:
library(tidyr)
library(dplyr)
dataset_long %>%
group_by(id) %>%
complete(day_name = sample(all_days[!(all_days %in% day_name)], 3-n())) %>%
arrange(id, match(day_name, all_days)) %>%
mutate(observation = row_number()) %>%
ungroup()
# A tibble: 15 × 4
id day_name observation scores
<dbl> <chr> <int> <int>
1 1 Monday 1 51
2 1 Tuesday 2 30
3 1 Thursday 3 NA
4 2 Monday 1 42
5 2 Wednesday 2 53
6 2 Thursday 3 35
7 3 Tuesday 1 25
8 3 Thursday 2 27
9 3 Friday 3 NA
10 4 Monday 1 NA
11 4 Wednesday 2 59
12 4 Thursday 3 NA
13 5 Monday 1 50
14 5 Thursday 2 NA
15 5 Friday 3 54