Home > database >  Add rows for missing data in a long dataset
Add rows for missing data in a long dataset

Time:02-04

I have a dataset in long format. Every subject in the dataset was observed a number of times (from 1 to 3) during the week. The observations occurred once a day from Monday to Friday based on the subject's availability. If a case was observed less than 3 days, I need to generate empty rows for the missing observation(s). If a case was observed on Monday and Friday, the third observation should randomly occur on Tuesday, Wednesday, or Thursday. The code below works well except for one fundamental flaw: when days are randomly generated, they might overlap with days in which subjects were already observed. This should not happen. Feel free to suggest completely different code, should my solution be too complex to edit. Thanks!

#Toy dataset
dataset_long <- data.frame(
  id = c(1, 1, 2, 2, 2, 3, 3, 4, 5, 5),
  observation = c(1, 2, 1, 2, 3, 1, 2, 1, 1, 2),
  day_name = c("Monday", "Tuesday", "Monday", "Wednesday", "Thursday", "Tuesday", "Thursday", "Wednesday", "Monday", "Friday"),
  scores = sample(20:60, 10)
)

#   case observation       day scores
#1     1           1    Monday     32
#2     1           2   Tuesday     31
#3     2           1    Monday     29
#4     2           2 Wednesday     28
#5     2           3  Thursday     22
#6     3           1   Tuesday     45
#7     3           2  Thursday     30
#8     4           1 Wednesday     36
#9     5           1    Monday     58
#10    5           2    Friday     37

#Identify cases with fewer than 3 observations
cases_to_fill <- dataset_long %>%
  group_by(case) %>%
  summarize(days_observed = n()) %>%
  filter(days_observed < 3) %>%
  ungroup()

#Create vector with day names
all_days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

#Fill the missing observation day randomly
cases_to_fill_long <- cases_to_fill %>%
  mutate(missing_day = map(days_observed, ~sample(all_days, 3 - .x))) %>%
  unnest(missing_day) %>%
  mutate(observation = 3)

#Join the filled cases with the original dataset
dataset_long_filled <- dataset_long %>%
  full_join(cases_to_fill_long, by = c("case", "observation")) %>%
  arrange(case, observation)

#Coalesce the two columns of day into one
dataset_long_filled |> 
  mutate(day = coalesce(day, missing_day)) |> 
  select(-days_observed, -missing_day)

#   case observation       day scores
#1     1           1    Monday     32
#2     1           2   Tuesday     31
#3     1           3    Friday     NA
#4     2           1    Monday     29
#5     2           2 Wednesday     28
#6     2           3  Thursday     22
#7     3           1   Tuesday     45
#8     3           2  Thursday     30
#9     3           3   Tuesday     NA Tuesday is repeated for this subject
#10    4           1 Wednesday     36
#11    4           3 Wednesday     NA Wednesday is repeated for this subject
#12    4           3    Monday     NA
#13    5           1    Monday     58
#14    5           2    Friday     37
#15    5           3    Friday     NA Friday is repeated for this subject

CodePudding user response:

You can try:

library(tidyr)
library(dplyr)


dataset_long %>%
  group_by(id) %>%
  complete(day_name = sample(all_days[!(all_days %in% day_name)], 3-n())) %>%
  arrange(id, match(day_name, all_days)) %>%
  mutate(observation = row_number()) %>%
  ungroup()

# A tibble: 15 × 4
      id day_name  observation scores
   <dbl> <chr>           <int>  <int>
 1     1 Monday              1     51
 2     1 Tuesday             2     30
 3     1 Thursday            3     NA
 4     2 Monday              1     42
 5     2 Wednesday           2     53
 6     2 Thursday            3     35
 7     3 Tuesday             1     25
 8     3 Thursday            2     27
 9     3 Friday              3     NA
10     4 Monday              1     NA
11     4 Wednesday           2     59
12     4 Thursday            3     NA
13     5 Monday              1     50
14     5 Thursday            2     NA
15     5 Friday              3     54
  • Related