(R) - Fill missing integer range by group with dplyr

Time:03-01

I am using RStudio, mostly with dplyr, and I have a data frame of users (A, B, C, ...) and the days since their first visit on which they were active (1, 2, 3, ...):

user day active
A 1 T
A 3 T
B 2 T
B 4 T

I would like to complete this list with all missing days, up to each user's current maximum (4 for user B, 3 for user A), with active set to FALSE for the added days:

user day active
A 1 T
A 2 F
A 3 T
B 1 F
B 2 T
B 3 F
B 4 T

I've been googling and chewing on this for hours now. Anybody have an idea?

CodePudding user response:

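We could group by 'user' and then pass the sequence from the min to the max of 'day' to complete() to expand the data, using the fill argument to set the 'active' column to FALSE in the new rows (by default, missing combinations are filled with NA). Because the sequence starts at each user's minimum, user B starts at day 2 in the output; replace min(day) with 1 if every user should start at day 1.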

library(dplyr)
library(tidyr)
df1 %>% 
  group_by(user) %>% 
  complete(day = min(day):max(day), fill = list(active = FALSE)) %>% 
  ungroup()

-output

# A tibble: 6 × 3
  user    day active
  <chr> <int> <lgl> 
1 A         1 TRUE  
2 A         2 FALSE 
3 A         3 TRUE  
4 B         2 TRUE  
5 B         3 FALSE 
6 B         4 TRUE  

data

df1 <- structure(list(user = c("A", "A", "B", "B"), day = c(1L, 3L, 
2L, 4L), active = c(TRUE, TRUE, TRUE, TRUE)), class = "data.frame", 
row.names = c(NA, 
-4L))
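
To match the requested output exactly, where user B also gets a FALSE row for day 1, here is a minimal sketch of the same approach with the sequence starting at 1. It assumes days are always counted from 1; the name res is only for illustration, and the sample data is repeated so the snippet runs on its own:

```r
library(dplyr)
library(tidyr)

# Same sample data as df1 above
df1 <- data.frame(user = c("A", "A", "B", "B"),
                  day = c(1L, 3L, 2L, 4L),
                  active = c(TRUE, TRUE, TRUE, TRUE),
                  stringsAsFactors = FALSE)

# Start every user at day 1 and stop at that user's own maximum day
res <- df1 %>%
  group_by(user) %>%
  complete(day = 1:max(day), fill = list(active = FALSE)) %>%
  ungroup()

print(res)
```

This should give seven rows: days 1 to 3 for user A and days 1 to 4 for user B, with active set to FALSE on the added days.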

CodePudding user response:

You can create a new data frame containing every combination of user and day, left-join your existing data frame onto it, and then set the missing values in the active column to FALSE. Something like this:

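library(dplyr)

# Grid of every user/day combination (hard-coded here: days 1 to 4 for users A and B)
fullDf <- data.frame("user" = c(rep("A", 4), rep("B", 4)), 
           "day" = rep(1:4, 2))

# Keep every grid row; days with no match get NA in 'active'
existingDf <- left_join(fullDf, existingDf, by = c("user", "day"))
# Replace those NAs with FALSE
existingDf$active <- ifelse(is.na(existingDf$active), FALSE, existingDf$active)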
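
The grid in this answer is hard-coded and gives both users days 1 to 4, so user A would also end up with a day 4 row. Below is a sketch of building the grid from the data instead, up to each user's own maximum. The names maxDay and result are illustrative, and the sample data is copied from the question so the snippet runs on its own:

```r
library(dplyr)

# Sample data from the question
existingDf <- data.frame(user = c("A", "A", "B", "B"),
                         day = c(1L, 3L, 2L, 4L),
                         active = TRUE,
                         stringsAsFactors = FALSE)

# Highest recorded day per user, as a named integer vector
maxDay <- vapply(split(existingDf$day, existingDf$user), max, integer(1))

# One row per user and day, from 1 up to that user's maximum
fullDf <- data.frame(user = rep(names(maxDay), times = maxDay),
                     day = sequence(unname(maxDay)),
                     stringsAsFactors = FALSE)

# Keep every grid row, then turn the unmatched (NA) days into FALSE
result <- left_join(fullDf, existingDf, by = c("user", "day"))
result$active <- coalesce(result$active, FALSE)

print(result)
```

Writing to a separate result object rather than overwriting existingDf keeps the original data available if the join needs to be rerun.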