Home > Mobile >  Including missing values in summarise output
Including missing values in summarise output

Time:04-21

I am trying to still keep all rows in a summarise output even when one of the columns does not exist. I have a data frame that looks like this:

dat <- data.frame(id=c(1,1,2,2,2,3),
                  seq_num=c(0:1,0:2,0:0),
                  time=c(4,5,6,7,8,9))

I then need to summarize by all ids, where id is a row and there is a column for the first seq_num and second one. Even if the second one doesn't exist, I'd still like that row to be maintained, with an NA in that slot. I've tried the answers in this answer, but they are not working.

dat %>% 
  group_by(id, .drop=FALSE) %>% 
  summarise(seq_0_time = time[seq_num==0],
            seq_1_time = time[seq_num==1])

outputs

     id seq_0_time seq_1_time
  <dbl>      <dbl>      <dbl>
1     1          4          5
2     2          6          7

I would still like a 3rd row, though, with seq_0_time=9, and seq_1_time=NA since it doesn't exist.

How can I do this?

CodePudding user response:

This can actually be pretty easily solved using reshape.

> reshape(dat, timevar='seq_num', idvar = 'id', direction = 'wide')
  id time.0 time.1 time.2
1  1      4      5     NA
3  2      6      7      8
6  3      9     NA     NA

CodePudding user response:

My understanding is that you must use complete() on both the seq_num and id variables to achieve your desired result:

library(tidyverse)
dat <- data.frame(id=c(1,1,2,2,2,3),
                  seq_num=c(0:1,0:2,0:0),
                  time=c(4,5,6,7,8,9)) %>%
  complete(seq_num = seq_num,
           id = id)
dat %>% 
  group_by(id, .drop=FALSE) %>% 
  summarise(seq_0_time = time[seq_num==0],
            seq_1_time = time[seq_num==1])
#> # A tibble: 3 x 3
#>      id seq_0_time seq_1_time
#>   <dbl>      <dbl>      <dbl>
#> 1     1          4          5
#> 2     2          6          7
#> 3     3          9         NA

Created on 2022-04-20 by the reprex package (v2.0.1)

  • Related