Subsetting or filtering within dplyr::summarise

Time:05-20

There are several similar questions about this, but none of them covers the same problem.

MWE:

library(dplyr)
library(lubridate)

df <- data.frame(id = 1:5,
                 type = c("a", "b", "b", "a", "b"),
                 start = dmy(c("05/05/2005", "06/06/2006", "07/07/2007", "08/08/2008", "09/09/2009")),
                 finish = dmy(c("08/08/2008", "09/09/2009", "02/02/2011", "02/02/2011", NA)),
                 not_used = c(FALSE, TRUE, FALSE, TRUE, FALSE))

I want to produce a summary, grouped by type, that includes the total number of not_used rows per type and the mean difference between start and finish, in months, for the rows where not_used is FALSE. This is what I've tried:

df%>%group_by(type)%>%
  summarise(Not_used =  sum(not_used),
            `Mean_Lifespan_of_used(months)` = mean((interval(start,finish)/months(1), na.rm= T)[not_used == F]))

This fails to parse, with an unexpected ',' error:

Error: unexpected ',' in:
"  summarise(Not_used =  sum(not_used),
            `Mean_Lifespan_of_used(months)` = mean((interval(start,finish)/months(1),"

I appreciate I could create a new column before the summarise function, but I'd like to understand what I'm doing wrong here.

Answer:

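The error is a syntax problem: na.rm = T sits inside the inner parentheses, and a parenthesised expression cannot contain a comma. Close the parentheses around the division, subset the result with [not_used == FALSE], and pass na.rm = TRUE to mean() as its own argument. The subsetting has to happen before mean() is applied, because mean() returns a single value whereas not_used has one element per row in the group. Separately, if the dates are parsed with as.Date() rather than dmy(), they need format = "%d/%m/%Y".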

library(dplyr)
library(lubridate)

df %>%
  group_by(type) %>%
  summarise(Not_used = sum(not_used),
            `Mean_Lifespan_of_used(months)` =
              mean((interval(start, finish) / months(1))[not_used == FALSE],
                   na.rm = TRUE))

Output:

# A tibble: 2 × 3
  type  Not_used `Mean_Lifespan_of_used(months)`
  <chr>    <int>                           <dbl>
1 a            1                            39.1
2 b            1                            42.8
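
The parse error itself can be reproduced without dplyr or lubridate. The following is a minimal sketch in base R; the vectors x and keep are made-up stand-ins for the interval calculation and for not_used == FALSE, not part of the original data. It shows that the failing shape is rejected by the parser, and that the corrected shape (close the parentheses, subset, then pass na.rm to mean()) evaluates as expected.

```r
# Hypothetical stand-ins: x plays the role of interval(start, finish),
# keep plays the role of not_used == FALSE.
x <- c(10, NA, 30, 40)
keep <- c(TRUE, TRUE, FALSE, TRUE)

# Same shape as the failing call: na.rm sits inside the inner parentheses.
# The code is held as a string so the parse error can be caught.
bad <- "mean((x / 2, na.rm = TRUE)[keep])"
parse_msg <- tryCatch(
  {
    parse(text = bad)
    "parsed without error"
  },
  error = function(e) conditionMessage(e)
)
print(parse_msg)  # the message reports an unexpected ','

# Corrected shape: parenthesise only the arithmetic, subset the result,
# then give na.rm to mean() as a separate argument.
good <- mean((x / 2)[keep], na.rm = TRUE)
print(good)  # 12.5, the mean of 5 and 20 after the NA is dropped
```

Because the failure happens at parse time, nothing inside summarise() is ever evaluated, which is why the error message shows only the source text up to the comma.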