Using ifelse in summarize after group_by in dplyr

Time:10-31

I want to group a dataset that has more than one level of grouping. However, in my dataset the variables only started being weighted after a certain date.

Therefore, I want to sum over one variable, but only after a specific date. Before that date, I want the values of that variable to stay the same. The data looks as follows:

score <- data.frame(
  id = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2), 
  interval = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,
    1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2), 
  category = c(1,1,2,2,3,3,4,4,5,5,1,1,2,2,3,3,4,4,5,5,
    1,1,2,2,3,3,4,4,5,5,1,1,2,2,3,3,4,4,5,5), 
  subcategory = c(1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,
    1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2), 
  result = c(92,92,88,88,78,78,78,78,98,98,82,82,84,84,75,
    75,86,86, 64,64,95,95,96,96,63,63,96,96,69,69,78,78,
    88,88, 96,96,69,69,96,96))

Since each grouping level can have several entries, only one value is needed from each group. Therefore, I keep only one value per group and then group.

Here are my attempts:

S1 <- score %>%
  distinct(id, interval,category, .keep_all = TRUE)%>%
  group_by(id, interval) %>%
  do(if(.$interval > 1)){summarize(sumresult = sum(result), .groups = 'drop')} else{.$interval})

I keep receiving the following message:

Error in summarize(sumresult = sum(result), .groups = "drop") : 
  object 'result' not found
In addition: There were 36 warnings (use warnings() to see them)

I also tried ifelse, but it did not work either.

How do I use an if statement after group_by, with the summarize call inside the if condition?

Thank you

CodePudding user response:

You haven't shared your expected output, but based on your attempt you could try:

library(dplyr)

score %>%
  distinct(id, interval,category, .keep_all = TRUE)%>%
  group_by(id, interval) %>%
  summarise(sumresult = if(all(interval > 1)) sum(result) else result, .groups = 'drop')

#      id interval sumresult
#   <dbl>    <dbl>     <dbl>
# 1     1        1        92
# 2     1        1        88
# 3     1        1        78
# 4     1        1        78
# 5     1        1        98
# 6     1        2       391
# 7     2        1        95
# 8     2        1        96
# 9     2        1        63
#10     2        1        96
#11     2        1        69
#12     2        2       427

If interval > 1 for the whole group, we sum result, which gives one value; otherwise we keep result as it is, which returns more than one value. That is why the output has one row for interval = 2 and multiple rows for interval = 1.
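
As a side note, here is a minimal sketch of the same idea using reframe(), which dplyr 1.1.0 and later provide for summaries that return a varying number of rows per group (newer dplyr versions warn when summarise() does this). The data frame `small` is a made-up subset for illustration, assumed to already hold one row per category. The last line hints at why ifelse() is probably not the right tool here: it is vectorised and returns a result as long as its test, so it cannot collapse a group to a single row.

```r
library(dplyr)

# Hypothetical small data set: one id, two intervals, one row per category
small <- data.frame(
  id       = c(1, 1, 1, 1),
  interval = c(1, 1, 2, 2),
  result   = c(92, 88, 82, 84)
)

# if/else is evaluated once per group, so each group can return
# either a single sum or all of its original values
out <- small %>%
  group_by(id, interval) %>%
  reframe(sumresult = if (all(interval > 1)) sum(result) else result)

# ifelse() keeps the length of its test, so the sum is repeated, not collapsed
repeated <- ifelse(c(2, 2) > 1, sum(c(82, 84)), c(82, 84))
```

With this input, `out` should have two rows for interval 1 (92 and 88) and one row for interval 2 (166), while `repeated` holds the sum twice.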
