I want to group a dataset that has more than one level of grouping. However, the weighting of the variables in my dataset only starts after a certain date.
Therefore, I want to sum over one variable, but only after a specific date. Before that date, I want the values of that variable to stay as they are. The data look as follows:
score <- data.frame(
id = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
interval = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,
1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
category = c(1,1,2,2,3,3,4,4,5,5,1,1,2,2,3,3,4,4,5,5,
1,1,2,2,3,3,4,4,5,5,1,1,2,2,3,3,4,4,5,5),
subcategory = c(1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,
1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2),
result = c(92,92,88,88,78,78,78,78,98,98,82,82,84,84,75,
75,86,86, 64,64,95,95,96,96,63,63,96,96,69,69,78,78,
88,88, 96,96,69,69,96,96))
Since each grouping level can have several inputs, only one value is needed from each group. Therefore, I keep one value per group first and then do the grouping.
Here are my attempts:
S1 <- score %>%
distinct(id, interval,category, .keep_all = TRUE)%>%
group_by(id, interval) %>%
do(if(.$interval > 1)){summarize(sumresult = sum(result), .groups = 'drop')} else{.$interval})
I keep receiving the following message:
Error in summarize(sumresult = sum(result), .groups = "drop") :
object 'result' not found
In addition: There were 36 warnings (use warnings() to see them)
I also tried ifelse, but it is not working either.
How do I include an if statement after group_by, with the summarize statement inside the if condition?
Thank you
CodePudding user response:
You haven't shared the expected output you are looking for, but based on your attempt I think you could try:
library(dplyr)
score %>%
distinct(id, interval,category, .keep_all = TRUE)%>%
group_by(id, interval) %>%
summarise(sumresult = if(all(interval > 1)) sum(result) else result, .groups = 'drop')
#       id interval sumresult
#    <dbl>    <dbl>     <dbl>
#  1     1        1        92
#  2     1        1        88
#  3     1        1        78
#  4     1        1        78
#  5     1        1        98
#  6     1        2       391
#  7     2        1        95
#  8     2        1        96
#  9     2        1        63
# 10     2        1        96
# 11     2        1        69
# 12     2        2       427
If interval > 1, we sum result, which gives one value; otherwise we keep result as it is, which returns more than one value. That is why you have one row for interval = 2 and multiple rows for interval = 1 in the output.
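One caveat, in case you are on a recent dplyr: as far as I know, returning more than one row per group from summarise() is deprecated since dplyr 1.1.0, and reframe() is the suggested replacement. Here is a minimal sketch of the same idea with reframe(), assuming dplyr 1.1.0 or later. I rebuild score with rep() only to keep the example short (an assumption on my part; the values are the same as in your question), and the name out is just for illustration.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0, where reframe() is available

# Same data as in the question, built with rep() for brevity
score <- data.frame(
  id          = rep(c(1, 2), each = 20),
  interval    = rep(rep(c(1, 2), each = 10), times = 2),
  category    = rep(rep(1:5, each = 2), times = 4),
  subcategory = rep(c(1, 2), times = 20),
  result      = c(92, 92, 88, 88, 78, 78, 78, 78, 98, 98,
                  82, 82, 84, 84, 75, 75, 86, 86, 64, 64,
                  95, 95, 96, 96, 63, 63, 96, 96, 69, 69,
                  78, 78, 88, 88, 96, 96, 69, 69, 96, 96)
)

out <- score %>%
  # keep one row per id/interval/category combination
  distinct(id, interval, category, .keep_all = TRUE) %>%
  group_by(id, interval) %>%
  # reframe() allows any number of rows per group:
  # one summed row when interval > 1, the untouched values otherwise
  reframe(sumresult = if (all(interval > 1)) sum(result) else result)

out
```

reframe() takes no .groups argument because it always returns an ungrouped result, so this should give the same 12 rows as the summarise() version above without the deprecation warning.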