I am looking for a one-line dplyr solution for filtering a data.frame on a score and afterwards calculating the relative count frequency.
The classic R solution would be the following:
daf <- data.frame(score=c(1:40),count=sample(1:1000,40,replace=T))
select <- daf$score >= 30
res <- c(sum(daf$count[select]),sum(daf$count[select])/sum(daf$count))
My first dplyr solution duplicates results:
daf %>%
mutate(total=sum(count)) %>%
filter(score >= 30) %>%
summarise(
sum_count=sum(count),
sum_rel=sum(count)/total
)
Any better ideas for the dplyr code ?
CodePudding user response:
Assuming the total
column will always be the same value, use max(total)
instead of total
in summarise
:
daf %>%
mutate(total = sum(count)) %>%
filter(score >= 30) %>%
summarise(sum_count = sum(count),
sum_rel = sum(count) / max(total))
sum_count sum_rel
1 5535 0.2621608
Edit: As per @GregorThomas's comment, using first()
instead of max()
would be more efficient for large data sets.
daf %>%
mutate(total = sum(count)) %>%
filter(score >= 30) %>%
summarise(sum_count = sum(count),
sum_rel = sum(count) / first(total))
CodePudding user response:
library(dplyr)
daf %>%
summarize(
sum_count = sum(count[score >= 30]),
sum_rel = sum_count / sum(count)
)
# sum_count sum_rel
# 1 5661 0.301053