How to efficiently calculate the sum and relative count frequency?

I am looking for a one-line dplyr solution that filters a data.frame on a score and then calculates the count sum and its relative frequency.

The classic R solution would be the following:

daf <- data.frame(score = 1:40, count = sample(1:1000, 40, replace = TRUE))
keep <- daf$score >= 30                          # logical index for the cutoff
res <- c(sum(daf$count[keep]),                   # sum of the filtered counts
         sum(daf$count[keep]) / sum(daf$count))  # its share of the grand total

My first dplyr solution duplicates the result row, because total is still a full-length column inside summarise():

daf %>% 
  mutate(total=sum(count)) %>% 
  filter(score >= 30) %>% 
  summarise(
    sum_count=sum(count),
    sum_rel=sum(count)/total
  )
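
A quick check shows why (with dplyr >= 1.0, summarise() may return several rows): after filter(), total still has one element per remaining row, so the scalar sums get recycled against it.

# total keeps one element per remaining row (11 here, for scores 30:40),
# so summarise() recycles sum(count) against it and emits 11 identical rows.
daf %>%
  mutate(total = sum(count)) %>%
  filter(score >= 30) %>%
  summarise(len_total = length(total))  # 11, not 1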

Any better ideas for the dplyr code?

CodePudding user response:

Assuming the total column will always hold the same value, use max(total) instead of total in summarise so the expression collapses to a single value:

daf %>% 
  mutate(total = sum(count)) %>%
  filter(score >= 30) %>%
  summarise(sum_count = sum(count),
            sum_rel = sum(count) / max(total))

  sum_count   sum_rel
1      5535 0.2621608

Edit: As per @GregorThomas's comment, using first() instead of max() would be more efficient for large data sets, since first() takes the first element directly rather than scanning the whole column.

daf %>% 
  mutate(total = sum(count)) %>%
  filter(score >= 30) %>%
  summarise(sum_count = sum(count),
            sum_rel = sum(count) / first(total))
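
As a sanity check (a minimal sketch, assuming the same daf as above): total is constant within the filtered rows, so max(total) and first(total) collapse to the same scalar and both pipelines return identical results.

daf %>%
  mutate(total = sum(count)) %>%
  filter(score >= 30) %>%
  summarise(same = max(total) == first(total))  # TRUE: both yield one value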

CodePudding user response:

library(dplyr)
daf %>%
  summarize(
    sum_count = sum(count[score >= 30]),  # sum of counts at or above the cutoff
    sum_rel = sum_count / sum(count)      # its share of the grand total
  )
#   sum_count  sum_rel
# 1      5661 0.301053
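
A note on why this works: summarize() evaluates its expressions in order, so sum_count already exists when sum_rel is computed. The base R cross-check below should reproduce the same numbers on the same daf.

# Base R equivalent of the one-liner above.
with(daf, {
  sum_count <- sum(count[score >= 30])
  c(sum_count = sum_count, sum_rel = sum_count / sum(count))
})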