I have a data frame like this:
library(dplyr)
name <- c("Bob", "Bob", "Bob", "Bob", "John", "John", "John")
count <- c(2, 3, 4, 5, 2, 3, 4)
score <- c(5, NA, NA, NA, 3, 4, 2)
my_df <- data.frame(cbind(name, count, score)) %>%
mutate(count = as.numeric(count),
score = as.numeric(score))
my_df
name count score
1 Bob 2 5
2 Bob 3 NA
3 Bob 4 NA
4 Bob 5 NA
5 John 2 3
6 John 3 4
7 John 4 2
Then I create another column by taking the product between count and score:
my_df %>%
mutate(product = count*score)
name count score product
1 Bob 2 5 10
2 Bob 3 NA NA
3 Bob 4 NA NA
4 Bob 5 NA NA
5 John 2 3 6
6 John 3 4 12
7 John 4 2 8
I want to group by name and aggregate for the sum(product)/sum(count) but I want the sum of product column to ignore any NA values in the sum (I did this below) AND I want any associated count values to be ignored in the summation. This is my current solution, but it is not right. Bob's result is calculated as 10/(2 3 4 5) = 0.71 but I want Bob's result to be 10/2 = 5.
my_df %>%
mutate(product = count*score)
group_by(name) %>%
summarize(result = sum(product, na.rm = TRUE)/sum(count))
name result
<chr> <dbl>
1 Bob 0.714
2 John 2.89
CodePudding user response:
We may need to subset the count
by the non-NA values in 'product'
library(dplyr)
my_df %>%
mutate(product = count*score) %>%
group_by(name) %>%
summarise(result = sum(product, na.rm = TRUE)/sum(count[!is.na(product)]))
-output
# A tibble: 2 × 2
name result
<chr> <dbl>
1 Bob 5
2 John 2.89
Or do a filter
before the grouping
my_df %>%
filter(complete.cases(score)) %>%
group_by(name) %>%
summarise(result = sum(score * count)/sum(count))
# A tibble: 2 × 2
name result
<chr> <dbl>
1 Bob 5
2 John 2.89