I thought I was confident in dplyr and base R but came across a problem that I can't understand. I am trying to group data and summarize a subset of the data.
However, I get wrong results which is concerning because I've been using this pattern a lot
I am trying to sum all the Sepal.Length's by their Species group however only those where the corresponding Sepal.Width is less than 3.5.
library(tidyverse)
## I am trying to sum all the Sepal.Length's by their Species group however only those where the corresponding Sepal.Width is less than 3.5.
###this produces NA in the newly created column
iris %>%
group_by(Species) %>%
summarize(new_col1=sum(Sepal.Length[iris$Sepal.Width<3.5]))
#### This produces a result, however if you focus on the 'versicolor' you see a value of 166.3
iris %>%
group_by(Species) %>%
summarize(new_col1=sum(Sepal.Length[iris$Sepal.Width<3.5],na.rm = TRUE))
#### However if you go to manually verify this amount you see a different answer (296.6)
iris %>%
filter(Species=="versicolor",
Sepal.Width<3.5) %>%
pull(Sepal.Length) %>% sum
Is there a way to subset the data when it is inside a group_by()
and summarize()
so that only the data that meets additional filter criteria (similar to my first example)
CodePudding user response:
This is an easy fix. Take the $
out of the summarize. You are calling iris
that is not grouped, so it is not giving the correct row indexes:
library(tidyverse)
iris %>%
group_by(Species) %>%
summarise(new_col1 = sum(Sepal.Length[Sepal.Width < 3.5]))
#> # A tibble: 3 x 2
#> Species new_col1
#> <fct> <dbl>
#> 1 setosa 135.
#> 2 versicolor 297.
#> 3 virginica 307.
As a check:
iris %>%
group_by(Species) %>%
filter(Sepal.Width < 3.5) %>%
summarise(new_col1 = sum(Sepal.Length))
#> # A tibble: 3 x 2
#> Species new_col1
#> <fct> <dbl>
#> 1 setosa 135.
#> 2 versicolor 297.
#> 3 virginica 307.