Home > Net >  subsetting and summarizeing data in group_by() produces wrong results
subsetting and summarizeing data in group_by() produces wrong results

Time:09-21

I thought I was confident in dplyr and base R but came across a problem that I can't understand. I am trying to group data and summarize a subset of the data.

However, I get wrong results which is concerning because I've been using this pattern a lot

I am trying to sum all the Sepal.Length's by their Species group however only those where the corresponding Sepal.Width is less than 3.5.

library(tidyverse)

## I am trying to sum all the Sepal.Length's by their Species group however only those where the corresponding Sepal.Width is less than 3.5.

###this produces NA in the newly created column

iris %>% 
  group_by(Species) %>% 
  summarize(new_col1=sum(Sepal.Length[iris$Sepal.Width<3.5]))

#### This produces a result,  however if you focus on the 'versicolor' you see a value of 166.3 

iris %>% 
  group_by(Species) %>% 
  summarize(new_col1=sum(Sepal.Length[iris$Sepal.Width<3.5],na.rm = TRUE))


#### However if you go to manually verify this amount you see a different answer (296.6)

iris %>% 
  filter(Species=="versicolor",
         Sepal.Width<3.5) %>% 
  pull(Sepal.Length) %>% sum

Is there a way to subset the data when it is inside a group_by() and summarize() so that only the data that meets additional filter criteria (similar to my first example)

CodePudding user response:

This is an easy fix. Take the $ out of the summarize. You are calling iris that is not grouped, so it is not giving the correct row indexes:

library(tidyverse)

iris %>% 
  group_by(Species) %>% 
  summarise(new_col1 = sum(Sepal.Length[Sepal.Width < 3.5]))
#> # A tibble: 3 x 2
#>   Species    new_col1
#>   <fct>         <dbl>
#> 1 setosa         135.
#> 2 versicolor     297.
#> 3 virginica      307.

As a check:


iris %>% 
  group_by(Species) %>% 
  filter(Sepal.Width < 3.5) %>%
  summarise(new_col1 = sum(Sepal.Length))
#> # A tibble: 3 x 2
#>   Species    new_col1
#>   <fct>         <dbl>
#> 1 setosa         135.
#> 2 versicolor     297.
#> 3 virginica      307.
  • Related