Summarizing by group using dplyr not working as expected

Time:12-01

I am trying to summarize the demographic information in a data frame and I am running into some issues. For gender, participants can choose from 4 options: 1, 2, 3, or 4. Blanks (no response) are treated as NA values by R. I get the correct counts for each gender, but the mean age for each gender comes out wrong.

I'd like to keep the observations with NA values. Those participants may not have answered the demographic questions, but they did answer other questions, so I do not want to simply remove those rows from the data frame.

Here is what I tried:

# df$Q10: what is your gender?

by_gender = df %>% 
   group_by(df$Q10)  %>% 
   dplyr::summarize(count = n(), 
                    AvgAge = mean(df$Q11_1_TEXT, na.rm = TRUE))

by_gender

This returns the same value for every gender, namely the overall mean given by

mean(df$Q11_1_TEXT, na.rm = TRUE)

Both the gender and age columns have NA values, and I suspect this may be where the issue is. I tried adding na.rm = T, but that does not seem to help. What else can I try?

Edit: Removing the df$ prefix makes the function work as expected.

CodePudding user response:

When you ask for mean(df$Q11_1_TEXT), R calculates the mean from the original, ungrouped vector, so every group gets the same value. If you use mean(Q11_1_TEXT) instead, summarize() looks for Q11_1_TEXT within the grouped data frame it received from the prior step and computes one mean per group.

Compare:

mtcars %>% 
  group_by(gear) %>% 
  summarize(wt_ttl = sum(wt), 
            wt_ttl2 = sum(mtcars$wt))

# A tibble: 3 × 3
   gear wt_ttl wt_ttl2
  <dbl>  <dbl>   <dbl>
1     3   58.4    103.
2     4   31.4    103.
3     5   13.2    103.
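
Applied to the columns from the question, the fix might look like the sketch below. The data frame here is a small made-up stand-in (the real df was not posted), so only the column names Q10 and Q11_1_TEXT come from the question. The point is that both group_by() and mean() refer to bare column names with no df$ prefix.

```r
library(dplyr)

# Hypothetical stand-in for the asker's data; the values are invented
# purely for illustration and include NAs in both columns.
df <- data.frame(
  Q10        = c(1, 1, 2, 2, 3, NA, NA),      # gender codes, NA = no response
  Q11_1_TEXT = c(20, 30, 40, NA, 25, 50, NA)  # age, NA = no response
)

by_gender <- df %>%
  group_by(Q10) %>%                                   # bare column name, no df$
  summarize(count  = n(),                             # rows per group, NAs included
            AvgAge = mean(Q11_1_TEXT, na.rm = TRUE))  # evaluated within each group

by_gender
```

Rows with a missing Q10 are not dropped: group_by() keeps them as their own NA group, so those observations stay in the data frame and still appear in the counts.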