I am trying to summarize the demographic information in a dataframe and am running into some issues. For gender, participants can choose from 4 options (1, 2, 3, 4), and blanks (no response) are treated as NA values by R. I am getting the correct counts for each gender, but I run into problems when trying to obtain the mean age for each gender.
I'd like to keep the observations with NA values: those participants may not have answered the demographic questions, but they did answer other questions, so I do not want to simply remove those rows from the dataframe.
Here is what I tried:
#df$q10: what is your gender
by_gender = df %>%
  group_by(df$Q10) %>%
  dplyr::summarize(count = n(),
                   AvgAge = mean(df$Q11_1_TEXT, na.rm = TRUE))
by_gender
This returns the same value for all genders as
mean(df$Q11_1_TEXT, na.rm = TRUE)
Both the gender and age columns have NA values and I suspect this is where the issue may be? I tried adding na.rm = T but that does not seem to work. What else can I try?
Edit: Removing the df$ prefix makes the function work as expected.
Answer:
When you ask for mean(df$Q11_1_TEXT), it calculates the mean from the original, ungrouped vector, so every group gets the same overall value. If you use mean(Q11_1_TEXT) instead, it looks for Q11_1_TEXT within the grouped data frame received from the previous step, so the mean is computed separately for each group.
Compare:
library(dplyr)

mtcars %>%
  group_by(gear) %>%
  summarize(wt_ttl  = sum(wt),         # per-group sum
            wt_ttl2 = sum(mtcars$wt))  # sum over the whole, ungrouped column
# A tibble: 3 × 3
   gear wt_ttl wt_ttl2
  <dbl>  <dbl>   <dbl>
1     3   58.4    103.
2     4   31.4    103.
3     5   13.2    103.
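Applied to the question, a minimal sketch might look like the following. The data frame here is made up for illustration (the real df is not shown in the question); only the column names Q10 and Q11_1_TEXT come from the original code. It assumes dplyr is installed, and it relies on group_by() keeping NA as its own group, so the rows without a gender response are counted rather than dropped.

```r
library(dplyr)

# Hypothetical survey data, invented for this example:
# Q10 is gender (1-4, NA = no response), Q11_1_TEXT is age.
df <- data.frame(
  Q10        = c(1, 1, 2, 2, 3, NA, NA),
  Q11_1_TEXT = c(20, 30, 40, NA, 25, 50, NA)
)

by_gender <- df %>%
  group_by(Q10) %>%                            # bare column name, no df$
  summarize(
    count  = n(),                              # rows per group, NA group included
    AvgAge = mean(Q11_1_TEXT, na.rm = TRUE)    # per-group mean, missing ages ignored
  )

by_gender
```

Note that na.rm = TRUE only drops missing ages inside each group. If every age in a group is NA, the mean for that group comes back as NaN.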