dplyr::mutate: Puzzling outcome of sum()-CodePudding

upon learning some data science tools in R, I stumbled upon the following "error": (this is code from chapter 12 of "R for Data Science" by Wickham & Grolemund):

library(forcats)
library(tidyverse)

by_age <- gss_cat %>% 
  filter(!is.na(age)) %>% 
  group_by(age, marital) %>% 
  count() %>%  
  mutate(prop= n/sum(n))

In contrast to the statement in the book, I incorrectly get a prop column containing only 1 (e.g. 100%) for each row. Now, I have tried to modify this example and got the following additional information:

If I am more explicit via calling base::sum instead of sum the error still remains
If I compute the sum outside of the pipe in an extra statement and then use the computed number in the statement (i.e. instead of n/sum(n) -> n/21407), I get the correct proportions.

Can anyone maybe help me and explain the cause of the problem?

Thank you for any help and advices!

R-Version 4.2.1. RStudio: "Spotted Wakerobin" Release (7872775e, 2022-07-22) for Ubuntu Bionic

CodePudding user response：

I suspect it may be necessary to ungroup() the data.frame before calculating prop and other variables.

Note that n and sum_n are the same here, leading to prop = 1:

library(forcats)
library(tidyverse)

options(scipen = 999) # This makes it easy to inspect

gss_cat %>% 
  filter(!is.na(age)) %>% 
  group_by(age, marital) %>% 
  count() %>% 
  mutate(
    n = n,
    sum_n = sum(n),
    prop= n/sum(n)) %>% 
  as.data.frame()

#     age       marital   n sum_n prop
# 1    18 Never married  89    89    1
# 2    18       Married   2     2    1
# 3    19 Never married 234   234    1
# 4    19      Divorced   3     3    1
# 5    19       Widowed   1     1    1
# 6    19       Married  11    11    1
# 7    20 Never married 227   227    1
# 8    20     Separated   1     1    1
# 9    20      Divorced   2     2    1
# 10   20       Married  21    21    1

But after using ungroup() first it gives the correct proportion:

gss_cat %>% 
  filter(!is.na(age)) %>% 
  group_by(age, marital) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(
    n = n,
    sum_n = sum(n),
    prop= n/sum(n)) %>% 
  as.data.frame()

#     age       marital   n sum_n          prop
# 1    18 Never married  89 21407 0.00415751857
# 2    18       Married   2 21407 0.00009342738
# 3    19 Never married 234 21407 0.01093100388
# 4    19      Divorced   3 21407 0.00014014108
# 5    19       Widowed   1 21407 0.00004671369
# 6    19       Married  11 21407 0.00051385061
# 7    20 Never married 227 21407 0.01060400803
# 8    20     Separated   1 21407 0.00004671369
# 9    20      Divorced   2 21407 0.00009342738
# 10   20       Married  21 21407 0.00098098753

CodePudding user response：

base::sum() and sum() are equal. <library name>:: is used to find a function inside the specific library without loading the whole library. However, base is loaded by default, so sum() function is already loaded from there.
What are you doing in this code is:

a. Filter non-NaN values (removing missings)

b. Grouping all the rows by age and marital value. So all the following operations will be done based in these groups.

c. count() returns the number of samples per these groups. So it is like writting count(age, marital).

d. You are creating a new variable based on n value for the sample and the sum(n) for values in that group which returns 1 because you have summarised the groups with count().

If you want to know the proportion of every group / sum of all n values, you should ungroup() the values before calling mutate(). Or, actually, you don't need to group_by the data:

by_age <- gss_cat %>% filter(!is.na(age)) %>%
  count(age, marital) %>%  mutate(prop= n/sum(n))

Returns:

# A tibble: 351 × 4
     age marital           n      prop
   <int> <fct>         <int>     <dbl>
 1    18 Never married    89 0.00416  
 2    18 Married           2 0.0000934
 3    19 Never married   234 0.0109   
 4    19 Divorced          3 0.000140 
 5    19 Widowed           1 0.0000467
 6    19 Married          11 0.000514 
 7    20 Never married   227 0.0106   
 8    20 Separated         1 0.0000467
 9    20 Divorced          2 0.0000934
10    20 Married          21 0.000981 
# … with 341 more rows