While learning some data science tools in R, I stumbled upon the following "error" (this is code from chapter 12 of "R for Data Science" by Wickham & Grolemund):
library(forcats)
library(tidyverse)
by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  group_by(age, marital) %>%
  count() %>%
  mutate(prop = n / sum(n))
Contrary to what the book states, I get a prop column containing only 1 (i.e. 100%) for each row. I have tried modifying this example and found the following:
- If I am more explicit by calling base::sum instead of sum, the error remains.
- If I compute the sum outside of the pipe in an extra statement and then use the computed number in the statement (i.e. n/21407 instead of n/sum(n)), I get the correct proportions.
Can anyone help me and explain the cause of the problem?
Thank you for any help and advice!
R-Version 4.2.1. RStudio: "Spotted Wakerobin" Release (7872775e, 2022-07-22) for Ubuntu Bionic
CodePudding user response:
I suspect it may be necessary to ungroup() the data.frame before calculating prop and other variables.
Note that n and sum_n are the same here, leading to prop = 1:
library(forcats)
library(tidyverse)
options(scipen = 999) # This makes it easy to inspect
gss_cat %>%
  filter(!is.na(age)) %>%
  group_by(age, marital) %>%
  count() %>%
  mutate(
    n = n,
    sum_n = sum(n),
    prop = n / sum(n)) %>%
  as.data.frame()
#    age       marital   n sum_n prop
# 1   18 Never married  89    89    1
# 2   18       Married   2     2    1
# 3   19 Never married 234   234    1
# 4   19      Divorced   3     3    1
# 5   19       Widowed   1     1    1
# 6   19       Married  11    11    1
# 7   20 Never married 227   227    1
# 8   20     Separated   1     1    1
# 9   20      Divorced   2     2    1
# 10  20       Married  21    21    1
But after using ungroup() first, it gives the correct proportion:
gss_cat %>%
  filter(!is.na(age)) %>%
  group_by(age, marital) %>%
  count() %>%
  ungroup() %>%
  mutate(
    n = n,
    sum_n = sum(n),
    prop = n / sum(n)) %>%
  as.data.frame()
#    age       marital   n sum_n          prop
# 1   18 Never married  89 21407 0.00415751857
# 2   18       Married   2 21407 0.00009342738
# 3   19 Never married 234 21407 0.01093100388
# 4   19      Divorced   3 21407 0.00014014108
# 5   19       Widowed   1 21407 0.00004671369
# 6   19       Married  11 21407 0.00051385061
# 7   20 Never married 227 21407 0.01060400803
# 8   20     Separated   1 21407 0.00004671369
# 9   20      Divorced   2 21407 0.00009342738
# 10  20       Married  21 21407 0.00098098753
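One way to check whether a result is still grouped is group_vars() from dplyr. Below is a minimal sketch on a small made-up data frame (toy and its values are hypothetical stand-ins for gss_cat, not data from the question):

```r
library(dplyr)

# Hypothetical toy data standing in for gss_cat
toy <- tibble(
  age     = c(18, 18, 18, 19, 19),
  marital = c("Never married", "Never married", "Married",
              "Never married", "Divorced")
)

counted <- toy %>%
  group_by(age, marital) %>%
  count()

# count() on grouped input keeps the grouping of its input
group_vars(counted)            # "age" "marital"

# After ungroup() no grouping variables remain
group_vars(ungroup(counted))   # character(0)
```

If group_vars() still lists age and marital, any sum(n) inside mutate() is computed per group rather than over the whole table.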
CodePudding user response:
base::sum() and sum() are the same function. The <package name>:: prefix is used to call a function from a specific package without attaching the whole package. However, base is attached by default, so sum() is already found there.
What you are doing in this code is:
a. Keep only the rows where age is not NA (removing missing values).
b. Group all the rows by age and marital. All following operations are done within these groups.
c. count() returns the number of rows in each group. The counts are the same as with count(age, marital), but the result stays grouped by age and marital.
d. You create a new variable by dividing n for each row by sum(n) over its group. That ratio is always 1, because count() has already summarised each group into a single row, so sum(n) equals n.
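The effect in step d can be reproduced on a tiny example. This is only a sketch: toy is hypothetical data made up for illustration, and group_size() is used here just to show that every group contains one row after count().

```r
library(dplyr)

# Hypothetical toy data standing in for gss_cat
toy <- tibble(
  age     = c(18, 18, 18, 19, 19),
  marital = c("Never married", "Never married", "Married",
              "Never married", "Divorced")
)

counted <- toy %>%
  group_by(age, marital) %>%
  count()

# Each (age, marital) group has been collapsed to a single row
group_size(counted)

# Grouped: sum(n) is taken within each one-row group, so prop is always 1
grouped_prop <- counted %>%
  mutate(prop = n / sum(n))

# Ungrouped: sum(n) is the total over all rows (5 in this toy data)
overall_prop <- counted %>%
  ungroup() %>%
  mutate(prop = n / sum(n))
```

Comparing grouped_prop$prop with overall_prop$prop shows the same difference as in the question: all ones in the first case, shares of the total in the second.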
If you want each group's share of the total of all n values, you should ungroup() the data before calling mutate(). Or, actually, you don't need group_by() at all:
by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  mutate(prop = n / sum(n))
Returns:
# A tibble: 351 × 4
     age marital           n      prop
   <int> <fct>         <int>     <dbl>
 1    18 Never married    89 0.00416
 2    18 Married           2 0.0000934
 3    19 Never married   234 0.0109
 4    19 Divorced          3 0.000140
 5    19 Widowed           1 0.0000467
 6    19 Married          11 0.000514
 7    20 Never married   227 0.0106
 8    20 Separated         1 0.0000467
 9    20 Divorced          2 0.0000934
10    20 Married          21 0.000981
# … with 341 more rows
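If the denominator you actually want is the total within each age rather than the overall total (this is an assumption about the goal; the question does not say so), one option is to count first and then group by age only. A sketch on hypothetical toy data, with toy and by_age_toy being made-up names:

```r
library(dplyr)

# Hypothetical toy data standing in for gss_cat
toy <- tibble(
  age     = c(18, 18, 18, 19, 19),
  marital = c("Never married", "Never married", "Married",
              "Never married", "Divorced")
)

by_age_toy <- toy %>%
  count(age, marital) %>%        # ungrouped counts per age/marital pair
  group_by(age) %>%              # sum(n) now means the total within each age
  mutate(prop = n / sum(n)) %>%
  ungroup()                      # drop the grouping so later steps are not affected

# Proportions add up to 1 within each age
by_age_toy %>%
  group_by(age) %>%
  summarise(total = sum(prop))
```

The only difference from the code above is which rows sum(n) runs over: all rows when the data is ungrouped, or the rows of one age when grouped by age.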