I'm working my way through the Titanic Study in Frank Harrell's R Flow course (http://hbiostat.org/rflow/case.html) and have a question about summarizing data. The raw data (Titanic5.csv) can be downloaded from https://hbiostat.org/data/repo/titanic5.csv He uses data.table (d is a data.table) to summarize the dataset as follows:
# Create a function that drops NAs when computing the mean
# Note that the mean of a 0/1 variable is the proportion of 1s
mn <- function(x) mean(x, na.rm=TRUE)
# Create a function that counts the number of non-NA values
Nna <- function(x) sum(! is.na(x))
# This is for generality; there are no NAs in these examples
d[, .(Proportion=mn(survived), N=Nna(survived)), by=sex] # .N= # obs in by group
The result of the last command is:
sex Proportion N
1: female 0.7274678 466
2: male 0.1909846 843
Even more interesting is
d[, .(Proportion=mn(survived), N=Nna(survived)), by=.(sex,class)]
which gives
sex class Proportion N
1: female 1 0.9652778 144
2: male 1 0.3444444 180
3: male 2 0.1411765 170
4: female 2 0.8867925 106
5: male 3 0.1521298 493
6: female 3 0.4907407 216
The results are exactly what I want,but the syntax depends very strongly on the capabilities of data.table. How can I get the same results using a dataframe instead of a data table, ideallly with base R, but also with dplyr?
Sincerely
Thomas Philips
CodePudding user response:
Concerning dplyr
, the most straight-forward way for achieving this would probably be the usual combination of group_by()
and summarize()
. Summarizing the data by Sex
and Class
may be done in the following way:
d %>%
group_by(Sex, Class) %>%
summarize(
Proportion = mn(Survived),
N = Nna(Survived)
)
# Output
`summarise()` has grouped output by 'Sex'. You can
override using the `.groups` argument.
# A tibble: 6 × 4
# Groups: Sex [2]
Sex Class Proportion N
<chr> <dbl> <dbl> <int>
1 female 1 0.965 144
2 female 2 0.887 106
3 female 3 0.491 216
4 male 1 0.344 180
5 male 2 0.141 170
6 male 3 0.152 493
And just summarizing by Sex
:
d %>%
group_by(Sex) %>%
summarize(
Proportion = mn(Survived),
N = Nna(Survived)
)
# Output
# A tibble: 2 × 3
Sex Proportion N
<chr> <dbl> <int>
1 female 0.727 466
2 male 0.191 843
Finally, here also a solutions using stats
which I assume is close enough to "base R"(?). It's not the most elegant solution, i.e. maybe one can immediately assign the variables with respective variable names and functions, but it works well:
summary_mn <- stats::aggregate(Survived ~ Sex Class, data = as.data.frame(d), FUN = function(x) mn(x))
summary_nna <- stats::aggregate(Survived ~ Sex Class, data = as.data.frame(d), FUN = function(x) Nna(x))
summary_full <- merge(summary_mn, summary_nna, by = c("Sex", "Class"))
colnames(summary_full) <- c("Sex", "Class", "Proportion", "N")
summary_full
# Output
Sex Class Proportion N
1 female 1 0.9652778 144
2 female 2 0.8867925 106
3 female 3 0.4907407 216
4 male 1 0.3444444 180
5 male 2 0.1411765 170
6 male 3 0.1521298 493
CodePudding user response:
- With
base R
aggregate(Survived ~ Sex , d ,
\(x) c(Proportion = mean(x) , N = length(x)))
aggregate(Survived ~ Sex Class, d ,
\(x) c(Proportion = mean(x) , N = length(x)))
- Output
#> first
Sex Survived.Proportion Survived.N
1 female 0.7274678 466
2 male 0.1909846 843
#> second
Sex Class Survived.Proportion Survived.N
1 female 1 0.9652778 144
2 male 1 0.3444444 180
3 female 2 0.8867925 106
4 male 2 0.1411765 170
5 female 3 0.4907407 216
6 male 3 0.1521298 493