Making a table that contains Mean and SD of a Dataset-CodePudding

I am using this dataset: http://www.openintro.org/stat/data/cdc.R to create a table from a subset that only contains the means and standard deviations of male participants. The table should look like this:

                  Mean                  Standard Deviation
Age:              44.27                 16.715
Height:           70.25                 3.009219
Weight:           189.3                 36.55036
Desired Weight:   178.6                 26.25121

I created a subset for males and females with this code:

mdata <- subset(cdc, cdc$gender == ("m"))             
fdata <- subset(cdc, cdc$gender == ("f"))

How should I create a table that only contains means and SDs of age, height, weight, and desired weight using these subsets?

CodePudding user response：

The data frame you provided sucked up all the memory on my laptop, and it's not needed to provide that much data to solve your problem. Here's a dplyr/tidyr solution to create a summary table grouped by categories, using the starwars dataset available with dplyr:

library(dplyr)
library(tidyr)

starwars |> 
  group_by(sex) |> 
  summarise(across(
    where(is.numeric), 
    .fns = list(Mean = mean, SD = sd), na.rm = TRUE, 
    .names = "{col}__{fn}"
  )) |> 
pivot_longer(-sex, names_to = c("var", ".value"), names_sep = "__")  

# A tibble: 15 × 4
   sex            var          Mean     SD
   <chr>          <chr>       <dbl>  <dbl>
 1 female         height      169.   15.3 
 2 female         mass         54.7   8.59
 3 female         birth_year   47.2  15.0 
 4 hermaphroditic height      175    NA   
 5 hermaphroditic mass       1358    NA   
 6 hermaphroditic birth_year  600    NA   
 7 male           height      179.   36.0 
 8 male           mass         81.0  28.2 
 9 male           birth_year   85.5 157.  
10 none           height      131.   49.1 
11 none           mass         69.8  51.0 
12 none           birth_year   53.3  51.6 
13 NA             height      181.    2.89
14 NA             mass         48    NA   
15 NA             birth_year   62    NA

CodePudding user response：

Just make a data frame of colMeans and column sd. Note, that you may also select columns.

fdata <- subset(cdc, gender == "f", select=c("age", "height", "weight", "wtdesire"))
data.frame(mean=colMeans(fdata), sd=apply(fdata, 2, sd))
#               mean        sd
# age       45.79772 17.584420
# height    64.36775  2.787304
# weight   151.66619 34.297519
# wtdesire 133.51500 18.963014

You can also use by to do it simultaneously for both groups, it's basically a combination of split and lapply. (To avoid apply when calculating column SDs, you could also use sd=matrixStats::colSds(as.matrix(fdata)) which is considerably faster.)

res <- by(cdc[c("age", "height", "weight", "wtdesire")], cdc$gender, \(x) {
  data.frame(mean=colMeans(x), sd=matrixStats::colSds(as.matrix(x)))
})
res
# cdc$gender: m
#               mean        sd
# age       44.27307 16.719940
# height    70.25165  3.009219
# weight   189.32271 36.550355
# wtdesire 178.61657 26.251215
# ------------------------------------------------------------------------------------------ 
# cdc$gender: f
#               mean        sd
# age       45.79772 17.584420
# height    64.36775  2.787304
# weight   151.66619 34.297519
# wtdesire 133.51500 18.963014

To extract only one of the data frames in the list-like object use e.g. res$m.

Usually we use aggregate for this, which you also might consider:

aggregate(cbind(age, height, weight, wtdesire) ~ gender, cdc, \(x) c(mean=mean(x), sd=sd(x))) |>
  do.call(what=data.frame)
#   gender age.mean   age.sd height.mean height.sd weight.mean weight.sd wtdesire.mean wtdesire.sd
# 1      m 44.27307 16.71994   70.251646  3.009219   189.32271  36.55036     178.61657    26.25121
# 2      f 45.79772 17.58442   64.367750  2.787304   151.66619  34.29752     133.51500    18.96301

The pipe |> call(what=data.frame) is just needed to get rid of matrix columns, which is useful in case you aim to further process the data.

Note: R >= 4.1 used.

Data:

source('https://www.openintro.org/stat/data/cdc.R')

cdc <- structure(list(genhlth = structure(c(3L, 3L, 1L, 5L, 3L, 3L), levels = c("excellent", 
"very good", "good", "fair", "poor"), class = "factor"), exerany = c(0, 
1, 0, 0, 1, 1), hlthplan = c(1, 1, 1, 1, 1, 1), smoke100 = c(1, 
0, 0, 0, 0, 1), height = c(69, 66, 73, 65, 67, 69), weight = c(224L, 
215L, 200L, 216L, 165L, 170L), wtdesire = c(224L, 140L, 185L, 
150L, 165L, 165L), age = c(73L, 23L, 35L, 57L, 81L, 83L), gender = structure(c(1L, 
2L, 1L, 2L, 2L, 1L), levels = c("m", "f"), class = "factor")), row.names = c("19995", 
"19996", "19997", "19998", "19999", "20000"), class = "data.frame")