Home > database >  Making a table that contains Mean and SD of a Dataset
Making a table that contains Mean and SD of a Dataset

Time:09-21

I am using this dataset: http://www.openintro.org/stat/data/cdc.R to create a table from a subset that only contains the means and standard deviations of male participants. The table should look like this:

                  Mean                  Standard Deviation
Age:              44.27                 16.715
Height:           70.25                 3.009219
Weight:           189.3                 36.55036
Desired Weight:   178.6                 26.25121

I created a subset for males and females with this code:

mdata <- subset(cdc, cdc$gender == ("m"))             
fdata <- subset(cdc, cdc$gender == ("f"))

How should I create a table that only contains means and SDs of age, height, weight, and desired weight using these subsets?

CodePudding user response:

The data frame you provided sucked up all the memory on my laptop, and it's not needed to provide that much data to solve your problem. Here's a dplyr/tidyr solution to create a summary table grouped by categories, using the starwars dataset available with dplyr:

library(dplyr)
library(tidyr)

starwars |> 
  group_by(sex) |> 
  summarise(across(
    where(is.numeric), 
    .fns = list(Mean = mean, SD = sd), na.rm = TRUE, 
    .names = "{col}__{fn}"
  )) |> 
pivot_longer(-sex, names_to = c("var", ".value"), names_sep = "__")  

# A tibble: 15 × 4
   sex            var          Mean     SD
   <chr>          <chr>       <dbl>  <dbl>
 1 female         height      169.   15.3 
 2 female         mass         54.7   8.59
 3 female         birth_year   47.2  15.0 
 4 hermaphroditic height      175    NA   
 5 hermaphroditic mass       1358    NA   
 6 hermaphroditic birth_year  600    NA   
 7 male           height      179.   36.0 
 8 male           mass         81.0  28.2 
 9 male           birth_year   85.5 157.  
10 none           height      131.   49.1 
11 none           mass         69.8  51.0 
12 none           birth_year   53.3  51.6 
13 NA             height      181.    2.89
14 NA             mass         48    NA   
15 NA             birth_year   62    NA   

CodePudding user response:

Just make a data frame of colMeans and column sd. Note, that you may also select columns.

fdata <- subset(cdc, gender == "f", select=c("age", "height", "weight", "wtdesire"))
data.frame(mean=colMeans(fdata), sd=apply(fdata, 2, sd))
#               mean        sd
# age       45.79772 17.584420
# height    64.36775  2.787304
# weight   151.66619 34.297519
# wtdesire 133.51500 18.963014

You can also use by to do it simultaneously for both groups, it's basically a combination of split and lapply. (To avoid apply when calculating column SDs, you could also use sd=matrixStats::colSds(as.matrix(fdata)) which is considerably faster.)

res <- by(cdc[c("age", "height", "weight", "wtdesire")], cdc$gender, \(x) {
  data.frame(mean=colMeans(x), sd=matrixStats::colSds(as.matrix(x)))
})
res
# cdc$gender: m
#               mean        sd
# age       44.27307 16.719940
# height    70.25165  3.009219
# weight   189.32271 36.550355
# wtdesire 178.61657 26.251215
# ------------------------------------------------------------------------------------------ 
# cdc$gender: f
#               mean        sd
# age       45.79772 17.584420
# height    64.36775  2.787304
# weight   151.66619 34.297519
# wtdesire 133.51500 18.963014

To extract only one of the data frames in the list-like object use e.g. res$m.

Usually we use aggregate for this, which you also might consider:

aggregate(cbind(age, height, weight, wtdesire) ~ gender, cdc, \(x) c(mean=mean(x), sd=sd(x))) |>
  do.call(what=data.frame)
#   gender age.mean   age.sd height.mean height.sd weight.mean weight.sd wtdesire.mean wtdesire.sd
# 1      m 44.27307 16.71994   70.251646  3.009219   189.32271  36.55036     178.61657    26.25121
# 2      f 45.79772 17.58442   64.367750  2.787304   151.66619  34.29752     133.51500    18.96301

The pipe |> call(what=data.frame) is just needed to get rid of matrix columns, which is useful in case you aim to further process the data.

Note: R >= 4.1 used.


Data:

source('https://www.openintro.org/stat/data/cdc.R')

or

cdc <- structure(list(genhlth = structure(c(3L, 3L, 1L, 5L, 3L, 3L), levels = c("excellent", 
"very good", "good", "fair", "poor"), class = "factor"), exerany = c(0, 
1, 0, 0, 1, 1), hlthplan = c(1, 1, 1, 1, 1, 1), smoke100 = c(1, 
0, 0, 0, 0, 1), height = c(69, 66, 73, 65, 67, 69), weight = c(224L, 
215L, 200L, 216L, 165L, 170L), wtdesire = c(224L, 140L, 185L, 
150L, 165L, 165L), age = c(73L, 23L, 35L, 57L, 81L, 83L), gender = structure(c(1L, 
2L, 1L, 2L, 2L, 1L), levels = c("m", "f"), class = "factor")), row.names = c("19995", 
"19996", "19997", "19998", "19999", "20000"), class = "data.frame")
  • Related