Home > Enterprise >  Summary statistics with grouping by multiple columns dataframe vs. data.table vs. dplyr
Summary statistics with grouping by multiple columns dataframe vs. data.table vs. dplyr

Time:08-14

I'm working my way through the Titanic Study in Frank Harrell's R Flow course (http://hbiostat.org/rflow/case.html) and have a question about summarizing data. The raw data (Titanic5.csv) can be downloaded from https://hbiostat.org/data/repo/titanic5.csv He uses data.table (d is a data.table) to summarize the dataset as follows:

# Create a function that drops NAs when computing the mean
# Note that the mean of a 0/1 variable is the proportion of 1s
mn <- function(x) mean(x, na.rm=TRUE)
# Create a function that counts the number of non-NA values
Nna <- function(x) sum(! is.na(x))
# This is for generality; there are no NAs in these examples
d[, .(Proportion=mn(survived), N=Nna(survived)), by=sex]    # .N= # obs in by group

The result of the last command is:

      sex Proportion   N
1: female  0.7274678 466
2:   male  0.1909846 843

Even more interesting is

d[, .(Proportion=mn(survived), N=Nna(survived)), by=.(sex,class)]

which gives

      sex class Proportion   N
1: female     1  0.9652778 144
2:   male     1  0.3444444 180
3:   male     2  0.1411765 170
4: female     2  0.8867925 106
5:   male     3  0.1521298 493
6: female     3  0.4907407 216

The results are exactly what I want,but the syntax depends very strongly on the capabilities of data.table. How can I get the same results using a dataframe instead of a data table, ideallly with base R, but also with dplyr?

Sincerely

Thomas Philips

CodePudding user response:

Concerning dplyr, the most straight-forward way for achieving this would probably be the usual combination of group_by() and summarize(). Summarizing the data by Sex and Class may be done in the following way:

d %>%  
  group_by(Sex, Class) %>% 
  summarize(
    Proportion = mn(Survived), 
    N = Nna(Survived)
  )

# Output
`summarise()` has grouped output by 'Sex'. You can
override using the `.groups` argument.
# A tibble: 6 × 4
# Groups:   Sex [2]
  Sex    Class Proportion     N
  <chr>  <dbl>      <dbl> <int>
1 female     1      0.965   144
2 female     2      0.887   106
3 female     3      0.491   216
4 male       1      0.344   180
5 male       2      0.141   170
6 male       3      0.152   493

And just summarizing by Sex:

d %>%  
  group_by(Sex) %>% 
  summarize(
    Proportion = mn(Survived), 
    N = Nna(Survived)
  )

# Output
# A tibble: 2 × 3
  Sex    Proportion     N
  <chr>       <dbl> <int>
1 female      0.727   466
2 male        0.191   843

Finally, here also a solutions using stats which I assume is close enough to "base R"(?). It's not the most elegant solution, i.e. maybe one can immediately assign the variables with respective variable names and functions, but it works well:

summary_mn <- stats::aggregate(Survived ~ Sex   Class, data = as.data.frame(d), FUN = function(x) mn(x))
summary_nna <- stats::aggregate(Survived ~ Sex   Class, data = as.data.frame(d), FUN = function(x) Nna(x))

summary_full <- merge(summary_mn, summary_nna, by = c("Sex", "Class"))
colnames(summary_full) <- c("Sex", "Class", "Proportion", "N")

summary_full

# Output
     Sex Class Proportion   N
1 female     1  0.9652778 144
2 female     2  0.8867925 106
3 female     3  0.4907407 216
4   male     1  0.3444444 180
5   male     2  0.1411765 170
6   male     3  0.1521298 493

CodePudding user response:

  • With base R
aggregate(Survived ~ Sex , d ,
          \(x) c(Proportion = mean(x) , N = length(x)))

aggregate(Survived ~ Sex   Class, d , 
          \(x) c(Proportion = mean(x) , N = length(x)))
  • Output
#> first

    Sex Survived.Proportion  Survived.N
1 female           0.7274678    466
2   male           0.1909846    843

#> second

    Sex Class Survived.Proportion  Survived.N
1 female     1           0.9652778   144
2   male     1           0.3444444   180
3 female     2           0.8867925   106
4   male     2           0.1411765   170
5 female     3           0.4907407   216
6   male     3           0.1521298   493

  • Related