Home > database >  Subsetting and Summing within categories in R
Subsetting and Summing within categories in R

Time:07-12

I am trying to subset and compute sums from within groups, based on certain conditions. All of this should be possible with base R, but I am unsure how to make it work properly.

I have a basic dataframe with many variables.

I am interested in variables A (categorical) and B (T/F),

I need to sum the number of observations/occurrences for each category of A, for which B is True. Then, I need to sum the total number of observations within each category of A, (whether B is True or False). Ultimately, I will then divide (within each category of A) the number of observations of True by the number of observations to find the share of True (within each individual category).

For example:

There are 1200 observations of A = "Category1". For 600 of these observations, B = TRUE. So the share should be 0.5 for Category1 There are 850 observations of category A = "Category2". For 10 of these observations, B = TRUE. So the share should be 0.0118 for Category2.

I recognize this should be simple, however, I am struggling to find the correct code using subset(), etc. I know that I could completely subset the data into each category as multiple independent objects, and then compute the share for each category individually, however, there are >100 categories, so this manual subsetting/computation would not be feasible. Is there a way to automatically compute the share given these conditions for each category (Categories 1-100 ), so that the manual computation can be avoided? What would be the most concise way to perform this computation?

CodePudding user response:

If you want to do it with base as asked (*), you could use aggregate (or tapply):

data <- data.frame(A = c(rep("Category1", 1200),
                         rep("Category2", 850)),
                   B = c(rep(FALSE, 600), rep(TRUE, 600),
                         rep(FALSE, 840), rep(TRUE, 10))
                   )

aggregate(B ~ A, data, mean)

Output:

          A          B
1 Category1 0.50000000
2 Category2 0.01176471

(*) With dplyr it'd look like:

library(dplyr)

data |>
  group_by(A) |>
  summarise(share = mean(B))

dplyr output as requested:

# A tibble: 2 × 2
  A          share
  <chr>      <dbl>
1 Category1 0.5   
2 Category2 0.0118

Update: Minor correction.

  • Related