Home > Software design >  Combining factor levels whilst keeping original
Combining factor levels whilst keeping original

Time:03-22

I would like to combine outcomes for participant groups, whilst keeping the separate levels. For example, I am aggregating data across multiple columns using:

aggregate(list(x$variable1, x$variable 2, x$variable 3), by = list (x$date, x$site, x$participanttype), FUN=mean, na-rm = TRUE)

There are 2 levels to participant type (SU and staff). The above gives me the aggregated outcomes for SU and staff but I also want a total that combines SU and staff. I've seen the following; however, they are only summing one variable (Y) and I have hundreds. It would be open to too many errors if I had to input them all - never mind the time:

Data %>%
  add_row(X = "Total", Y = sum(.$Y)) %>%
  mutate(X = factor(X)) 

How can I retain outcomes for SU and staff whilst having an additional 'total' output? I include a minimal dataset below.

Data<-data.frame(
X = factor(c("SU", "Staff")),
V1 = c(1000, 2000, 3000, 4000))
V2 = c(4000, 3000, 9000, 1000))

Thank you.

CodePudding user response:

Here is a solution using dplyr::summarize() instead of aggregate().

library(tidyverse)
Data <- data.frame(
  X = factor(c("SU", "Staff")),
  V1 = c(1000, 2000, 3000, 4000),
  V2 = c(4000, 3000, 9000, 1000)
)

Data %>%
  group_by(X) %>%
  summarize_all(sum) %>% 
  bind_rows(prepend(summarize_if(., is.numeric, sum), list(X = "total")))
#> # A tibble: 3 × 3
#>   X        V1    V2
#>   <chr> <dbl> <dbl>
#> 1 Staff  6000  4000
#> 2 SU     4000 13000
#> 3 total 10000 17000

Created on 2022-03-21 by the reprex package (v2.0.1)

CodePudding user response:

Since you had written that you wanted to keep the original values and collect the sum values, I thought I would add to the great information you were already given. However, regardless of how many columns you have, this works. You simply have to select the range of columns by selecting the first and last columns to aggregate. That's shown here as V1:V2. It is odd here with only two columns, but this is dynamic and can accommodate any number of columns.

I'm going to guess that there is a better way of doing this, but this works.

Data %>% 
  rowid_to_column() %>%             # original order; perception of duplicates
  pivot_longer(cols = c(V1:V2), 
               names_to = "name",   # now all groups are in one column
               values_to = "values") %>%
  group_by(X, name) %>% 
  summarise(id = rowid,
            Tot = sum(values),      # collect the total for all groups
            values = values, 
            .groups = "drop") %>% 
  pivot_wider(names_from = name, 
              values_from = c(Tot, values), 
              names_glue = "{name}_{.value}") %>%  # groups to columns
  arrange(id) %>% select(-id)                      # return to original order
# # A tibble: 4 × 5
#   X     V1_Tot V2_Tot V1_values V2_values
#   <fct>  <dbl>  <dbl>     <dbl>     <dbl>
# 1 SU      4000  13000      1000      4000
# 2 Staff   6000   4000      2000      3000
# 3 SU      4000  13000      3000      9000
# 4 Staff   6000   4000      4000      1000 

Data  # original data frame
#       X   V1   V2
# 1    SU 1000 4000
# 2 Staff 2000 3000
# 3    SU 3000 9000
# 4 Staff 4000 1000 
  • Related