Home > Software engineering >  How to summarise columns independently
How to summarise columns independently

Time:10-13

I am attempting to summarize a table similar to the table below, and calculate the count and proportion of the categorical value in eth_source_1, eth_source_2, and eth_source_3, independently of each other, then display the output in a summary.

| patient_id | eth_source_1 | eth_source_2 | eth_source_3 |
|------------|--------------|--------------|--------------|
| 1          | White        | N/A          | White        |
| 2          | South Asian  | South_Asian  | South_Asian  |
| 3          | Mixed        | Mixed        | N/A          |

I have tried to group_by each column in the same statement (group_by(eth_source_1, eth_source_2, eth_source_3)) however this is, as expected, does not operate on each column independently, but as a matrix of all the possible combinations.

I have managed to make it work by creating each summary independently, then joining these summaries on the eth column, but I know there is a cleaner way to do it and i'm just not seeing it. What is the cleanest method of doing this?

summary_a <- patient_ethn %>%
  group_by(eth = eth_source_1) %>%
  summarise(n_1 = n()) %>%
  mutate (freq_1 = formattable::percent (n_1/sum(n_1)))

summary_b <- patient_ethn %>%
  group_by(eth = eth_source_2) %>%
  summarise(n_2 = n()) %>%
  mutate (freq_2 = formattable::percent (n_2/sum(n_2)))

summary_c <- patient_ethn %>%
  group_by(eth = eth_source_3) %>%
  summarise(n_3 = n()) %>%
  mutate (freq_3 = formattable::percent (n_3/sum(n_3)))

patient_summary <- full_join(summary_a, summary_b, by='eth')
full_join(patient_summary, summary_c, by='eth')

Sample output of the above code, and what is desired.

# A tibble: 6 × 7
  eth                      n_1   freq_1       n_2   freq_2       n_3   freq_3
  <chr>                    <int> <formttbl>   <int> <formttbl>   <int> <formttbl>
1 White                     #### #.##%         #### #.##%         #### #.##%         
2 South Asian               #### #.##%         #### #.##%         #### #.##%     
3 Mixed                     #### #.##%         #### #.##%         #### #.##%   
4 NA                        #### #.##%         #### #.##%         #### #.##%  

Thank you

CodePudding user response:

Here is a dplyr solution

library(dplyr)

# Making up a data frame
eth <- c("White","South Asian", "Mixed", NA)

df <- tibble(
  patient_id = seq(1,50,1),
  eth_source_1 = sample(eth, size = 50, replace = TRUE),
  eth_source_2 = sample(eth, size = 50, replace = TRUE),
  eth_source_3 = sample(eth, size = 50, replace = TRUE)
)

# Summarize each source
df %>%
  pivot_longer(starts_with("eth"),
               names_to = "eth_source",
               values_to = "eth") %>%
  group_by(eth_source) %>%
  count(eth) %>%
# Add percentage
  mutate(freq = scales::percent(n/sum(n))) %>%
  pivot_wider(names_from = "eth_source",
  values_from = c("n", "freq"),
  names_vary = "slowest")

  • Related