I am attempting to summarize a table similar to the table below, and calculate the count and proportion of the categorical value in eth_source_1, eth_source_2, and eth_source_3, independently of each other, then display the output in a summary.
| patient_id | eth_source_1 | eth_source_2 | eth_source_3 |
|------------|--------------|--------------|--------------|
| 1 | White | N/A | White |
| 2 | South Asian | South_Asian | South_Asian |
| 3 | Mixed | Mixed | N/A |
I have tried to group_by
each column in the same statement (group_by(eth_source_1, eth_source_2, eth_source_3)
) however this is, as expected, does not operate on each column independently, but as a matrix of all the possible combinations.
I have managed to make it work by creating each summary independently, then joining these summaries on the eth
column, but I know there is a cleaner way to do it and i'm just not seeing it. What is the cleanest method of doing this?
summary_a <- patient_ethn %>%
group_by(eth = eth_source_1) %>%
summarise(n_1 = n()) %>%
mutate (freq_1 = formattable::percent (n_1/sum(n_1)))
summary_b <- patient_ethn %>%
group_by(eth = eth_source_2) %>%
summarise(n_2 = n()) %>%
mutate (freq_2 = formattable::percent (n_2/sum(n_2)))
summary_c <- patient_ethn %>%
group_by(eth = eth_source_3) %>%
summarise(n_3 = n()) %>%
mutate (freq_3 = formattable::percent (n_3/sum(n_3)))
patient_summary <- full_join(summary_a, summary_b, by='eth')
full_join(patient_summary, summary_c, by='eth')
Sample output of the above code, and what is desired.
# A tibble: 6 × 7
eth n_1 freq_1 n_2 freq_2 n_3 freq_3
<chr> <int> <formttbl> <int> <formttbl> <int> <formttbl>
1 White #### #.##% #### #.##% #### #.##%
2 South Asian #### #.##% #### #.##% #### #.##%
3 Mixed #### #.##% #### #.##% #### #.##%
4 NA #### #.##% #### #.##% #### #.##%
Thank you
CodePudding user response:
Here is a dplyr
solution
library(dplyr)
# Making up a data frame
eth <- c("White","South Asian", "Mixed", NA)
df <- tibble(
patient_id = seq(1,50,1),
eth_source_1 = sample(eth, size = 50, replace = TRUE),
eth_source_2 = sample(eth, size = 50, replace = TRUE),
eth_source_3 = sample(eth, size = 50, replace = TRUE)
)
# Summarize each source
df %>%
pivot_longer(starts_with("eth"),
names_to = "eth_source",
values_to = "eth") %>%
group_by(eth_source) %>%
count(eth) %>%
# Add percentage
mutate(freq = scales::percent(n/sum(n))) %>%
pivot_wider(names_from = "eth_source",
values_from = c("n", "freq"),
names_vary = "slowest")