I am working with the R programming language.
I have the following dataset:
set.seed(123)
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
gender <- as.factor(gender)
status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
status <- as.factor(status )
country <- c("A", "B", "C", "D")
country <- sample(country, 5000, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25))
country <- as.factor(country)
################
disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
disease <- as.factor(disease)
###################
my_data = data.frame(gender, status, disease, country)
I want to find out the relative percentage of each unique group of factors that have the disease vs do not have the disease.
As an example:
- What percentage of Male Immigrants from Country A have the disease vs don't have the disease
- What percentage of Male Citizens from Country A have the disease vs don't have the disease (both these percentages should add to 1)
- etc.
I tried to do this with the following code:
# https://stackoverflow.com/questions/24576515/relative-frequencies-proportions-with-dplyr
library(dplyr)
step_1 = my_data %>% group_by (gender, status, country, disease) %>%
summarise (n=n()) %>%
mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%"))
`summarise()` has grouped output by 'gender', 'status', 'country'. You can override using the
`.groups` argument.
# A tibble: 32 x 6
# Groups: gender, status, country [16]
gender status country disease n rel.freq
<fct> <fct> <fct> <fct> <int> <chr>
1 Female Citizen A No 285 60%
2 Female Citizen A Yes 193 40%
Now (assuming this is correct), I am trying to make some modifications to this result - this should reduce the number of rows by half (i.e. two rows in step_1 for yes and no are now combined into a single row):
step_2 = step_1 %>%
group_by(gender, status, country) %>%
summarize(disease = first(disease),
# number of people in this row that do not have the disease
n_no = sum(disease == "No"),
# number of people in this row that do have the disease
n_yes = sum(disease == "Yes"),
# relative percent of people in this row that do not have the disease
n_no_rel_freq = paste(round(sum(disease == "No") / sum(n) * 100), "%"),
# relative percent of people in this row that do have the disease
n_yes_rel_freq = paste(round(sum(disease == "Yes") / sum(n) * 100), "%"),
# overall percent of all people in this row relative to entire population
overall_percent = sum(n) / sum(step_1$n))
The code seems to run - but many of the percentages are now 0:
# A tibble: 16 x 9
# Groups: gender, status [4]
gender status country disease n_no n_yes n_no_rel_freq n_yes_rel_freq overall_percent
<fct> <fct> <fct> <fct> <int> <int> <chr> <chr> <dbl>
1 Female Citizen A No 1 0 0 % 0 % 0.102
2 Female Citizen B No 1 0 0 % 0 % 0.092
Can someone please show me how to fix this?
Thanks!
Note: The final result should look something like this
# desired result (sample)
gender status country n_no n_yes n_no_rel_freq n_yes_rel_freq total overall_percent
1 female citizen A 285 193 0.6 0.4 478 0.0956
CodePudding user response:
I would solve this with the help of data.table
:
install(data.table)
setDT(my_data)
my_data[, .N, by = .(gender, status, country, disease)][
, dcast(.SD, gender status country~disease, value.var = "N")][
, rel.freq := Yes/(No Yes)][]
What is in there:
- You install data.table
- You convert
my_data
to a data.table (setDT(my_data)
) - With
my_data[, .N, by = .(gender, status, country, disease)]
you count cases (.N
) grouped by all the variables afterby=
. - With
[, dcast(.SD, gender status country~disease, value.var = "N")]
you counvert your long table into a wide one, leaving the levels ofdisease
as new column headers and summing onN
, which is the number of cases. - With
[, rel.freq := Yes/(No Yes)]
you create a new variablerel.freq
that is the result of dividing the positive cases in the total cases. - With the
[]
you display the result to screen (you don't need this step, if you want to assign the result to a new object).
This is the result I obtained:
gender status country No Yes rel.freq
1: Female Citizen A 308 200 0.3937008
2: Female Citizen B 291 169 0.3673913
3: Female Citizen C 301 228 0.4310019
4: Female Citizen D 245 189 0.4354839
5: Female Immigrant A 107 95 0.4702970
6: Female Immigrant B 126 76 0.3762376
7: Female Immigrant C 137 70 0.3381643
8: Female Immigrant D 129 74 0.3645320
9: Male Citizen A 237 167 0.4133663
10: Male Citizen B 247 163 0.3975610
11: Male Citizen C 250 171 0.4061758
12: Male Citizen D 230 139 0.3766938
13: Male Immigrant A 103 68 0.3976608
14: Male Immigrant B 117 63 0.3500000
15: Male Immigrant C 93 53 0.3630137
16: Male Immigrant D 102 52 0.3376623
CodePudding user response:
Maybe you can make use of pivot-wider
.
library(tidyverse)
my_data %>% group_by (gender, status, country, disease) %>%
summarise (n=n()) %>%
mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%")) -> step_1
#> `summarise()` has grouped output by 'gender', 'status', 'country'. You can
#> override using the `.groups` argument.
step_1 |>group_by(country) |>
pivot_wider(names_from = disease,
values_from = c(n:rel.freq),
names_prefix = "disease_") |>
mutate(overallPerc = (n_disease_No n_disease_Yes)/sum(step_1$n))
#> # A tibble: 16 × 8
#> # Groups: country [4]
#> gender status country n_disease_No n_disease_Yes rel.fre…¹ rel.f…² overa…³
#> <fct> <fct> <fct> <int> <int> <chr> <chr> <dbl>
#> 1 Female Citizen A 308 200 61% 39% 0.102
#> 2 Female Citizen B 291 169 63% 37% 0.092
#> 3 Female Citizen C 301 228 57% 43% 0.106
#> 4 Female Citizen D 245 189 56% 44% 0.0868
#> 5 Female Immigrant A 107 95 53% 47% 0.0404
#> 6 Female Immigrant B 126 76 62% 38% 0.0404
#> 7 Female Immigrant C 137 70 66% 34% 0.0414
#> 8 Female Immigrant D 129 74 64% 36% 0.0406
#> 9 Male Citizen A 237 167 59% 41% 0.0808
#> 10 Male Citizen B 247 163 60% 40% 0.082
#> 11 Male Citizen C 250 171 59% 41% 0.0842
#> 12 Male Citizen D 230 139 62% 38% 0.0738
#> 13 Male Immigrant A 103 68 60% 40% 0.0342
#> 14 Male Immigrant B 117 63 65% 35% 0.036
#> 15 Male Immigrant C 93 53 64% 36% 0.0292
#> 16 Male Immigrant D 102 52 66% 34% 0.0308
#> # … with abbreviated variable names ¹rel.freq_disease_No,
#> # ²rel.freq_disease_Yes, ³overallPerc