R: Pivoting Grouped Frequencies In Terms of their Counts-CodePudding

I am working with the R programming language.

I have the following dataset:

set.seed(123)
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
gender <- as.factor(gender)


status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
status  <- as.factor(status )


country <- c("A", "B", "C", "D")
country <- sample(country, 5000, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25))
country  <- as.factor(country)



################

disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
disease <- as.factor(disease)

###################
my_data = data.frame(gender, status, disease, country)

I want to find out the relative percentage of each unique group of factors that have the disease vs do not have the disease.

As an example:

What percentage of Male Immigrants from Country A have the disease vs don't have the disease
What percentage of Male Citizens from Country A have the disease vs don't have the disease (both these percentages should add to 1)
etc.

I tried to do this with the following code:

# https://stackoverflow.com/questions/24576515/relative-frequencies-proportions-with-dplyr
library(dplyr)

step_1 = my_data %>% group_by (gender, status, country, disease) %>%
  summarise (n=n()) %>%
  mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%"))


`summarise()` has grouped output by 'gender', 'status', 'country'. You can override using the
`.groups` argument.
# A tibble: 32 x 6
# Groups:   gender, status, country [16]
   gender status    country disease     n rel.freq
   <fct>  <fct>     <fct>   <fct>   <int> <chr>   
 1 Female Citizen   A       No        285 60%     
 2 Female Citizen   A       Yes       193 40%

Now (assuming this is correct), I am trying to make some modifications to this result - this should reduce the number of rows by half (i.e. two rows in step_1 for yes and no are now combined into a single row):

step_2 = step_1 %>% 
  group_by(gender, status, country) %>% 
  summarize(disease = first(disease),
# number of people in this row that do not have the disease
            n_no = sum(disease == "No"),
# number of people in this row that do have the disease
            n_yes = sum(disease == "Yes"),
# relative percent of people in this row that do not  have the disease
            n_no_rel_freq = paste(round(sum(disease == "No") / sum(n) * 100), "%"),
# relative percent of people in this row that do have the disease
            n_yes_rel_freq = paste(round(sum(disease == "Yes") / sum(n) * 100), "%"),
# overall percent of all people in this row relative to entire population
            overall_percent = sum(n) / sum(step_1$n))

The code seems to run - but many of the percentages are now 0:

# A tibble: 16 x 9
# Groups:   gender, status [4]
   gender status    country disease  n_no n_yes n_no_rel_freq n_yes_rel_freq overall_percent
   <fct>  <fct>     <fct>   <fct>   <int> <int> <chr>         <chr>                    <dbl>
 1 Female Citizen   A       No          1     0 0 %           0 %                     0.102 
 2 Female Citizen   B       No          1     0 0 %           0 %                     0.092

Can someone please show me how to fix this?

Thanks!

Note: The final result should look something like this

    # desired result (sample)
  gender  status country n_no n_yes n_no_rel_freq n_yes_rel_freq total overall_percent
1 female citizen       A  285   193           0.6            0.4   478          0.0956

CodePudding user response：

I would solve this with the help of data.table:

install(data.table)

setDT(my_data)

my_data[, .N, by = .(gender, status, country, disease)][
           , dcast(.SD, gender status country~disease, value.var = "N")][
                    , rel.freq := Yes/(No Yes)][]

What is in there:

You install data.table

You convert my_data to a data.table (setDT(my_data))
With my_data[, .N, by = .(gender, status, country, disease)] you count cases (.N) grouped by all the variables after by=.
With [, dcast(.SD, gender status country~disease, value.var = "N")] you counvert your long table into a wide one, leaving the levels of disease as new column headers and summing on N, which is the number of cases.
With [, rel.freq := Yes/(No Yes)] you create a new variable rel.freq that is the result of dividing the positive cases in the total cases.
With the [] you display the result to screen (you don't need this step, if you want to assign the result to a new object).

This is the result I obtained:

    gender    status country  No Yes  rel.freq
 1: Female   Citizen       A 308 200 0.3937008
 2: Female   Citizen       B 291 169 0.3673913
 3: Female   Citizen       C 301 228 0.4310019
 4: Female   Citizen       D 245 189 0.4354839
 5: Female Immigrant       A 107  95 0.4702970
 6: Female Immigrant       B 126  76 0.3762376
 7: Female Immigrant       C 137  70 0.3381643
 8: Female Immigrant       D 129  74 0.3645320
 9:   Male   Citizen       A 237 167 0.4133663
10:   Male   Citizen       B 247 163 0.3975610
11:   Male   Citizen       C 250 171 0.4061758
12:   Male   Citizen       D 230 139 0.3766938
13:   Male Immigrant       A 103  68 0.3976608
14:   Male Immigrant       B 117  63 0.3500000
15:   Male Immigrant       C  93  53 0.3630137
16:   Male Immigrant       D 102  52 0.3376623

CodePudding user response：

Maybe you can make use of pivot-wider.

library(tidyverse)


my_data %>% group_by (gender, status, country, disease) %>%
  summarise (n=n()) %>%
  mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%")) -> step_1
#> `summarise()` has grouped output by 'gender', 'status', 'country'. You can
#> override using the `.groups` argument.

step_1 |>group_by(country) |>  
  pivot_wider(names_from = disease, 
              values_from = c(n:rel.freq), 
              names_prefix = "disease_") |> 
  mutate(overallPerc = (n_disease_No   n_disease_Yes)/sum(step_1$n))
#> # A tibble: 16 × 8
#> # Groups:   country [4]
#>    gender status    country n_disease_No n_disease_Yes rel.fre…¹ rel.f…² overa…³
#>    <fct>  <fct>     <fct>          <int>         <int> <chr>     <chr>     <dbl>
#>  1 Female Citizen   A                308           200 61%       39%      0.102 
#>  2 Female Citizen   B                291           169 63%       37%      0.092 
#>  3 Female Citizen   C                301           228 57%       43%      0.106 
#>  4 Female Citizen   D                245           189 56%       44%      0.0868
#>  5 Female Immigrant A                107            95 53%       47%      0.0404
#>  6 Female Immigrant B                126            76 62%       38%      0.0404
#>  7 Female Immigrant C                137            70 66%       34%      0.0414
#>  8 Female Immigrant D                129            74 64%       36%      0.0406
#>  9 Male   Citizen   A                237           167 59%       41%      0.0808
#> 10 Male   Citizen   B                247           163 60%       40%      0.082 
#> 11 Male   Citizen   C                250           171 59%       41%      0.0842
#> 12 Male   Citizen   D                230           139 62%       38%      0.0738
#> 13 Male   Immigrant A                103            68 60%       40%      0.0342
#> 14 Male   Immigrant B                117            63 65%       35%      0.036 
#> 15 Male   Immigrant C                 93            53 64%       36%      0.0292
#> 16 Male   Immigrant D                102            52 66%       34%      0.0308
#> # … with abbreviated variable names ¹rel.freq_disease_No,
#> #   ²rel.freq_disease_Yes, ³overallPerc