Home > OS >  Get percentage values across multiple columns based on factors given a group by function in R
Get percentage values across multiple columns based on factors given a group by function in R

Time:10-15

I have a df that is only looking at 1 ID with their respective assets:

ID  | Asset | CONF_1 |       CONF_2 |    CONF_3 | 
1       A     PERFECT        HIGH        LOW    
1       B     PERFECT        LOW         LOW
1       C     LOW            HIGH        VERY LOW
1       D     NA             MEDIUM      MEDIUM
1       E     MEDIUM         MEDIUM      PERFECT
1       F     MEDIUM         VERY LOW    NA
1       G     VERY LOW       VERY LOW    VERY LOW
1       H     NA             PERFECT     HIGH

The goal is to reorganize the df so I can break down the % of each confidence level (PERFECT, HIGH, MEDIUM, ETC) given each ID and the 3 Conf fields.

Desired output

ID | CONFIDENCE | CONF_1 % | CONF_2 % | CONF_3 %
1     PERFECT      25 %      12.5 %     12.5 %
1     HIGH         0         25 %       12.5 %
1     MEDIUM       25 %      25 %       12.5 %
1     LOW          12.5 %    12.5 %     25 %
1     VERY LOW     12.5 %    25 %       25 %
1     NA           25 %      0          12.5 %

CodePudding user response:

Grouped by 'ID', summarise across the 'CONF' columns, get the frequency count with table on a factor column with levels specified in the order, and find the proportions

library(dplyr)
df1 %>% 
   group_by(ID) %>% 
   summarise(lvls = c("PERFECT", "HIGH", "MEDIUM", "LOW", "VERY LOW", NA), 
    across(starts_with("CONF"), 
    ~ 100 * proportions(table(factor(., levels = na.omit(lvls)), 
        useNA = "always"))), .groups = 'drop') %>%
   rename(CONFIDENCE = lvls)

-output

# A tibble: 6 × 5
     ID CONFIDENCE CONF_1  CONF_2  CONF_3 
  <int> <chr>      <table> <table> <table>
1     1 PERFECT    25.0    12.5    12.5   
2     1 HIGH        0.0    25.0    12.5   
3     1 MEDIUM     25.0    25.0    12.5   
4     1 LOW        12.5    12.5    25.0   
5     1 VERY LOW   12.5    25.0    25.0   
6     1 <NA>       25.0     0.0    12.5   

--

Or another option is to reshape to 'long' format with pivot_longer, do the count and reshape back to 'wide' format with pivot_wider

library(tidyr)
df1 %>% 
   select(-Asset) %>% 
   pivot_longer(cols = starts_with("CONF"), values_to = 'CONFIDENCE') %>% 
   count(ID, name, CONFIDENCE) %>%
   group_by(ID, name) %>%
   mutate(n = 100 *n/sum(n) ) %>%
   ungroup %>%
   pivot_wider(names_from = name, values_from = n, values_fill = 0)

-output

# A tibble: 6 × 5
     ID CONFIDENCE CONF_1 CONF_2 CONF_3
  <int> <chr>       <dbl>  <dbl>  <dbl>
1     1 LOW          12.5   12.5   25  
2     1 MEDIUM       25     25     12.5
3     1 PERFECT      25     12.5   12.5
4     1 VERY LOW     12.5   25     25  
5     1 <NA>         25      0     12.5
6     1 HIGH          0     25     12.5

data

df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Asset = c("A", 
"B", "C", "D", "E", "F", "G", "H"), CONF_1 = c("PERFECT", "PERFECT", 
"LOW", NA, "MEDIUM", "MEDIUM", "VERY LOW", NA), CONF_2 = c("HIGH", 
"LOW", "HIGH", "MEDIUM", "MEDIUM", "VERY LOW", "VERY LOW", "PERFECT"
), CONF_3 = c("LOW", "LOW", "VERY LOW", "MEDIUM", "PERFECT", 
NA, "VERY LOW", "HIGH")), class = "data.frame", row.names = c(NA, 
-8L))
  • Related