Home > Blockchain >  Calculate Percentage by Group with multiple columns in R
Calculate Percentage by Group with multiple columns in R

Time:12-20

I have several data frames with monthly data, I would like to find the percentage distribution for each product and for each month. I have problem with multiple columns with months. Currently, I can only get a percentage by group for one month.

data <- data.frame(group = rep(LETTERS[1:3], each = 4),  
                   Product = letters[1:4],
                   January = sample(1:100,12),
                   February = sample(1:100,12))

data_new1 <- transform(data,                             
                       perc = ave(January,
                                  group,
                                  FUN = prop.table))
data_new1$perc<-round(data_new1$perc, 2)

> data_new1
   group Product January February perc
1      A       a      12       16 0.05
2      A       b      73       75 0.32
3      A       c      78       11 0.34
4      A       d      65       35 0.29
5      B       a      86       63 0.36
6      B       b      33       71 0.14
7      B       c      92       49 0.38
8      B       d      30       60 0.12
9      C       a      91       59 0.37
10     C       b      31       45 0.12
11     C       c      99        7 0.40
12     C       d      28       50 0.11

CodePudding user response:

tidyverse

library(dplyr)
data %>% 
  group_by(group) %>% 
  mutate(across(c("January", "February"), proportions, .names = "{.col}_perc")) %>% 
  ungroup()

# A tibble: 12 x 6
   group Product January February January_perc February_perc
   <chr> <chr>     <int>    <int>        <dbl>         <dbl>
 1 A     a            49       40      0.426          0.252 
 2 A     b             1        3      0.00870        0.0189
 3 A     c            19       50      0.165          0.314 
 4 A     d            46       66      0.4            0.415 
 5 B     a            61       82      0.218          0.285 
 6 B     b            88       51      0.314          0.177 
 7 B     c            32       75      0.114          0.260 
 8 B     d            99       80      0.354          0.278 
 9 C     a             6       31      0.0397         0.373 
10 C     b             8        5      0.0530         0.0602
11 C     c            92       20      0.609          0.241 
12 C     d            45       27      0.298          0.325 

base

data <- data.frame(group = rep(LETTERS[1:3], each = 4),  
                   Product = letters[1:4],
                   January = sample(1:100,12),
                   February = sample(1:100,12))

tmp <- sapply(c("January", "February"), function (x) ave(data[[x]], data$group, FUN = prop.table))

colnames(tmp) <- paste0(colnames(tmp), "_perc")

res <- cbind(data, tmp)

res
#>    group Product January February January_perc February_perc
#> 1      A       a      42       73   0.18260870   0.238562092
#> 2      A       b      67       92   0.29130435   0.300653595
#> 3      A       c      58       90   0.25217391   0.294117647
#> 4      A       d      63       51   0.27391304   0.166666667
#> 5      B       a      48       15   0.21621622   0.081521739
#> 6      B       b      16       82   0.07207207   0.445652174
#> 7      B       c      80       75   0.36036036   0.407608696
#> 8      B       d      78       12   0.35135135   0.065217391
#> 9      C       a      81       16   0.32793522   0.117647059
#> 10     C       b      83       81   0.33603239   0.595588235
#> 11     C       c      11        1   0.04453441   0.007352941
#> 12     C       d      72       38   0.29149798   0.279411765

Created on 2021-12-20 by the reprex package (v2.0.1)

data.table

library(data.table)

COLS <- c("January", "February")
COLS_RES <- paste0(COLS, "_perc")
setDT(data)[, (COLS_RES) := lapply(.SD, proportions), by = group, .SDcol = COLS][]

CodePudding user response:

These calculations are easier if your data is structured in a tidy way. In your case, January and February should probably be one single variable called month or something.

Example:

Underneath, I use tidyr::pivot_longer() to combine January and February into one column. Then I use the package dplyr to group the dataframe and calculate perc. I'm not using prop.table(), but I believe you just want the proportion of observation to the total of that group and month.


library(dplyr)
library(tidyr)

# To make the sampling underneath reproducable
set.seed(1)

data <- data.frame(
  group = rep(LETTERS[1:3], each = 4),  
  Product = letters[1:4],
  January = sample(1:100,12),
  February = sample(1:100,12)
)

data %>% 
  pivot_longer(c(January, February), names_to = "month", values_to = "x") %>% 
  group_by(group, month) %>%
  mutate(
    perc = round(x/sum(x), 2)
  )

I hope this is what you were looking for.

CodePudding user response:

Another dplyr solution:

library(dplyr)
data %>% 
  group_by(group) %>% 
  mutate(across(c(2:5), 
                ~./sum(.)*100, .names = "{.col}_pct"))
# A tibble: 12 × 10
# Groups:   group [3]
   group Product   Jan   Feb   Mar   May Jan_pct Feb_pct Mar_pct May_pct
   <chr> <chr>   <int> <int> <int> <int>   <dbl>   <dbl>   <dbl>   <dbl>
 1 A     a          14    14    95    50    8      18.4    44.4    20.9 
 2 A     b         100    33    28    32   57.1    43.4    13.1    13.4 
 3 A     c          11    16    13    95    6.29   21.1     6.07   39.7 
 4 A     d          50    13    78    62   28.6    17.1    36.4    25.9 
 5 B     a          29    42    72    13   22.0    33.9    20.3     7.07
 6 B     b           3     4    88    41    2.27    3.23   24.9    22.3 
 7 B     c          30    68    94    86   22.7    54.8    26.6    46.7 
 8 B     d          70    10   100    44   53.0     8.06   28.2    23.9 
 9 C     a           4    88    45    84    3.96   43.6    24.2    30.7 
10 C     b          52    12    26    55   51.5     5.94   14.0    20.1 
11 C     c          26    20    23    57   25.7     9.90   12.4    20.8 
12 C     d          19    82    92    78   18.8    40.6    49.5    28.5

Data:

data <- data.frame(group = rep(LETTERS[1:3], each = 4),  
                   Product = letters[1:4],
                   Jan = sample(1:100,12),
                   Feb = sample(1:100,12),
                   Mar = sample(1:100, 12),
                   May = sample(1:100, 12))
  • Related