How to calculate many percentages at once without making your script too big

The mtcars dataset contains the variable "carb" with the number of carburetors. First I want to find out how many cars have 1, 2, 3, etc. carburetors. I used the dplyr verb count().

library(dplyr)

df <- mtcars 

N <- df %>%
  count(carb)

which results in:

  carb  n
1    1  7
2    2 10
3    3  3
4    4 10
5    6  1
6    8  1

Then I want to know how many of the cars with 1 carb, with 2 carbs, with 3 carbs, etc. have 4, 6, or 8 cylinders.

For example, I used filter() to find the total number of cars with 1 carb and 4 cylinders:

carb1cyl4 <- df %>%
  filter(carb == 1, cyl == 4) %>%
  count() %>%
  rename(carb1cyl4 = n)

which results in:

  carb1cyl4
1         5

I did the same for 6 and 8 cylinders, with the following results:


  carb1cyl6
1         2
  carb1cyl8
1         0

If I continue this for all carbs, I could combine the results with bind_rows() and bind_cols() and then calculate the percentage of cars for each carb / cyl combination with mutate(carbXcylX / N), i.e. dividing the number of cars in each carb / cyl combination by the number of cars with the corresponding number of carbs.

The problem is that my dataset is much larger, so continuing this way would take ages and be error-prone. Is there another way to calculate this?

A glimpse of the final outcome should look like this.

  carb  n  perc1cy4  perc1cy6 perc1cy8
1    1  7 0.7142857 0.2857143        0

Thank you in advance!

CodePudding user response：

Using table:

cbind(n = table(mtcars$carb),
      prop.table(with(mtcars, table(carb, cyl)), margin = 1))
#    n         4         6   8
# 1  7 0.7142857 0.2857143 0.0
# 2 10 0.6000000 0.0000000 0.4
# 3  3 0.0000000 0.0000000 1.0
# 4 10 0.0000000 0.4000000 0.6
# 6  1 0.0000000 1.0000000 0.0
# 8  1 0.0000000 0.0000000 1.0
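
If a data frame with named columns is preferred over a matrix, the same base R idea can be wrapped up roughly as follows. This is only a sketch: the object names and the perc_cyl prefix are made up for illustration and are not part of the answer above.

```r
# Cars per carb, and row-wise proportions of cyl within each carb
carb_n   <- table(mtcars$carb)
cyl_prop <- prop.table(table(mtcars$carb, mtcars$cyl), margin = 1)

# as.data.frame.matrix keeps the wide layout (one column per cyl value)
result <- data.frame(
  carb = as.integer(names(carb_n)),
  n    = as.vector(carb_n),
  as.data.frame.matrix(cyl_prop)
)

# Hypothetical column names chosen for this sketch: perc_cyl4, perc_cyl6, perc_cyl8
names(result)[-(1:2)] <- paste0("perc_cyl", colnames(cyl_prop))
rownames(result) <- NULL

result
```

Because the column names are built from colnames(cyl_prop), additional cylinder values in a larger dataset would get their own columns without any extra code.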

CodePudding user response：

What I'd probably suggest is making a group size column (the number of cars per carb) with something like

count_df <- df %>% count(carb) %>% rename(group_size = n)

Then you can inner join that to the table of counts per carb / cyl combination

combo_df <- df %>% count(carb, cyl) %>% inner_join(count_df, by = "carb")

Then calculate percentage with

combo_df %>% mutate(perc = (n / group_size) * 100)
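
The counting and dividing steps can also be folded into a single pipeline and reshaped to one row per carb. The sketch below assumes tidyr is available for pivot_wider(); the names n_combo, wide and the perc_cyl prefix are invented for illustration. It keeps proportions rather than multiplying by 100, to match the layout asked for in the question.

```r
library(dplyr)
library(tidyr)

wide <- mtcars %>%
  count(carb, cyl, name = "n_combo") %>%   # cars per carb / cyl combination
  group_by(carb) %>%
  mutate(n    = sum(n_combo),              # cars per carb
         perc = n_combo / n) %>%           # share of each cyl within the carb group
  ungroup() %>%
  select(-n_combo) %>%
  pivot_wider(names_from   = cyl,
              values_from  = perc,
              names_prefix = "perc_cyl",
              values_fill  = list(perc = 0))  # combinations that never occur become 0

wide
```

Grouping by carb and summing the combination counts gives the same group size as a separate count(carb) plus join, so the join step is not needed here.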

CodePudding user response：

This can be made more succinct, but here's a starting point, using summarise

mtcars %>%
  group_by(carb) %>%
  summarise(n(),
            sum(cyl == 4),
            sum(cyl == 6),
            sum(cyl == 8),
            mean(cyl == 4),
            mean(cyl == 6),
            mean(cyl == 8))

#> # A tibble: 6 x 8
#>    carb `n()` `sum(cyl == 4)` `sum(cyl == 6)` `sum(cyl == 8)` `mean(cyl == 4)` `mean(cyl == 6)` `mean(cyl == 8)`
#>   <dbl> <int>           <int>           <int>           <int>            <dbl>            <dbl>            <dbl>
#> 1     1     7               5               2               0            0.714            0.286              0  
#> 2     2    10               6               0               4            0.6              0                  0.4
#> 3     3     3               0               0               3            0                0                  1  
#> 4     4    10               0               4               6            0                0.4                0.6
#> 5     6     1               0               1               0            0                1                  0  
#> 6     8     1               0               0               1            0                0                  1
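
One possible way to make this more succinct, so that no cylinder value is typed out by hand, is to let summarise() unpack a one-row tibble per group. This sketch assumes dplyr 1.0.0 or later (where an unnamed data frame result is spread into columns); cyl_levels, cyl_shares and the perc_cyl prefix are hypothetical names used only here.

```r
library(dplyr)

# Distinct cylinder values, so nothing is hard-coded per level
cyl_levels <- sort(unique(mtcars$cyl))

# Hypothetical helper: share of each cylinder value in a vector, as a one-row tibble
cyl_shares <- function(cyl) {
  shares <- vapply(cyl_levels, function(v) mean(cyl == v), numeric(1))
  as_tibble(as.list(setNames(shares, paste0("perc_cyl", cyl_levels))))
}

res <- mtcars %>%
  group_by(carb) %>%
  summarise(n = n(), cyl_shares(cyl))   # the tibble is unpacked into columns

res
```

The helper reads cyl_levels from the enclosing environment, which keeps the sketch short; passing the levels in as an argument would be the more defensive choice in a larger script.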