Home > Software engineering >  Using numerical variables as by group to get summary statistics
Using numerical variables as by group to get summary statistics

Time:11-16

I have data as follows:

library(data.table)
dat <- fread("total women young
              1       0      0
              1       1      1
              1       0      1
              2       1      1
              2       2      1
              2       2      1
              3       1      2
              3       2      3
              3       2      3
              4       4      2
              4       4      3
              4       3      3
              5       5      2
              5       2      3
              5       5      3
              10       4      2
              10       4      3
              20       5      3
             100      10     20")

I would like to create six categories for the variable tot_num:

1,2,3,4,5 and over 5.

I would like to count the observations per category total in count. sum_tot would simply be these multiplied. And womenand young are the average amount of women and young people in that group.

Desired output

            total count sum_tot_count women young
              1       3      3          0.33   0.66
              2       3      6          5/6    0.5
              3       3      9          5/9    8/9
              4       3      12         11/12  10/12
              5       3      15         12/15  8/15
              over 5  4      140        23/140 28/140

I am having some trouble figuring out where to start.

Could someone lead me on the right track?

CodePudding user response:

Does this work:

library(dplyr)
dat %>% mutate(tot = if_else(total > 5, 'over 5', as.character(total))) %>% 
      group_by(tot) %>% summarise(count = n(), sum_tot_count = sum(total), women = sum(women)/sum(total), young = sum(young)/sum(total))
# A tibble: 6 × 5
  tot    count sum_tot_count women young
  <chr>  <int>         <int> <dbl> <dbl>
1 1          3             3 0.333 0.667
2 2          3             6 0.833 0.5  
3 3          3             9 0.556 0.889
4 4          3            12 0.917 0.667
5 5          3            15 0.8   0.533
6 over 5     4           140 0.164 0.2  

CodePudding user response:

With cut:

dat %>% 
  group_by(cutGroup = cut(total, breaks = c(1:6, Inf), labels = c(1:5, "over 5"), include.lowest = TRUE, right = FALSE)) %>% 
  summarise(count = n(),
            sum_tot_count = sum(total),
            women = sum(women) / sum(total),
            young = sum(young) / sum(total))     

CodePudding user response:

A data.table solution. The key is using cut(), as in other answers; after that, basic data.table syntax as in Use data.table to count and aggregate / summarize a column will get you the rest of the way:

dat[, cat := cut(total, breaks = 0.5   c(0:5,Inf), labels = c(1:5, "over 5"))]
      .(count = n())]
dat[,.(count=.N, 
       total = sum(total), 
       women = sum(women)/sum(total),
       young = sum(young)/sum(total)), 
    by = cat]
  • Related