How best to calculate relative shares of different columns in R?

Below is the sample data and code. I have two issues. First, I need the indtotal column to be the sum of employment by twodigit code, repeated on every row of that code as shown below, so that I can divide one column by the other to arrive at the smbshare number. When I try the following,

second <- first %>% 
  group_by(twodigit,smb) %>% 
  summarize(indtotal = sum(employment))

it breaks the total down by both twodigit and smb instead.
The second issue is producing a 0 when a value does not exist. The best example is twodigit code 51 with smb = 4: when a given twodigit code does not have all 4 distinct smb values, I want the missing combinations to appear with a 0.

Note: smb is short for small business

library(dplyr)

naicstest <- c(512131,512141,521921,522654,512131,536978,541214,531214,621112,541213,551212,574121,569887,541211,523141,551122,512312,521114,522112)
employment <- c(11,130,315,17,190,21,22,231,15,121,19,21,350,110,515,165,12,110,111)
smb <- c(1,2,3,1,3,1,1,3,1,2,1,1,4,2,4,3,1,2,2)

first <- data.frame(naicstest, employment, smb)

# First two digits of the NAICS code identify the industry
first <- first %>% mutate(twodigit = substr(naicstest, 1, 2))

# Total employment per two-digit industry (one row per twodigit)
second <- first %>% group_by(twodigit) %>% summarize(indtotal = sum(employment))

Desired result is below

  twodigit      indtotal    smb      smbtotal           smbshare
     51            343        1      23 (11 + 12)       23/343
     51            343        2      130                130/343
     51            343        3      190                190/343
     51            343        4      0                  0/343
     52            1068       1      17                 17/1068
     52            1068       2      221 (110 + 111)    221/1068
     52            1068       3      315                315/1068
     52            1068       4      515                515/1068

CodePudding user response：

This gives you all the columns you need, but in a slightly different order. You can use select() or relocate() to put them in the order you want:

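library(dplyr)
library(tidyr)  # complete() comes from tidyr

first %>%
  # employment per twodigit/smb pair
  group_by(twodigit, smb) %>%
  summarize(smbtotal = sum(employment)) %>%
  ungroup() %>%
  # add the missing twodigit/smb combinations with a total of 0
  complete(twodigit, smb, fill = list(smbtotal = 0)) %>%
  # mutate() keeps every row, so indtotal repeats within each twodigit
  group_by(twodigit) %>%
  mutate(
    indtotal = sum(smbtotal),
    smbshare = smbtotal / indtotal
  )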

`summarise()` has grouped output by 'twodigit'. You can override using the `.groups` argument.
# A tibble: 32 × 5
# Groups:   twodigit [8]
   twodigit   smb smbtotal indtotal smbshare
   <chr>    <dbl>    <dbl>    <dbl>    <dbl>
 1 51           1       23      343   0.0671
 2 51           2      130      343   0.379 
 3 51           3      190      343   0.554 
 4 51           4        0      343   0     
 5 52           1       17     1068   0.0159
 6 52           2      221     1068   0.207 
 7 52           3      315     1068   0.295 
 8 52           4      515     1068   0.482 
 9 53           1       21      252   0.0833
10 53           2        0      252   0     
# … with 22 more rows
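
If you want to cross-check those numbers without the tidyverse, here is a rough base R sketch of the same idea. It is not the approach used above: it relies on xtabs() filling absent twodigit/smb pairs with 0 and on ave() repeating the industry total on every row. The names `tab` and `check` are made up for illustration, and the columns are put in the order shown in the question.

```r
# Sample data from the question
naicstest <- c(512131, 512141, 521921, 522654, 512131, 536978, 541214, 531214,
               621112, 541213, 551212, 574121, 569887, 541211, 523141, 551122,
               512312, 521114, 522112)
employment <- c(11, 130, 315, 17, 190, 21, 22, 231, 15, 121, 19, 21, 350, 110,
                515, 165, 12, 110, 111)
smb <- c(1, 2, 3, 1, 3, 1, 1, 3, 1, 2, 1, 1, 4, 2, 4, 3, 1, 2, 2)
first <- data.frame(naicstest, employment, smb)
first$twodigit <- substr(first$naicstest, 1, 2)

# xtabs() sums employment for every twodigit/smb pair;
# pairs that never occur in the data come out as 0
tab <- xtabs(employment ~ twodigit + smb, data = first)

# Long format: one row per twodigit/smb combination
check <- as.data.frame(tab, responseName = "smbtotal", stringsAsFactors = FALSE)
check$smb <- as.numeric(check$smb)

# Industry total, repeated on every row of the same twodigit
check$indtotal <- ave(check$smbtotal, check$twodigit, FUN = sum)
check$smbshare <- check$smbtotal / check$indtotal

# Column order from the question, sorted by twodigit then smb
check <- check[order(check$twodigit, check$smb),
               c("twodigit", "indtotal", "smb", "smbtotal", "smbshare")]
head(check, 8)
```

Because ave() returns a vector as long as its input, indtotal stays constant within each twodigit, which is the same effect as using mutate() instead of summarize() after group_by(twodigit).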