how best to calculate this share of a total-CodePudding

Below is the sample data. The goal is to first create a column that contains the total employment for that quarter. Second is to create a new column that shows the relative share for the area. Finally, the last item (and one which is vexing me) is to calculate whether the total with suppress = 0 represents over 50% of the total. I can do this in excel easily but trying to this in R and so have it be something that I can replicate year after year.

desired result is below

  area <- c("001","005","007","009","011","013","015","017","019","021","023","027","033","001","005","007","009","011","013","015","017","019","021","023","027","033")
 year <- c("2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021")
 qtr <- c("01","01","01","01","01","01","01","01","01","01","01","01","01","02","02","02","02","02","02","02","02","02","02","02","02","02")
  employment <- c(2,4,6,8,11,10,12,14,16,18,20,22,30,3,5,8,9,12,9,24,44,33,298,21,26,45)
  suppress <- c(0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0)

  testitem <- data.frame(year,qtr, area, employment, suppress)

For the first quarter of 2021, the total is 173. If you only take suppress = 1 into account, that is only 24 of 173 hence the TRUE in the 50 percent column. If these two values summed up to 173/2 or greater than you would have it say FALSE. For the second quarter, the suppress = 1 accounts for 310 of the total of 537 and so is over 50% of the total.

For the total column, I am showing the computation or ingredients. Ideally, it would show a value such as .0115 in place of 2/173.

 year    qtr   area     employment   suppress     total       50percent
2021     01     001        2           0          =2/173       TRUE
2021     01     005        4           0          =4/173       TRUE
.....
2021     02     001        3           0          =3/537       FALSE
2021     02     005        5           0          =5/537       FALSE

CodePudding user response：

For example:

library(dplyr)

testitem %>% 
  group_by(year, qtr) %>% 
  mutate(
    total = employment / sum(employment),
    over_half = sum(employment[suppress == 0]) > (0.5 * sum(employment))
  )

Gives:

# A tibble: 26 × 7
# Groups:   year, qtr [2]
   year  qtr   area  employment suppress  total over_half
   <chr> <chr> <chr>      <dbl>    <dbl>  <dbl> <lgl>    
 1 2021  01    001            2        0 0.0116 TRUE     
 2 2021  01    005            4        0 0.0231 TRUE     
 3 2021  01    007            6        0 0.0347 TRUE     
 4 2021  01    009            8        1 0.0462 TRUE     
 5 2021  01    011           11        0 0.0636 TRUE     
 6 2021  01    013           10        0 0.0578 TRUE     
 7 2021  01    015           12        0 0.0694 TRUE     
 8 2021  01    017           14        0 0.0809 TRUE     
 9 2021  01    019           16        1 0.0925 TRUE     
10 2021  01    021           18        0 0.104  TRUE     
# … with 16 more rows
# ℹ Use `print(n = ...)` to see more rows

CodePudding user response：

I think you'll want to use group_by() and mutate() here.

library(dplyr)

testitem |> 
  ## grouping by year and quarter
  ## sums will be calculated over areas
  group_by(year, qtr) |> 
  ## this could be more terse, but gets the job done.
  mutate(total_sum = sum(employment),
         ## This uses the total_sum column that was just created
         total_prop = employment/total_sum,
         ## leveraging the 0,1 coding of suppress
         suppress_sum = sum(suppress * employment),
         suppress_prop = suppress_sum/total,
         fifty = (1-suppress_prop) > 0.5)