How to get a conditional proportion in a tibble in r-CodePudding

I have this tibble

       host_id district                   availability_365
         <dbl>    <chr>                       <dbl>
     1    8573 Fatih                          280
     2    3725 Maltepe                        365
     3    1428 Fatih                          355
     4    6284 Fatih                          164
     5    3518 Esenyurt                       0
     6    8427 Esenyurt                       153
     7    4218 Fatih                          0
     8    5342 Kartal                         134
     9    4297 Pendik                         0
    10    9340 Maltepe                        243
    # … with 51,342 more rows

I want to find out how high the proportion of the hosts (per district) is which have all their rooms on availability_365 == 0. As you can see there are 51352 rows but there aren't different hosts in all rows. There are actually exactly 37572 different host_ids.

I know that I can use the command group_by(district) to get it split up into the 5 different districts but I am not quite sure how to solve the issue to find out how many percent of the hosts only have rooms with no availability. Anybody can help me out here?

CodePudding user response：

Use summarise() function along with group_by() in dplyr.

library(dplyr)

df %>% 
  group_by(district) %>%  
  summarise(Zero_Availability = sum(availability_365==0)/n())

# A tibble: 5 x 2
  district Zero_Availability
  <chr>                <dbl>
1 Esenyurt              0.5 
2 Fatih                 0.25
3 Kartal                0   
4 Maltepe               0   
5 Pendik                1

CodePudding user response：

It's difficult to make sure my answer is working without actually having the data, but if you're open to using data.table, the following should work

library(data.table)

setDT(data)
data[, .(no_avail = all(availability_365 == 0)), .(host_id, district)][, .(
  prop_no_avail = sum(no_avail) / .N
), .(district)]