How to eliminate schools with fewer than 20 students?

I have a dataset, espana2015, with the schools and students of one country. I want to remove the schools with fewer than 20 students. The school identifier is the variable CNTSCHID.

dim(espana2015)
[1] 6736  106

The only way I have found, which is long, manual and not very efficient, is to write out the schools one by one. Here there are only 13 schools with fewer than 20 students, but what if there were many more, e.g. more than 100 schools?

espana2015 %>% group_by(CNTSCHID) %>% summarise(students = n()) %>%
  filter(students < 20) %>% select(CNTSCHID) -> removeSch

removeSch
# A tibble: 13 x 1
   CNTSCHID
      <dbl>
 1 72400046
 2 72400113
 3 72400261
 4 72400314
 5 72400396
 6 72400472
 7 72400641
 8 72400700
 9 72400711
10 72400736
11 72400909
12 72400927
13 72400979

espana2015 %>% subset(!CNTSCHID %in% c(72400046,72400113,72400261,
                                      72400314,72400396,72400472,
                                      72400641,72400700,72400711,
                                      72400736,72400909,72400927,
                                      72400979)) -> new_espana2015

Please help me find a better way to do this. Walter

CodePudding user response：

Lacking sample data, I'll demonstrate on mtcars, where my cyl plays the role of your CNTSCHID.

library(dplyr)
table(mtcars$cyl)
#  4  6  8 
# 11  7 14 

mtcars %>%
  group_by(cyl) %>%
  filter(n() > 10) %>%
  ungroup()
# # A tibble: 25 x 11
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#  2  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#  3  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#  4  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#  5  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#  6  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
#  7  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
#  8  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
#  9  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
# 10  10.4     8  460    215  3     5.42  17.8     0     0     3     4
# # ... with 15 more rows

This works because the condition inside filter evaluates to a single logical value per group, and that length-1 TRUE/FALSE is then recycled across all rows of the group. For cyl == 4, (n() > 10) --> (11 > 10) --> TRUE, so the filter for that group is effectively filter(TRUE) and every row is kept; for cyl == 6, (7 > 10) is FALSE, so every row is dropped. dplyr::filter does "safe recycling" in the sense that the condition must be either the same length as the number of rows or of length 1. When it is of length 1, it is "all or nothing" for the group.
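Applied to the column name from the question, the same idea might look like the sketch below. The toy data frame and its student column are hypothetical stand-ins for espana2015 (which I don't have), and I am assuming one row per student. The second approach keeps the question's idea of first building a table of small schools, but drops them with dplyr::anti_join instead of typing out the ids.

```r
library(dplyr)

# Hypothetical stand-in for espana2015 (assumed: one row per student).
# School 101 has 25 students, 102 has 5, 103 has exactly 20.
toy <- data.frame(
  CNTSCHID = rep(c(101, 102, 103), times = c(25, 5, 20)),
  student  = seq_len(50)
)

# Approach 1: grouped filter, keeping schools with at least 20 rows.
kept <- toy %>%
  group_by(CNTSCHID) %>%
  filter(n() >= 20) %>%
  ungroup()

# Approach 2: build the table of small schools (like removeSch in the
# question) and remove them with an anti-join on the school id.
small <- toy %>%
  count(CNTSCHID, name = "students") %>%
  filter(students < 20)

kept2 <- anti_join(toy, small, by = "CNTSCHID")

unique(kept$CNTSCHID)   # schools 101 and 103 remain; 102 is gone
```

Note the condition is n() >= 20 rather than n() > 20, since a school with exactly 20 students does not have fewer than 20. Both approaches should keep the same rows; the grouped filter is just shorter.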

CodePudding user response：

Yes, my first attempt was to do it with filter and n(), but it didn't work because I had not added the ungroup() call, so I started to doubt everything. Thank you all very much; I lost several hours on this...