I have a dataset, espana2015, covering a country's schools, students…. I want to remove the schools with fewer than 20 students. The school identifier variable is CNTSCHID.
dim(espana2015)
[1] 6736 106
The only way I have found, which is long, manual and inefficient, is to write out the schools one by one. Here only 13 schools have fewer than 20 students, but what if there were many more, e.g. more than 100 schools?
espana2015 %>% group_by(CNTSCHID) %>% summarise(students = n()) %>%
filter(students < 20) %>% select(CNTSCHID) -> removeSch
removeSch
# A tibble: 13 x 1
CNTSCHID
<dbl>
1 72400046
2 72400113
3 72400261
4 72400314
5 72400396
6 72400472
7 72400641
8 72400700
9 72400711
10 72400736
11 72400909
12 72400927
13 72400979
espana2015 %>% subset(!CNTSCHID %in% c(72400046,72400113,72400261,
72400314,72400396,72400472,
72400641,72400700,72400711,
72400736,72400909,72400927,
72400979)) -> new_espana2015
Please help me do this better. Walter
CodePudding user response:
Lacking sample data, I'll demonstrate on mtcars, where my cyl is your CNTSCHID.
library(dplyr)
table(mtcars$cyl)
# 4 6 8
# 11 7 14
mtcars %>%
group_by(cyl) %>%
filter(n() > 10) %>%
ungroup()
# # A tibble: 25 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
# 2 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
# 3 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
# 4 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 5 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
# 6 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
# 7 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
# 8 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
# 9 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
# 10 10.4 8 460 215 3 5.42 17.8 0 0 3 4
# # ... with 15 more rows
This works because the conditional in filter resolves to a single logical, and that length-1 true/false is then recycled for all rows in that group. That is, for cyl == 4, (n() > 10) --> (11 > 10) --> TRUE, so the filter is %>% filter(TRUE) and every row of that group is kept; for cyl == 6, (7 > 10) --> FALSE, so the whole group is dropped. The dplyr::filter function does "safe recycling" in a sense, where the conditional must be the same length as the number of rows, or length 1. When it is length 1, it is essentially saying "all or nothing".
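Applied to the question's column name, the same pattern might look like the sketch below. The data frame toy and its score column are hypothetical stand-ins for espana2015, which I don't have; the threshold n() >= 20 is my reading of "remove schools with fewer than 20 students". The second half shows an ungrouped alternative that reuses a lookup of small schools instead of typing the IDs by hand.

```r
library(dplyr)

# Hypothetical stand-in for espana2015: three schools with 25, 5 and 30 students.
toy <- data.frame(
  CNTSCHID = rep(c(101, 102, 103), times = c(25, 5, 30)),
  score    = seq_len(60)
)

# Keep only the rows whose school has at least 20 students.
kept <- toy %>%
  group_by(CNTSCHID) %>%
  filter(n() >= 20) %>%
  ungroup()

# Alternative without grouping: build the list of small schools,
# then drop every row whose CNTSCHID appears in it.
small <- toy %>%
  count(CNTSCHID, name = "students") %>%
  filter(students < 20)

kept2 <- toy %>%
  filter(!CNTSCHID %in% small$CNTSCHID)
```

Both versions should leave the 55 rows belonging to schools 101 and 103; the grouped form avoids the intermediate lookup table.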
CodePudding user response:
Yes, my first attempt was to do it with filter and n(), but it didn't work because I had left out the ungroup() call. So I started to doubt everything. Thank you all very much, I lost several hours on this...
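For anyone wondering what ungroup() changes here, a small sketch on mtcars (again standing in for the real data): as far as I can tell, filter() keeps the same rows either way, and the difference is only that the result stays grouped, which affects later steps such as summarise().

```r
library(dplyr)

# Filter by group size, with and without the trailing ungroup().
grouped <- mtcars %>% group_by(cyl) %>% filter(n() > 10)
flat    <- grouped %>% ungroup()

# The same rows survive in both cases.
nrow(grouped) == nrow(flat)

# Only the grouping metadata differs.
is_grouped_df(grouped)
is_grouped_df(flat)

# A later summarise() respects that metadata:
# one row per remaining cyl value when grouped, a single row when ungrouped.
by_cyl  <- grouped %>% summarise(mean_mpg = mean(mpg))
overall <- flat %>% summarise(mean_mpg = mean(mpg))
```

So if a later step gave unexpected per-school results, leftover grouping is a plausible cause, even though the filtering itself was already correct.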