In my continued exploration of Grolemund and Wickham's book, R4DS, I found that if I want to sequentially select or deselect certain variables, I have to call select() multiple times through the pipe operator. This is unlike the filter() verb, which you can call just once to filter on several variables in one go. An example of the filter() behaviour I am talking about:
library(nycflights13)
library(tidyverse)
flights %>%
  filter(arr_delay >= 120,
         dest == "IAH" | dest == "HOU",
         carrier == "UA" | carrier == "AA" | carrier == "DL",
         month >= 7 & month <= 9)
I tried to do something similar with select()
#Variation 1
flights %>%
  select(!dep_time : dep_delay) %>%
  select(year : arr_time)
#Output
# A tibble: 336,776 x 4
year month day arr_time
<int> <int> <int> <int>
1 2013 1 1 830
2 2013 1 1 850
3 2013 1 1 923
4 2013 1 1 1004
5 2013 1 1 812
6 2013 1 1 740
7 2013 1 1 913
8 2013 1 1 709
9 2013 1 1 838
10 2013 1 1 753
#Variation 2
flights %>%
  select(!dep_time : dep_delay,
         year : arr_time)
#Output
# A tibble: 336,776 x 19
year month day arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time
<int> <int> <int> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl>
1 2013 1 1 830 819 11 UA 1545 N14228 EWR IAH 227
2 2013 1 1 850 830 20 UA 1714 N24211 LGA IAH 227
3 2013 1 1 923 850 33 AA 1141 N619AA JFK MIA 160
4 2013 1 1 1004 1022 -18 B6 725 N804JB JFK BQN 183
5 2013 1 1 812 837 -25 DL 461 N668DN LGA ATL 116
6 2013 1 1 740 728 12 UA 1696 N39463 EWR ORD 150
7 2013 1 1 913 854 19 B6 507 N516JB EWR FLL 158
8 2013 1 1 709 723 -14 EV 5708 N829AS LGA IAD 53
9 2013 1 1 838 846 -8 B6 79 N593JB JFK MCO 140
10 2013 1 1 753 745 8 AA 301 N3ALAA LGA ORD 138
# ... with 336,766 more rows, and 7 more variables: distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>, dep_time <int>, sched_dep_time <int>, dep_delay <dbl>
From this, it looks as if in Variation 2 the second argument of select() is not being executed. I was of the opinion that select() is filter() for columns. Clearly, that's not the case. So, my questions are: (a) How could I have written less verbose code for select() without using multiple pipes, if possible? (b) How is select() different from filter() in what it does? (I think the answer depends on how rows and columns are structured differently, but my programming knowledge is negligible at best at this point. If the answer to this question is too large or out of scope for this forum, pointing or linking me to relevant sources will also do.)
This question is in continuation of the following, in case anyone wants to know why I came up with this weird comparison : previous question
Answer:
We can do this in a single step in select
flights %>%
  select(year:arr_time, -(dep_time:dep_delay))
-output
# A tibble: 336,776 × 4
year month day arr_time
<int> <int> <int> <int>
1 2013 1 1 830
2 2013 1 1 850
3 2013 1 1 923
4 2013 1 1 1004
5 2013 1 1 812
6 2013 1 1 740
7 2013 1 1 913
8 2013 1 1 709
9 2013 1 1 838
10 2013 1 1 753
# … with 336,766 more rows
The columns dep_time:dep_delay are a subset of the columns in year:arr_time:
> names(flights)[1:7]
[1] "year" "month" "day" "dep_time" "sched_dep_time" "dep_delay" "arr_time"
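So -(dep_time:dep_delay) removes that subset from the columns already selected by year:arr_time. By contrast, !dep_time : dep_delay on its own selects every column except dep_time through dep_delay. The arguments of a single select() call are combined as a union, so adding year : arr_time as a second argument does not narrow the selection: year, month, day and arr_time are already selected, and the three excluded columns are simply added back at the end, which is why Variation 2 returns all 19 columns. The first argument alone selects: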
flights %>%
  select(!dep_time : dep_delay) %>%
  names
[1] "year" "month" "day" "arr_time" "sched_arr_time" "arr_delay" "carrier" "flight"
[9] "tailnum" "origin" "dest" "air_time" "distance" "hour" "minute" "time_hour"
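To see how a single select() call combines its arguments, here is a minimal sketch. It assumes a small made-up tibble called toy (not part of the question) that reuses the first seven column names of flights plus two more, so it runs without nycflights13. The last line uses the & operator from tidyselect as one possible alternative spelling of the single-call version.

```r
library(dplyr)

# Hypothetical stand-in for `flights`: one row, nine of its column names.
toy <- tibble(
  year = 2013L, month = 1L, day = 1L,
  dep_time = 517L, sched_dep_time = 515L, dep_delay = 2,
  arr_time = 830L, sched_arr_time = 819L, arr_delay = 11
)

# Variation 2: the second argument is unioned with the first, so the three
# excluded columns come back, appended after the columns already selected.
variation2 <- names(select(toy, !dep_time:dep_delay, year:arr_time))
print(variation2)

# Single-call version: `-` removes columns from what was selected before it.
single_call <- names(select(toy, year:arr_time, -(dep_time:dep_delay)))
print(single_call)

# Alternative spelling: keep columns that are in the range AND not in the subset.
with_and <- names(select(toy, year:arr_time & !(dep_time:dep_delay)))
print(with_and)
```

On the real flights data the same pattern should apply: the order of arguments matters because each one is applied to the result built up so far.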
Regarding the difference between filter and select in how they handle the variadic (...) arguments, ?filter says:
... - Expressions that return a logical value, and are defined in terms of the variables in .data. If multiple expressions are included, they are combined with the & operator. Only rows for which all conditions evaluate to TRUE are kept.
In ?select, by contrast, the arguments are not combined with &. They are evaluated from left to right and combined into a single selection:
... - One or more unquoted expressions separated by commas.
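The contrast between the two verbs can be checked directly. The sketch below assumes a small made-up tibble called toy (hypothetical, not from the question): it compares comma-separated filter() conditions with an explicit &, and comma-separated select() arguments with an explicit c().

```r
library(dplyr)

# Hypothetical toy data with three of the `flights` column names.
toy <- tibble(
  month     = c(6L, 7L, 8L, 9L),
  carrier   = c("UA", "UA", "B6", "DL"),
  arr_delay = c(130, 150, 200, 90)
)

# filter(): comma-separated conditions behave like one condition joined by &.
by_comma <- filter(toy, arr_delay >= 120, month >= 7)
by_and   <- filter(toy, arr_delay >= 120 & month >= 7)
print(identical(by_comma, by_and))

# select(): comma-separated selections behave like c(), i.e. a union,
# so a later argument can add columns back but cannot narrow the result.
sel_comma <- select(toy, !month, month:carrier)
sel_c     <- select(toy, c(!month, month:carrier))
print(identical(names(sel_comma), names(sel_c)))
print(names(sel_comma))
```

So extra conditions in filter() can only shrink the set of rows, while extra positive selections in select() can only grow the set of columns; narrowing inside one select() call needs - or & !.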