Why do I have to call dplyr::select() multiple times, unlike filter()?

Time:12-31

In my continued exploration of Grolemund and Wickham's book, R4DS, I found that if I want to sequentially select or deselect certain variables, I have to call select() multiple times through the pipe operator. This is unlike the filter() verb, which can be called just once to filter on several variables in one go. Here is an example of the filter() behaviour I am talking about:

library(nycflights13)
library(tidyverse)

flights %>%
  filter(arr_delay >= 120, 
         dest == "IAH" | dest == "HOU",
         carrier == "UA" | carrier == "AA" | carrier == "DL",
         month >= 7 & month <= 9)

I tried to do something similar with select():

#Variation 1

flights %>% 
  select(!dep_time : dep_delay) %>% 
  select(year : arr_time)

#Output

# A tibble: 336,776 x 4
    year month   day arr_time
   <int> <int> <int>    <int>
 1  2013     1     1      830
 2  2013     1     1      850
 3  2013     1     1      923
 4  2013     1     1     1004
 5  2013     1     1      812
 6  2013     1     1      740
 7  2013     1     1      913
 8  2013     1     1      709
 9  2013     1     1      838
10  2013     1     1      753

#Variation 2

flights %>%
  select(!dep_time : dep_delay,
         year : arr_time)

#Output

# A tibble: 336,776 x 19
    year month   day arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time
   <int> <int> <int>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>
 1  2013     1     1      830            819        11 UA        1545 N14228  EWR    IAH        227
 2  2013     1     1      850            830        20 UA        1714 N24211  LGA    IAH        227
 3  2013     1     1      923            850        33 AA        1141 N619AA  JFK    MIA        160
 4  2013     1     1     1004           1022       -18 B6         725 N804JB  JFK    BQN        183
 5  2013     1     1      812            837       -25 DL         461 N668DN  LGA    ATL        116
 6  2013     1     1      740            728        12 UA        1696 N39463  EWR    ORD        150
 7  2013     1     1      913            854        19 B6         507 N516JB  EWR    FLL        158
 8  2013     1     1      709            723       -14 EV        5708 N829AS  LGA    IAD         53
 9  2013     1     1      838            846        -8 B6          79 N593JB  JFK    MCO        140
10  2013     1     1      753            745         8 AA         301 N3ALAA  LGA    ORD        138
# ... with 336,766 more rows, and 7 more variables: distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, dep_time <int>, sched_dep_time <int>, dep_delay <dbl>

From this, it looks as if the second argument of select() in Variation 2 is not being applied. I was of the opinion that select() is filter() for columns. Clearly, that's not the case. So, my questions are: (a) How could I have written less verbose code for select() without using multiple pipes, if that is possible? (b) How is select() different from filter() in what it does? (I think the answer depends on how rows and columns are structured differently, but my programming knowledge is negligible at best at this point. If the answer to this question is too large or out of scope for this forum, pointing or linking me to relevant sources will also do.)

This question is a continuation of the following, in case anyone wants to know why I came up with this weird comparison: previous question

CodePudding user response:

We can do this in a single select() call:

flights %>%
     select(year:arr_time, -(dep_time:dep_delay))

-output

# A tibble: 336,776 × 4
    year month   day arr_time
   <int> <int> <int>    <int>
 1  2013     1     1      830
 2  2013     1     1      850
 3  2013     1     1      923
 4  2013     1     1     1004
 5  2013     1     1      812
 6  2013     1     1      740
 7  2013     1     1      913
 8  2013     1     1      709
 9  2013     1     1      838
10  2013     1     1      753
# … with 336,766 more rows

The range dep_time:dep_delay is a subset of the columns in year:arr_time:

> names(flights)[1:7]
[1] "year"           "month"          "day"            "dep_time"       "sched_dep_time" "dep_delay"      "arr_time"    

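So -(dep_time:dep_delay) removes that range from the columns already selected by year:arr_time. The parentheses matter: unary minus binds more tightly than :, so -dep_time:dep_delay would be parsed as (-dep_time):dep_delay.

With !dep_time:dep_delay as the first argument, on the other hand, the selection starts from every column except dep_time to dep_delay. Arguments to the same select() call are combined as a union, so adding year:arr_time cannot drop anything; it only adds dep_time, sched_dep_time and dep_delay back at the end, which is why Variation 2 returns all 19 columns. On its own, the first argument selects: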

flights %>%
   select(!dep_time : dep_delay) %>% 
   names
 [1] "year"           "month"          "day"            "arr_time"       "sched_arr_time" "arr_delay"      "carrier"        "flight"        
 [9] "tailnum"        "origin"         "dest"           "air_time"       "distance"       "hour"           "minute"         "time_hour"   
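
The union behaviour described above can be checked on a small example. This is a minimal sketch, assuming dplyr 1.0 or later (where the tidyselect operators ! and & are available); df is a hypothetical one-row data frame whose column names mirror the first eight columns of flights, used here only so the example runs without nycflights13.

```r
library(dplyr)

# Hypothetical toy data: column names mirror the first eight columns of flights
df <- data.frame(
  year = 2013L, month = 1L, day = 1L,
  dep_time = 517L, sched_dep_time = 515L, dep_delay = 2,
  arr_time = 830L, sched_arr_time = 819L
)

# Comma-separated selections are unioned: the second argument adds the
# three excluded columns back, after the columns already selected
union_names <- df %>%
  select(!dep_time:dep_delay, year:arr_time) %>%
  names()

# `&` intersects the two selections, which matches what Variation 1
# achieves with two piped select() calls
intersect_names <- df %>%
  select(!dep_time:dep_delay & year:arr_time) %>%
  names()

print(union_names)
print(intersect_names)
```

If this holds on your installation, select(!dep_time:dep_delay & year:arr_time) is another single-call way to write Variation 1, alongside the -(dep_time:dep_delay) form.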

Regarding the difference between filter() and select() in how they treat the variadic (...) arguments, ?filter says:

... - Expressions that return a logical value, and are defined in terms of the variables in .data. If multiple expressions are included, they are combined with the & operator. Only rows for which all conditions evaluate to TRUE are kept.

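In ?select there is no such & rule. The expressions are combined as if by c(), that is, as a union built up from left to right, and a negated selection such as -(a:b) removes columns from what has been selected so far: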

... - One or more unquoted expressions separated by commas.
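
To make the contrast concrete, here is a small sketch of both rules side by side. It assumes dplyr 1.0 or later; the data frame df and its columns x, y and z are made up for illustration and are not part of the question.

```r
library(dplyr)

# Hypothetical data frame used only for this comparison
df <- data.frame(x = 1:6, y = c(10, 20, 30, 40, 50, 60), z = letters[1:6])

# filter(): comma-separated conditions act like one `&` expression,
# so each extra argument can only remove rows
rows_comma <- df %>% filter(x >= 2, y <= 40)
rows_and   <- df %>% filter(x >= 2 & y <= 40)

# select(): comma-separated selections act like c(), i.e. a union,
# so each extra positive argument can only add columns
cols_comma <- df %>% select(z, x) %>% names()
cols_c     <- df %>% select(c(z, x)) %>% names()
cols_or    <- df %>% select(z | x) %>% names()

print(identical(rows_comma, rows_and))
print(cols_comma)
```

In short, a comma means "and" in filter() but "or" in select(), which is why narrowing a column selection needs a negation, an explicit &, or a second select() call.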
