Different ways of using dplyr::filter() to get the same output

I am currently learning the tidyverse using Grolemund and Wickham's brilliant book, R4DS. While playing with the code, I realised that I get the same output from slight variations in how I write the arguments to filter(). I would like confirmation of which of the following is true: (a) the output is exactly the same for both variations; (b) the output differs in some way I have not noticed; or (c) the output is the same, but the way it is computed differs.

The two variations are as follows:

library(nycflights13)
library(tidyverse)

#Variation 1

flights %>%
  filter(arr_delay >= 120) %>%
  filter(dest == "IAH" | dest == "HOU") %>%
  filter(carrier == "UA" | carrier == "AA" | carrier == "DL") %>%
  filter(month >= 7 & month <= 9)

#Variation 2

flights %>%
  filter(arr_delay >= 120, 
         dest == "IAH" | dest == "HOU",
         carrier == "UA" | carrier == "AA" | carrier == "DL",
         month >= 7 & month <= 9)

For both, I get the same tibble output:

# A tibble: 47 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
 1  2013     7     1     1310           1057       133     1551           1338       133 UA     
 2  2013     7     1     1707           1448       139     1943           1742       121 UA     
 3  2013     7     1     2058           1735       203     2355           2030       205 AA     
 4  2013     7     2     2001           1735       146     2335           2030       185 AA     
 5  2013     7     3     2215           1909       186       45           2200       165 UA     
 6  2013     7     9     1937           1735       122     2240           2030       130 AA     
 7  2013     7    10       40           1909       331      301           2200       301 UA     
 8  2013     7    10     1629           1520        69     2048           1754       174 UA     
 9  2013     7    10     1913           1721       112     2214           2001       133 UA     
10  2013     7    17     1657           1446       131     2007           1745       142 UA     
# ... with 37 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Apologies if there is some mistake in the question format. This is my first time posting here.

CodePudding user response:

The two versions give exactly the same result. We can test this by storing each result and comparing them with identical():

test1 <- flights %>%
  filter(arr_delay >= 120) %>%
  filter(dest == "IAH" | dest == "HOU") %>%
  filter(carrier == "UA" | carrier == "AA" | carrier == "DL") %>%
  filter(month >= 7 & month <= 9)


test2 <- flights %>%
  filter(arr_delay >= 120, 
         dest == "IAH" | dest == "HOU",
         carrier == "UA" | carrier == "AA" | carrier == "DL",
         month >= 7 & month <= 9)

identical(test1, test2)
#> [1] TRUE
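
One way to see why the results agree: filter() combines comma-separated conditions with &, and each chained filter() call keeps only the rows that passed the previous one, so both forms keep exactly the rows that satisfy every condition. Below is a minimal sketch on a made-up toy tibble. The data and the names toy, chained, commas and anded are illustrative only and do not come from the question. Note the parentheses in the last version: & binds more tightly than | in R, so they are needed when the conditions are written as a single expression.

```r
library(dplyr)

# Hypothetical toy data; the NA shows that rows where a condition
# evaluates to NA are dropped in every variant
toy <- tibble(
  arr_delay = c(130, 150, 90, NA, 200),
  dest      = c("IAH", "HOU", "IAH", "HOU", "JFK"),
  month     = c(7, 8, 9, 7, 7)
)

# Variant 1: chained filter() calls
chained <- toy %>%
  filter(arr_delay >= 120) %>%
  filter(dest == "IAH" | dest == "HOU")

# Variant 2: comma-separated conditions in a single call
commas <- toy %>%
  filter(arr_delay >= 120, dest == "IAH" | dest == "HOU")

# Variant 3: one explicit logical expression joined with &
anded <- toy %>%
  filter(arr_delay >= 120 & (dest == "IAH" | dest == "HOU"))

identical(chained, commas)
identical(commas, anded)
```

Both identical() calls should return TRUE here, with each variant keeping the first two rows of toy.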

They both benchmark similarly too:

library(microbenchmark)

microbenchmark(
  multi_filter = {
    flights %>%
      filter(arr_delay >= 120) %>%
      filter(dest == "IAH" | dest == "HOU") %>%
      filter(carrier == "UA" | carrier == "AA" | carrier == "DL") %>%
      filter(month >= 7 & month <= 9)
  },
  single_filter = {
    flights %>%
      filter(arr_delay >= 120,
             dest == "IAH" | dest == "HOU",
             carrier == "UA" | carrier == "AA" | carrier == "DL",
             month >= 7 & month <= 9)
  }
)

#> Unit: milliseconds
#>          expr     min       lq     mean   median       uq     max neval cld
#>  multi_filter 26.7920 27.33855 29.40675 27.70585 28.74525 40.6825   100  a 
#> single_filter 32.0836 32.77430 34.41295 33.26740 33.67100 55.9700   100   b

In this benchmark, calling filter() several times is actually slightly faster (a median of about 27.7 ms versus 33.3 ms). The difference is small, though: flights has over 300,000 rows, so a gap of a few milliseconds on a data frame this size is unlikely to matter in most real-life applications.

In this case, I think it largely comes down to individual preference.
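
If the repeated == comparisons feel verbose, the same conditions can also be written more compactly. The sketch below is a stylistic alternative that does not appear in the question: it swaps in %in% for the chains of == joined by |, and dplyr's between() (inclusive on both ends) for the month range. The names original and compact are illustrative.

```r
library(dplyr)
library(nycflights13)

# The single-call version from the question
original <- flights %>%
  filter(arr_delay >= 120,
         dest == "IAH" | dest == "HOU",
         carrier == "UA" | carrier == "AA" | carrier == "DL",
         month >= 7 & month <= 9)

# Same conditions using %in% and between()
compact <- flights %>%
  filter(arr_delay >= 120,
         dest %in% c("IAH", "HOU"),
         carrier %in% c("UA", "AA", "DL"),
         between(month, 7, 9))

identical(original, compact)
nrow(compact)
```

On this data the two should agree, giving the same 47 rows shown in the question. One difference to keep in mind in general: x == "a" yields NA when x is missing, whereas x %in% "a" yields FALSE. Inside filter() both cases drop the row, so the result is unaffected here.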