Different ways of using dplyr::filter() to get the same output

Time:12-30

I am currently learning the tidyverse using Grolemund and Wickham's brilliant book, R4DS. While playing with the code, I realised that I get the same output from slight variations in how I write the arguments to the filter() verb. I would like confirmation of one of the following: (a) the output is exactly the same for both variations; (b) the output differs in some way I have not noticed; or (c) the output is the same but it is computed differently.

The two variations are as follows:

library(nycflights13)
library(tidyverse)

#Variation 1

flights %>%
  filter(arr_delay >= 120) %>%
  filter(dest == "IAH" | dest == "HOU") %>%
  filter(carrier == "UA" | carrier == "AA" | carrier == "DL") %>%
  filter(month >= 7 & month <= 9)

#Variation 2

flights %>%
  filter(arr_delay >= 120, 
         dest == "IAH" | dest == "HOU",
         carrier == "UA" | carrier == "AA" | carrier == "DL",
         month >= 7 & month <= 9)

For both, I get the same tibble output:

# A tibble: 47 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
 1  2013     7     1     1310           1057       133     1551           1338       133 UA     
 2  2013     7     1     1707           1448       139     1943           1742       121 UA     
 3  2013     7     1     2058           1735       203     2355           2030       205 AA     
 4  2013     7     2     2001           1735       146     2335           2030       185 AA     
 5  2013     7     3     2215           1909       186       45           2200       165 UA     
 6  2013     7     9     1937           1735       122     2240           2030       130 AA     
 7  2013     7    10       40           1909       331      301           2200       301 UA     
 8  2013     7    10     1629           1520        69     2048           1754       174 UA     
 9  2013     7    10     1913           1721       112     2214           2001       133 UA     
10  2013     7    17     1657           1446       131     2007           1745       142 UA     
# ... with 37 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Apologies if there is some mistake in the question format. This is my first time posting here.

CodePudding user response:

The two versions give exactly the same results. We can test this by storing the result of each and comparing them with identical():

test1 <- flights %>%
  filter(arr_delay >= 120) %>%
  filter(dest == "IAH" | dest == "HOU") %>%
  filter(carrier == "UA" | carrier == "AA" | carrier == "DL") %>%
  filter(month >= 7 & month <= 9)


test2 <- flights %>%
  filter(arr_delay >= 120, 
         dest == "IAH" | dest == "HOU",
         carrier == "UA" | carrier == "AA" | carrier == "DL",
         month >= 7 & month <= 9)

identical(test1, test2)
#> [1] TRUE
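
The reason they match is that filter() combines comma-separated conditions with &, and chaining several filter() calls keeps only the rows that pass every step, which amounts to the same thing. Here is a minimal sketch on a small made-up tibble (toy and its values are hypothetical, not taken from flights); it also shows %in% and between() as shorthand for the same conditions:

```r
library(dplyr)

# Hypothetical toy data standing in for flights (values are made up)
toy <- tibble(
  arr_delay = c(130, 50, 200, NA, 150, 125),
  dest      = c("IAH", "HOU", "HOU", "IAH", "JFK", "IAH"),
  carrier   = c("UA", "AA", "DL", "UA", "UA", "B6"),
  month     = c(7, 8, 9, 7, 8, 10)
)

# Form 1: one filter() call per condition
chained <- toy %>%
  filter(arr_delay >= 120) %>%
  filter(dest == "IAH" | dest == "HOU") %>%
  filter(carrier == "UA" | carrier == "AA" | carrier == "DL") %>%
  filter(month >= 7 & month <= 9)

# Form 2: comma-separated conditions in a single call
commas <- toy %>%
  filter(arr_delay >= 120,
         dest == "IAH" | dest == "HOU",
         carrier == "UA" | carrier == "AA" | carrier == "DL",
         month >= 7 & month <= 9)

# Form 3: the same conditions joined explicitly with &
ands <- toy %>%
  filter(arr_delay >= 120 &
           (dest == "IAH" | dest == "HOU") &
           (carrier == "UA" | carrier == "AA" | carrier == "DL") &
           (month >= 7 & month <= 9))

# Form 4: shorthand with %in% and between()
short <- toy %>%
  filter(arr_delay >= 120,
         dest %in% c("IAH", "HOU"),
         carrier %in% c("UA", "AA", "DL"),
         between(month, 7, 9))

# Compare every form against the chained version
identical(chained, commas)
identical(chained, ands)
identical(chained, short)
```

On this toy data all four forms keep the same two rows. Note that the row with a missing arr_delay is dropped in every form, because filter() only keeps rows where the condition evaluates to TRUE.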

They both benchmark similarly too:

library(microbenchmark)

microbenchmark(
  multi_filter = {
    flights %>%
      filter(arr_delay >= 120) %>%
      filter(dest == "IAH" | dest == "HOU") %>%
      filter(carrier == "UA" | carrier == "AA" | carrier == "DL") %>%
      filter(month >= 7 & month <= 9)
  },

  single_filter = {
    flights %>%
      filter(arr_delay >= 120,
             dest == "IAH" | dest == "HOU",
             carrier == "UA" | carrier == "AA" | carrier == "DL",
             month >= 7 & month <= 9)
  }
)

#> Unit: milliseconds
#>          expr     min       lq     mean   median       uq     max neval cld
#>  multi_filter 26.7920 27.33855 29.40675 27.70585 28.74525 40.6825   100  a 
#> single_filter 32.0836 32.77430 34.41295 33.26740 33.67100 55.9700   100   b

Calling filter() several times actually comes out a little faster in this benchmark (median of about 27.7 ms versus 33.3 ms), but the difference isn't massive. The flights data frame is large, with over 300,000 rows, so a gap of a few milliseconds on a data frame this big is unlikely to matter in most real-life applications.
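
To put that gap in perspective, here is a rough back-of-the-envelope calculation using the two medians from the benchmark table. The row count of 336,776 is an assumption on my part about the version of flights in use (you can check it with nrow(flights)), and the variable names are just for illustration:

```r
# Medians reported in the benchmark table (milliseconds)
median_multi  <- 27.70585
median_single <- 33.26740

# Assumption: number of rows in flights; verify with nrow(flights)
n_rows <- 336776

diff_ms    <- median_single - median_multi  # absolute gap per call, in ms
ratio      <- median_single / median_multi  # relative slowdown of the single call
ns_per_row <- diff_ms * 1e6 / n_rows        # gap spread over every row, in ns

round(c(diff_ms = diff_ms, ratio = ratio, ns_per_row = ns_per_row), 2)
```

That works out to a gap of roughly 5.6 ms per call, about 20% in relative terms, or on the order of 17 nanoseconds per row. These figures come from a single run on one machine, so treat them as indicative only.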

In this case, I think it largely comes down to individual preference.
