I am currently learning the tidyverse using Grolemund and Wickham's brilliant book, R4DS. While playing with the code, I realised that I was getting the same output from slight variations in how I wrote the arguments within the filter() verb. I would like confirmation of one of the following: (a) the output is exactly the same for both variations; (b) the output is somehow different but I have not noticed; or (c) the output is the same but the way it is computed is different.
The two variations are as follows:
library(nycflights13)
library(tidyverse)
# Variation 1: one filter() call per condition
flights %>%
  filter(arr_delay >= 120) %>%
  filter(dest == "IAH" | dest == "HOU") %>%
  filter(carrier == "UA" | carrier == "AA" | carrier == "DL") %>%
  filter(month >= 7 & month <= 9)

# Variation 2: all conditions in a single filter() call
flights %>%
  filter(arr_delay >= 120,
         dest == "IAH" | dest == "HOU",
         carrier == "UA" | carrier == "AA" | carrier == "DL",
         month >= 7 & month <= 9)
For both, I get the same tibble output:
# A tibble: 47 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 7 1 1310 1057 133 1551 1338 133 UA
2 2013 7 1 1707 1448 139 1943 1742 121 UA
3 2013 7 1 2058 1735 203 2355 2030 205 AA
4 2013 7 2 2001 1735 146 2335 2030 185 AA
5 2013 7 3 2215 1909 186 45 2200 165 UA
6 2013 7 9 1937 1735 122 2240 2030 130 AA
7 2013 7 10 40 1909 331 301 2200 301 UA
8 2013 7 10 1629 1520 69 2048 1754 174 UA
9 2013 7 10 1913 1721 112 2214 2001 133 UA
10 2013 7 17 1657 1446 131 2007 1745 142 UA
# ... with 37 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Apologies if there is some mistake in the question format. This is my first time posting here.
Answer:
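The two versions give exactly the same results. filter() combines comma-separated conditions with &, so both forms keep only the rows that satisfy every condition. We can test this by storing the result of each and comparing them with the identical() function: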
# Chained filter() calls
test1 <- flights %>%
  filter(arr_delay >= 120) %>%
  filter(dest == "IAH" | dest == "HOU") %>%
  filter(carrier == "UA" | carrier == "AA" | carrier == "DL") %>%
  filter(month >= 7 & month <= 9)

# Single filter() call with comma-separated conditions
test2 <- flights %>%
  filter(arr_delay >= 120,
         dest == "IAH" | dest == "HOU",
         carrier == "UA" | carrier == "AA" | carrier == "DL",
         month >= 7 & month <= 9)

identical(test1, test2)
#> [1] TRUE
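The same equivalence can be checked on a small example that includes missing values. This is only a sketch: the `toy` tibble and its values are made up for illustration and are not taken from nycflights13. It compares chained filter() calls, comma-separated conditions, and an explicit `&`.

```r
library(dplyr)

# Hypothetical toy data (an assumption for illustration, not from flights)
toy <- tibble(
  dest      = c("IAH", "HOU", "JFK", "IAH", NA),
  arr_delay = c(130, 90, 200, NA, 150),
  month     = c(7, 8, 9, 7, 8)
)

# One filter() call per condition
chained <- toy %>%
  filter(arr_delay >= 120) %>%
  filter(dest == "IAH" | dest == "HOU")

# Comma-separated conditions in a single call
comma <- toy %>%
  filter(arr_delay >= 120, dest == "IAH" | dest == "HOU")

# The same conditions joined explicitly with &
ampersand <- toy %>%
  filter(arr_delay >= 120 & (dest == "IAH" | dest == "HOU"))

# Compare the three results pairwise
identical(chained, comma)
identical(comma, ampersand)
```

Rows where a condition evaluates to NA are dropped by filter() in all three forms, which is why missing values do not break the equivalence here.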
They both benchmark similarly too:
library(microbenchmark)

microbenchmark(
  multi_filter = {
    flights %>%
      filter(arr_delay >= 120) %>%
      filter(dest == "IAH" | dest == "HOU") %>%
      filter(carrier == "UA" | carrier == "AA" | carrier == "DL") %>%
      filter(month >= 7 & month <= 9)
  },
  single_filter = {
    flights %>%
      filter(arr_delay >= 120,
             dest == "IAH" | dest == "HOU",
             carrier == "UA" | carrier == "AA" | carrier == "DL",
             month >= 7 & month <= 9)
  })
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> multi_filter 26.7920 27.33855 29.40675 27.70585 28.74525 40.6825 100 a
#> single_filter 32.0836 32.77430 34.41295 33.26740 33.67100 55.9700 100 b
Calling filter() several times actually seems a little faster in this benchmark (a median of about 27.7 ms versus 33.3 ms). However, the difference isn't massive. The flights data frame is large, with over 300,000 rows, so a gap of a few milliseconds on a data frame this big is unlikely to translate into a noticeable difference in most real-life applications.
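One plausible, unverified explanation for the small gap is that each chained filter() only has to check the rows that survived the previous step, whereas a single call evaluates every condition on the full data frame. The sketch below illustrates the shrinking row counts on made-up data; `toy`, its columns, and the seed are assumptions for illustration, and it does not measure timing.

```r
library(dplyr)

# Hypothetical toy data standing in for flights (values are random)
set.seed(1)
toy <- tibble(
  arr_delay = sample(0:300, 1000, replace = TRUE),
  month     = sample(1:12, 1000, replace = TRUE)
)

# Chained version, kept in separate steps so the sizes can be inspected
step1 <- toy %>% filter(arr_delay >= 120)
step2 <- step1 %>% filter(month >= 7 & month <= 9)

# Single-call version for comparison
single <- toy %>% filter(arr_delay >= 120, month >= 7 & month <= 9)

nrow(toy)    # rows each condition is evaluated on in the single call
nrow(step1)  # rows the second chained filter() has to check
nrow(step2)  # rows in the final result
identical(step2, single)
```

Whether this saving outweighs the overhead of the extra function calls depends on the data and on how selective the first condition is, so it is worth benchmarking on your own data before drawing conclusions.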
In this case, I think it largely comes down to individual preference.