I am using a dataset of flights. I try to calculate the average arrival and departure delay for different airports (origin). I tried it with the aggregate function:
average_delay <- aggregate(cbind(arr_delay,dep_delay) ~ origin, FUN = mean, data = flights)
print(average_delay, digits = 3)
After that I tried it also with tidyverse:
library(tidyverse)
average_delay_tidy = flights %>%
group_by(origin) %>%
summarise(arr_delay = mean(arr_delay, na.rm = TRUE),
dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
mutate_if(is.numeric, round, digits = 3) %>%
print
However I got different outputs:
Aggregate:
|origin|arr_delay |dep_delay |
:------|:--------:|:--------:|
|EWR | 9.56 | **15.0** |
|JFK | 5.85 | **12.0** |
|LGA | 6.11 | 10.3 |
Tidyverse:
<chr> <dbl> <dbl>
|origin|arr_delay |dep_delay |
:------|:--------:|:--------:|
|EWR | 9.56 | **15.1** |
|JFK | 5.85 | **12.1** |
|LGA | 6.11 | 10.3 |
De difference is small, however, I do not understand how it is possible? Can someone explain to me why these outputs are different? Is it due to the NA values?
Thanks in advance!
CodePudding user response:
The difference is so small that it seemed to be related to rounding as tibble
does some print formatting for display, but in aggregate
, if there are NAs, specify the na.action
argument which by default is na.omit
i.e. it will remove the row completely if there is a single NA element in the row. Change it to NULL
aggregate(cbind(arr_delay,dep_delay) ~ origin, FUN = mean,
na.rm = TRUE, data = flights,
na.action = NULL)
-output
origin arr_delay dep_delay
1 EWR 9.107055 15.10795
2 JFK 5.551481 12.11216
3 LGA 5.783488 10.34688
To confirm if there are NA
elements
colSums(is.na(flights[c('arr_delay', 'dep_delay')]))
arr_delay dep_delay
9430 8255