(I have posted a similar question at Cross Validated, but I believe this is more fitting for Stack Overflow).
I have a large dataframe data
with following columns:
date time orig new
2001-01-01 00:30:00 345 856
2001-01-01 00:32:43 4575 9261
2001-01-01 00:51:07 6453 2352
...
2001-01-01 23:57:51 421 168
2001-01-02 00:06:14 5612 3462
...
2001-01-31 23:49:11 14420 8992
2001-02-01 00:04:32 213 521
...
I want to calculate the monthly aggregated MAD, which can be calculated by mean(abs(orig - new))
when grouped by month
. Ideally, at the end, I want the solutions (dataframe) in a following form:
month mad
2001-01-01 7452.124
2001-02-01 3946.734
2001-03-01 995.938
...
I calculated the monthly MAD in two different ways.
Approach 1
I grouped data
by month and took an average of the summed absolute differences (which is a "mathematical" way to do it, as I explained):
data %>%
group_by(
month = lubridate::floor_date(date, 'month')
) %>%
summarise(mad = mean(abs(orig - new)))
Approach 2
I grouped data
by hour and got the MAD grouped by hour, and then re-grouped it by month and took an average. This is counter-intuitive, but I used the hourly grouped dataframe for other analyses and tried computing the monthly MAD from this dataframe directly.
data_grouped_by_hour <- data %>%
group_by(
day = lubridate::floor_date(date, 'day'),
hour = as.POSIXlt(time)$hour
) %>%
summarise(mad = mean(abs(orig - new)))
data_grouped_by_hour %>%
group_by(
month = lubridate::floor_date(date, 'month')
) %>%
summarise(mad = mean(mad))
As hinted from the post title, these approaches return different values. I assume my first approach is correct, as it is more concise and follows the accurate concept, but I wonder why the second approach does not return the same value.
I want to note that I would prefer Approach 2 so that I don't have to make separate tables for every analysis with different time unit. Any insights are appreciated.
CodePudding user response:
Because average of average is not the same as complete average.
This is a common misconception. Let's try to understand with the help of an example -
Consider a list with 2 elements a
and b
x <- list(a = c(1, 5, 4, 3, 2, 8), b = c(6, 5))
Now, similar to your question we will take average in 2 ways -
- Average of all the values of
x
res1 <- mean(unlist(x))
res1
#[1] 4.25
- Average of each element separately and then complete average.
sapply(x, mean)
# a b
#3.833333 5.500000
res2 <- mean(sapply(x, mean))
res2
#[1] 4.666667
Notice that res1
and res2
has different values because the 2nd case is average of averages.
The same logic applies in your case as well when you take daily average and then monthly which is average of averages.