Suppose I want to calculate the trimmed mean of the following data:
set.seed(1)
s <- rnorm(10, 100, 20)
#[1] 87.47092 103.67287 83.28743 131.90562 106.59016 83.59063 109.74858 114.76649 111.51563 93.89223
I can use the mean function with a trim parameter of, say, 0.05, which gives a trimmed mean of 102.6441.
mean(s, trim = 0.05)
# 102.6441
However, if I trim manually, using only the data that lie between the 0.05 and 0.95 quantiles, I get a trimmed mean of 101.4059:
mean(s[which(s <= quantile(s, 0.95) & s >= quantile(s, 0.05))])
# 101.4059
Can anyone explain this behaviour? What does the trim parameter in the mean function actually do?
CodePudding user response:
According to the documentation in ?mean, the trim argument is:
the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint.
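Using trim removes that fraction of the observations from each end of the sorted data, that is, the smallest and the greatest values. With these 10 values, keeping only the data between the 0.05 and 0.95 quantiles removes exactly one value from each end, which is what trim = 0.1 does here. So to reproduce your manual result, use trim = 0.1 like this: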
set.seed(1)
s <- rnorm(10, 100, 20)
mean(s, trim = 0.1)
Output:
[1] 101.4059
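One way to check what trim = 0.1 removes for this particular sample is to drop the extremes by hand and compare. This is only a minimal sketch; the variable names (s_sorted, middle, by_hand, built_in) are illustrative and not from the original post.

```r
set.seed(1)
s <- rnorm(10, 100, 20)

# Sort the sample, then drop the single smallest and single largest value.
s_sorted <- sort(s)
middle <- s_sorted[2:(length(s_sorted) - 1)]

by_hand <- mean(middle)
built_in <- mean(s, trim = 0.1)

# The two results should agree up to floating-point tolerance.
isTRUE(all.equal(by_hand, built_in))
```

The agreement depends on the sample size: with n = 10, a fraction of 0.1 corresponds to exactly one observation per end.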
CodePudding user response:
The source of the function is available if you type mean.default. It calculates lo <- floor(n * trim) + 1 and hi <- n + 1 - lo, then takes the mean of the sorted values from position lo to position hi. Your dataset is small, so with trim = 0.05 you get floor(10 * 0.05) = 0, hence lo = 1 and hi = 10: the entire dataset is kept and the trimmed mean is the same as the full mean.
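The index arithmetic can be sketched as a pair of small helpers. Note that trim_bounds and trimmed_mean_sketch are hypothetical names introduced here for illustration, not functions from base R, and the sketch assumes 0 < trim < 0.5 and no missing values.

```r
# Hypothetical helper: the lo/hi positions kept for a sample of size n.
trim_bounds <- function(n, trim) {
  lo <- floor(n * trim) + 1
  hi <- n + 1 - lo
  c(lo = lo, hi = hi)
}

# Hypothetical helper: mean of the sorted values from position lo to hi.
trimmed_mean_sketch <- function(x, trim) {
  b <- trim_bounds(length(x), trim)
  mean(sort(x)[b[["lo"]]:b[["hi"]]])
}

set.seed(1)
s <- rnorm(10, 100, 20)

trim_bounds(10, 0.05)  # lo = 1, hi = 10: nothing is dropped
trim_bounds(10, 0.10)  # lo = 2, hi = 9: one value dropped per end

trimmed_mean_sketch(s, 0.05)  # same as mean(s) for this sample
trimmed_mean_sketch(s, 0.10)  # same as mean(s, trim = 0.1)
```

Because of the floor(), any trim below 1/n leaves a sample of size n untouched.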
On the other hand, the quantile() function by default interpolates between adjacent sorted values, so the 0.05 quantile lies strictly between the smallest and second-smallest observations here. Your method therefore effectively uses lo <- ceiling(n * trim) + 1, since it will drop the smallest and largest values for any positive value of trim (barring ties at the extremes).
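To see the interpolation at work on this sample, the following sketch prints the two cut-offs and the values the manual filter keeps. It relies on the default quantile type; the names q and kept are illustrative.

```r
set.seed(1)
s <- rnorm(10, 100, 20)

# Default quantiles interpolate between neighbouring sorted values,
# so both cut-offs fall strictly inside the range of the data.
q <- quantile(s, c(0.05, 0.95))
q
range(s)

# Values kept by the manual filter from the question.
kept <- s[s >= q[[1]] & s <= q[[2]]]

length(kept)      # 8 for this sample: the minimum and maximum are excluded
setdiff(s, kept)  # the two dropped values

# Matches the built-in trimmed mean with trim = 0.1 for n = 10.
isTRUE(all.equal(mean(kept), mean(s, trim = 0.1)))
```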