Different ways to trim the mean with different results

Time:06-15

Suppose I want to calculate the trimmed mean of the following data:

set.seed(1)
s<-rnorm(10,100,20)
#[1]  87.47092 103.67287  83.28743 131.90562 106.59016  83.59063 109.74858 114.76649 111.51563  93.89223

I can use the mean function with a trim parameter of, say, 0.05, which gives a trimmed mean of 102.6441.

mean(s, trim = 0.05)
# 102.6441

However, if I trim manually, keeping only the data that lies between the 0.05 quantile and the 0.95 quantile, I get a trimmed mean of 101.4059.

mean(s[which(s <= quantile(s, 0.95) & s >= quantile(s, 0.05))])
# 101.4059

Can anyone explain this behaviour? What does the trim parameter in the mean function actually do?

CodePudding user response:

According to the documentation in ?mean, the trim argument is:

the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint.

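Using trim removes that fraction of the observations from each end of the sorted data, so it drops the smallest and greatest values. With your 10 observations, keeping only the data between the 0.05 and 0.95 quantiles excludes one value at each end, which corresponds to trim = 0.1, like this: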

set.seed(1)
s<-rnorm(10,100,20)
mean(s, trim = 0.1)

Output:

[1] 101.4059

CodePudding user response:

The source of the function is available if you type mean.default. It calculates lo <- floor(n * trim) + 1 and hi <- n + 1 - lo, then takes the mean of the sorted values from position lo to hi. Your dataset is small: with n = 10 and trim = 0.05, floor(n * trim) is 0, so lo is 1 and hi is 10. You get the entire dataset, and the trimmed mean is the same as the full mean.
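The index arithmetic described above can be reproduced by hand. The following is only a minimal sketch: trimmed_mean_sketch is a hypothetical helper written for illustration, not the actual body of mean.default, and it ignores special cases such as trim >= 0.5 and missing values.

```r
set.seed(1)
s <- rnorm(10, 100, 20)

# Hypothetical helper (not part of base R): mirrors the lo/hi arithmetic
# described in the answer, without the edge-case handling of mean.default.
trimmed_mean_sketch <- function(x, trim) {
  n  <- length(x)
  lo <- floor(n * trim) + 1   # first sorted position that is kept
  hi <- n + 1 - lo            # last sorted position that is kept
  mean(sort(x)[lo:hi])
}

# n = 10, trim = 0.05: lo = 1, hi = 10, so nothing is dropped
res_005 <- trimmed_mean_sketch(s, 0.05)

# n = 10, trim = 0.10: lo = 2, hi = 9, so the minimum and maximum are dropped
res_010 <- trimmed_mean_sketch(s, 0.10)

print(c(res_005, res_010))
```

If the sketch is right, res_005 should agree with mean(s) and mean(s, trim = 0.05), while res_010 should agree with mean(s, trim = 0.1).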

On the other hand, the quantile() function uses a more complicated definition of quantiles that, by default, interpolates between adjacent sorted values. So your method effectively uses lo <- ceiling(n * trim) + 1: when the values are distinct, the interpolated cutoffs fall strictly inside the range of the data, so it will drop the smallest and largest values for any positive value of trim.
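To see the effect of the interpolated cutoffs on the question's data, one can inspect the quantile bounds and count what survives the filter. This is a small sketch using only base R; the variable names bounds and kept are chosen here for illustration.

```r
set.seed(1)
s <- rnorm(10, 100, 20)

# Default quantile() interpolates between adjacent order statistics,
# so both cutoffs lie strictly between two observed values here.
bounds <- quantile(s, c(0.05, 0.95))

# Same filter as in the question, without the redundant which()
kept <- s[s >= bounds[1] & s <= bounds[2]]

print(range(s))      # smallest and largest observations
print(bounds)        # cutoffs fall inside that range
print(length(kept))  # number of observations that survive the filter
print(mean(kept))
```

Because the minimum lies below the lower cutoff and the maximum above the upper one, both are excluded, which is the same set of values that mean(s, trim = 0.1) averages.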
