Removing extreme values in a dataframe while sorting by multiple columns in R

I have a dataframe like this:

mydf <- data.frame(A = c(40,9,55,1,2), B = c(12,1345,112,45,789))
mydf
   A    B
1 40   12
2  9 1345
3 55  112
4  1   45
5  2  789

I want to retain only 95% of the observations and throw out the 5% of the data that have extreme values. First, I calculate how many observations that is:

th <- length(mydf$A) * 0.95

And then I want to remove all the rows above th (or retain the rows below th, if you prefer). To remove only those extreme values, I need to sort mydf in ascending order. I tried several approaches:

mydf[order(mydf["A"], mydf["B"]),]
mydf[order(mydf$A,mydf$B),]
mydf[with(mydf, order(A,B)), ]
plyr::arrange(mydf,A,B)

but nothing works: mydf is not sorted in ascending order by the two columns at the same time. I looked at "Sort (order) data frame rows by multiple columns", but the most common solutions there do not work for me and I don't understand why.

However, if I consider only one column at a time (e.g., A), those ordering methods work, but then I don't get how to throw out the extreme values, because this:

mydf <- mydf[(order(mydf$A) < th),]

removes the second row, which has a value of 9, while my intent is to subset mydf, retaining only the rows below the threshold (meaning a number of observations here, not a value). I imagine I am missing something very simple and basic... And there are probably nicer tidyverse approaches.

CodePudding user response：

I think you want rank here, but it doesn't work on multiple columns. To work around that, note that, when there are no ties, rank(.) is equivalent to order(order(.)):

rank(mydf$A)
# [1] 4 3 5 1 2
order(order(mydf$A))
# [1] 4 3 5 1 2
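
One caveat, shown as a small sketch with a made-up vector x (not from the question): the two only agree when there are no ties. By default rank() averages tied ranks, whereas order(order(.)) breaks ties by position, which matches rank(., ties.method = "first").

```r
x <- c(10, 20, 10, 30)            # hypothetical vector with a tie

r_avg   <- rank(x)                          # 1.5 3.0 1.5 4.0 (ties averaged)
r_ord   <- order(order(x))                  # 1 3 2 4 (ties broken by position)
r_first <- rank(x, ties.method = "first")   # 1 3 2 4

identical(r_ord, r_first)         # TRUE
```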

With that, we can order on both (all) columns, then order again, then compare the resulting ranks with your th value.

mydf[order(do.call(order, mydf)) < th,]
#    A    B
# 1 40   12
# 2  9 1345
# 4  1   45
# 5  2  789

This approach has the benefit of preserving the original order of the rows.
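
To see why the second order() is needed, it may help to look at the intermediate values. The sketch below uses the data from the question; the names ord, rnk and kept are only for illustration. It also shows why the attempt in the question dropped row 2: order() returns row positions in sorted order, not the rank of each row.

```r
mydf <- data.frame(A = c(40, 9, 55, 1, 2), B = c(12, 1345, 112, 45, 789))
th <- nrow(mydf) * 0.95        # 4.75

# order() gives the row positions in sorted order (same as order(mydf$A) here,
# because A has no ties)
ord <- do.call(order, mydf)    # 4 5 2 1 3
ord < th                       # TRUE FALSE TRUE TRUE TRUE: wrongly drops row 2

# ordering the ordering gives the rank of each row
rnk <- order(ord)              # 4 3 5 1 2
rnk < th                       # TRUE TRUE FALSE TRUE TRUE: drops row 3 (A = 55)

kept <- mydf[rnk < th, ]
```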

If you would prefer to stick with a single call to order, then you can reorder them and use head:

head(mydf[order(mydf$A, mydf$B),], th)
#    A    B
# 4  1   45
# 5  2  789
# 2  9 1345
# 1 40   12

though this does not preserve the original order of rows (which may or may not be important to you).
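
Since the question mentions the tidyverse, roughly the same thing in dplyr might look like the sketch below. It assumes dplyr 1.0.0 or later (for slice_head()); n_keep and trimmed are made-up names, and floor() is my choice for turning 4.75 into a whole number of rows.

```r
library(dplyr)

mydf <- data.frame(A = c(40, 9, 55, 1, 2), B = c(12, 1345, 112, 45, 789))
n_keep <- floor(nrow(mydf) * 0.95)   # 4; floor() makes the rounding explicit

trimmed <- mydf %>%
  arrange(A, B) %>%          # sort by A, using B to break ties
  slice_head(n = n_keep)     # keep the first n_keep rows
```

As with head(), the result comes back sorted rather than in the original row order.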

CodePudding user response：

Possible approach

An alternative to your approach would be to use a dplyr ranking function such as cume_dist() or percent_rank(). These can accept a dataframe as input and return ranks / percentiles based on all columns.

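library(dplyr)  # ranking a whole data frame needs a recent dplyr (1.1.0 or later, as far as I know)
set.seed(13)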
dat_all <- data.frame(
  A = sample(1:60, 100, replace = TRUE),
  B = sample(1:1500, 100, replace = TRUE)
)
nrow(dat_all)
# 100

dat_95 <- dat_all[cume_dist(dat_all) <= .95, ]
nrow(dat_95)
# 95

General cautions about quantiles

More generally, keep in mind that defining quantiles is slippery, as there are multiple possible approaches. You'll want to think about what makes the most sense given your goal. As an example, from the dplyr docs:

cume_dist(x) counts the total number of values less than or equal to x_i, and divides it by the number of observations.

percent_rank(x) counts the total number of values less than x_i, and divides it by the number of observations minus 1.

Some implications of this are that the lowest value is 1 / nrow() for cume_dist() (or higher, if the lowest value is tied) but always 0 for percent_rank(), while the highest value is always 1 for both methods. This means different cases might be excluded depending on the method. It also means the code I provided will always remove the highest-ranking row, which may or may not match your expectations. (For example, in a vector with just 5 elements, is the highest value "above the 95th percentile"? It depends on how you define it.)
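
To make the difference concrete, here is a base R sketch that follows the two definitions quoted above, applied to column A from the question. It is only meant to mirror those definitions, not to reproduce dplyr's implementation; cd and pr are illustrative names.

```r
x <- c(40, 9, 55, 1, 2)   # column A from the question
n <- length(x)

# number of values <= x_i, divided by n (the cume_dist() definition)
cd <- rank(x, ties.method = "max") / n               # 0.8 0.6 1.0 0.2 0.4

# number of values < x_i, divided by n - 1 (the percent_rank() definition)
pr <- (rank(x, ties.method = "min") - 1) / (n - 1)   # 0.75 0.50 1.00 0.00 0.25

x[cd <= 0.95]   # 40 9 1 2: the largest value, 55, is dropped
x[pr > 0.05]    # 40 9 55 2: trimming the low end with percent_rank drops 1
x[cd > 0.05]    # all five values: cume_dist is never below 1/n = 0.2 here
```

So the two methods agree at the top of this small vector but not at the bottom, which is why it is worth deciding which definition you actually want.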