CD <- filter(CD, AGE <= quantile(AGE,probs=.75))
CD <- filter(CD, AMOUNT <= quantile(AMOUNT,probs=.75))
I'm attempting to remove outliers. I want to remove outliers for these two variables (possibly more).
Instead of removing outliers from one variable and another afterward-- which could potentially remove observations which were never outliers due to filtering a first time -- how can I grab the index placements of these outliers?
This way, I can simply select all indexes that are not included in the list that was just coded.
CodePudding user response:
in base R. This expression returns the index.
# Return index positions
which(CD$AMOUNT <= quantile(CD$AMOUNT, probs = .75))
# Return subsetted table
CD[which(CD$AMOUNT <= quantile(CD$AMOUNT, probs = .75)),]
Here's an example to return only the outliers that are present in all columns based on a condition.
set.seed(10)
CD = data.frame(AMOUNT = runif(100)*100,
AGE = runif(100)*25,
RATE = runif(100)*20)
# Return all the indexes that match the condition below
ids = sort(Reduce(union, lapply(CD, function(x) {
which(x <= quantile(x, probs = .50))
})))
# Return the outliers
CD[-ids,]
As you can see below we have the median value for each column.
> lapply(CD, function(x) {quantile(x, probs = .50)})
$AMOUNT
50%
46.25863
$AGE
50%
14.06169
$RATE
50%
12.11707
# The table below satisfies the condition
# where all values in each row are greater than the median of each column.
> CD[-ids,]
AMOUNT AGE RATE
11 65.16557 19.60416 14.26684
12 56.77378 21.06740 13.41244
29 77.07715 21.42183 16.44666
37 82.26526 17.82989 17.90719
40 50.05032 13.75819 14.70880
48 48.61003 15.31302 17.91584
50 80.15470 22.60330 14.68005
72 53.44268 19.04492 19.90746
73 64.13566 18.63244 13.79634
87 79.79930 21.90065 18.58642
94 51.71569 12.93184 19.87585