I have a load of survey data that I need to remove the length outliers from. It looks something like this (but not much like this, a dolphin is unlikely to be 52mm):
Area Season Species Length (mm)
Christchurch Spring dolphin 52
Christchurch Spring dolphin 54
Christchurch Spring dolphin 46
Christchurch Spring dolphin 40
Christchurch Spring dolphin 38
Christchurch Autumn dolphin 52
Christchurch Autumn dolphin 54
Christchurch Autumn dolphin 46
Christchurch Autumn dolphin 40
Christchurch Autumn dolphin 38
Christchurch Spring ray 52
Christchurch Spring ray 54
Christchurch Spring ray 46
Christchurch Spring ray 40
Christchurch Spring ray 38
Christchurch Autumn ray 52
Christchurch Autumn ray 54
Christchurch Autumn ray 46
Christchurch Autumn ray 40
Christchurch Autumn ray 38
My problem is that I have 6 areas, a range of species at each, and about 2000 measurements, and I need to remove the length outliers for each species in each season and location. I am very new to R and coding in general, so any help in making this process more efficient is appreciated; I am fully aware I have probably not gone about this in the most streamlined way.
I have used a loop to subset the original data by area giving me 6 new data frames, and then each of those by season and species which gives me something around 30 data frames.
I have now run out of steam and can't work out how to remove the outliers from each data frame without putting each one individually into the code below.
Q <- quantile(au_ray_christ$TOTAL_LENGTH_MM, probs = c(.25, .75), na.rm = FALSE)
iqr <- IQR(au_ray_christ$TOTAL_LENGTH_MM)
au_ray_christ <- subset(au_ray_christ, au_ray_christ$TOTAL_LENGTH_MM > (Q[1] - 1.5 * iqr) & au_ray_christ$TOTAL_LENGTH_MM < (Q[2] + 1.5 * iqr))
Any help would be appreciated, I'm more than happy to go back a few steps if it stops me having to copy and paste my life away!
P.S. I also eventually need the data to be back in one frame, so if I can avoid splitting it all up in any way, that would be great too.
CodePudding user response:
You can also try the R package outliers together with dplyr:
library(outliers)
library(dplyr)
# I assume dat is the dataframe with your raw data
dat %>% filter(Species == "dolphin") %>%
  select(Length_mm) %>%
  unlist %>%
  grubbs.test()
dat %>% filter(Species == "ray") %>%
  select(Length_mm) %>%
  unlist %>%
  grubbs.test()
Using the data frame from user2974951's answer, which does contain one outlier, the results are:
For the dolphins: (no outlier found!)
Grubbs test for one outlier
data: .
G.Length_mm7 = 1.20000, U = 0.82222, p-value = 1
alternative hypothesis: highest value 54 is an outlier
For the rays (outlier found!):
Grubbs test for one outlier
data: .
G.Length_mm6 = 2.81052, U = 0.13111, p-value = 0.0001611
alternative hypothesis: highest value 100 is an outlier
This solution, however, does not produce a tidy table with a TRUE/FALSE column.
CodePudding user response:
Here is a base R way to identify "outliers" by area, season, and species. Note that I added one more row with a very different value, since there are no "outliers" in your current data, and that this only flags them; it does not remove them.
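do.call(
  rbind,
  by(
    df,
    list(df$Area, df$Season, df$Species),
    function(x) {
      # quartiles and IQR of the lengths within this Area/Season/Species group
      Q <- quantile(x$Length_mm, probs = c(.25, .75), na.rm = FALSE)
      iqr <- IQR(x$Length_mm)
      # flag values outside the 1.5 * IQR fences
      cbind(
        x,
        "out" = x$Length_mm < (Q[1] - 1.5 * iqr) | x$Length_mm > (Q[2] + 1.5 * iqr)
      )
    }
  )
)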
resulting in
Area Season Species Length_mm out
6 Christchurch Autumn dolphin 52 FALSE
7 Christchurch Autumn dolphin 54 FALSE
8 Christchurch Autumn dolphin 46 FALSE
9 Christchurch Autumn dolphin 40 FALSE
10 Christchurch Autumn dolphin 38 FALSE
1 Christchurch Spring dolphin 52 FALSE
2 Christchurch Spring dolphin 54 FALSE
3 Christchurch Spring dolphin 46 FALSE
4 Christchurch Spring dolphin 40 FALSE
5 Christchurch Spring dolphin 38 FALSE
16 Christchurch Autumn ray 52 FALSE
17 Christchurch Autumn ray 54 FALSE
18 Christchurch Autumn ray 46 FALSE
19 Christchurch Autumn ray 40 FALSE
20 Christchurch Autumn ray 38 FALSE
21 Christchurch Autumn ray 100 TRUE
11 Christchurch Spring ray 52 FALSE
12 Christchurch Spring ray 54 FALSE
13 Christchurch Spring ray 46 FALSE
14 Christchurch Spring ray 40 FALSE
15 Christchurch Spring ray 38 FALSE
I would advise against removing these rows based on some arbitrary IQR rule. Instead, you can use this new column to inspect your data and figure out why these values are "outlying".
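If you want the flag without splitting the data at all (the question asks for everything to stay in one frame), here is a minimal sketch of the same 1.5 * IQR rule using base R's ave(), which applies a function within groups and returns the results in the original row order. The helper name is_iqr_outlier is hypothetical, df is my reconstruction of the sample data above (including the extra 100 mm row), and na.rm = TRUE is my own choice rather than something from the question.

```r
# Hypothetical helper: TRUE for values outside the 1.5 * IQR fences.
# na.rm = TRUE is an assumption; missing lengths come back as NA.
is_iqr_outlier <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x < (q[1] - fence) | x > (q[2] + fence)
}

# Reconstructed sample data: 5 lengths per group plus one extreme value.
lengths <- c(52, 54, 46, 40, 38)
df <- data.frame(
  Area      = "Christchurch",
  Season    = rep(c("Spring", "Autumn", "Spring", "Autumn"), each = 5),
  Species   = rep(c("dolphin", "ray"), each = 10),
  Length_mm = rep(lengths, times = 4)
)
df <- rbind(df, data.frame(Area = "Christchurch", Season = "Autumn",
                           Species = "ray", Length_mm = 100))

# ave() returns a numeric vector (0/1 here), so convert back to logical.
df$out <- as.logical(
  ave(df$Length_mm, df$Area, df$Season, df$Species, FUN = is_iqr_outlier)
)

# Rows to look at more closely before deciding anything.
flagged <- df[which(df$out), ]

# Only if removal is justified after inspection: keep the unflagged rows.
kept <- df[!(df$out %in% TRUE), ]
```

Because nothing is split, df keeps its original row order and the flagged rows can be reviewed in context before anything is dropped.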