Remove outliers from multiple dataframes

Time:11-04

I have a load of survey data that I need to remove the length outliers from. It looks something like this (but not much like this, a dolphin is unlikely to be 52mm):

Area          Season  Species  Length (mm)
Christchurch  Spring  dolphin  52
Christchurch  Spring  dolphin  54
Christchurch  Spring  dolphin  46
Christchurch  Spring  dolphin  40
Christchurch  Spring  dolphin  38
Christchurch  Autumn  dolphin  52
Christchurch  Autumn  dolphin  54
Christchurch  Autumn  dolphin  46
Christchurch  Autumn  dolphin  40
Christchurch  Autumn  dolphin  38
Christchurch  Spring  ray      52
Christchurch  Spring  ray      54
Christchurch  Spring  ray      46
Christchurch  Spring  ray      40
Christchurch  Spring  ray      38
Christchurch  Autumn  ray      52
Christchurch  Autumn  ray      54
Christchurch  Autumn  ray      46
Christchurch  Autumn  ray      40
Christchurch  Autumn  ray      38

My problem is that I have 6 areas, a range of species at each, and about 2000 measurements, and I need to remove the length outliers for each species in each season at each location. I am very new to R and to coding in general, so any help in making this process more efficient is appreciated; I am fully aware that I have probably not gone about this in the most streamlined way.

I have used a loop to subset the original data by area, giving me 6 new data frames, and then subset each of those by season and species, which gives me around 30 data frames.

I have now run out of steam and can't work out how to remove the outliers from each data frame without feeding each one individually into the code below.

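# First and third quartiles and the interquartile range of the lengths
Q<-quantile(au_ray_christ$TOTAL_LENGTH_MM, probs=c(.25,.75),na.rm=FALSE)
iqr<-IQR(au_ray_christ$TOTAL_LENGTH_MM)
# Keep only rows strictly inside Q1 - 1.5*IQR and Q3 + 1.5*IQR
au_ray_christ<-subset(au_ray_christ,au_ray_christ$TOTAL_LENGTH_MM >(Q[1]-1.5*iqr) & au_ray_christ$TOTAL_LENGTH_MM < (Q[2]+1.5*iqr))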

Any help would be appreciated, I'm more than happy to go back a few steps if it stops me having to copy and paste my life away!

P.S. I also eventually need the data back in one data frame, so if I can avoid splitting it all up in any way, that would be great too.

CodePudding user response:

You can also try the R package outliers together with dplyr:

library(outliers)
library(dplyr)

# I assume dat is the dataframe with your raw data

dat %>% filter(Species=="dolphin") %>% 
  select(Length_mm) %>%
  unlist %>%
  grubbs.test()

dat %>% filter(Species=="ray") %>% 
  select(Length_mm) %>%
  unlist %>%
  grubbs.test()

Using user2974951's data frame, which does contain one outlier, the result is:

For the dolphins (no outlier found!):

    Grubbs test for one outlier

data:  .
G.Length_mm7 = 1.20000, U = 0.82222, p-value = 1
alternative hypothesis: highest value 54 is an outlier

For the rays (outlier found!):

    Grubbs test for one outlier

data:  .
G.Length_mm6 = 2.81052, U = 0.13111, p-value = 0.0001611
alternative hypothesis: highest value 100 is an outlier

However, this solution does not create such a nice table with a TRUE/FALSE column.
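To get one table instead of one printout per species, the test can be run once per group and the p-values collected. The following is only a sketch under assumptions: the column names (Area, Season, Species, Length_mm) and the example data, including the added 100 mm ray, are hypothetical stand-ins for your real data. Keep in mind that grubbs.test() only examines the single most extreme value in each group and assumes roughly normal data.

```r
library(outliers)
library(dplyr)

# Hypothetical example data: the question's sample plus one added 100 mm ray
dat <- data.frame(
  Area      = "Christchurch",
  Season    = rep(c("Spring", "Autumn", "Spring", "Autumn"), times = c(5, 5, 5, 6)),
  Species   = rep(c("dolphin", "ray"), times = c(10, 11)),
  Length_mm = c(rep(c(52, 54, 46, 40, 38), 4), 100)
)

# One Grubbs test per Area/Season/Species group, collected into one table
grubbs_by_group <- dat %>%
  group_by(Area, Season, Species) %>%
  summarise(
    n       = n(),                             # group size
    p_value = grubbs.test(Length_mm)$p.value,  # p-value of the test for this group
    .groups = "drop"
  )

print(grubbs_by_group)
```

Groups with a small p_value are the ones worth a closer look. Very small groups may not give a usable test result, so it is worth checking n as well.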

CodePudding user response:

Here is a base R way to identify "outliers" by area, season, and species. Note that I added one more row with a very different value, since there are no "outliers" in your current data, and that this shows how to flag them, not how to remove them.

do.call(
  rbind,
  by(
    df,
    list(df$Area,df$Season,df$Species),
    function(x){
      Q<-quantile(x$Length_mm,probs=c(.25,.75),na.rm=FALSE)
      iqr<-IQR(x$Length_mm)
      cbind(
        x,
        "out"=x$Length_mm<(Q[1]-1.5*iqr) | x$Length_mm>(Q[2]+1.5*iqr)
      )
    }
  )
)

resulting in

           Area Season Species Length_mm   out
6  Christchurch Autumn dolphin        52 FALSE
7  Christchurch Autumn dolphin        54 FALSE
8  Christchurch Autumn dolphin        46 FALSE
9  Christchurch Autumn dolphin        40 FALSE
10 Christchurch Autumn dolphin        38 FALSE
1  Christchurch Spring dolphin        52 FALSE
2  Christchurch Spring dolphin        54 FALSE
3  Christchurch Spring dolphin        46 FALSE
4  Christchurch Spring dolphin        40 FALSE
5  Christchurch Spring dolphin        38 FALSE
16 Christchurch Autumn     ray        52 FALSE
17 Christchurch Autumn     ray        54 FALSE
18 Christchurch Autumn     ray        46 FALSE
19 Christchurch Autumn     ray        40 FALSE
20 Christchurch Autumn     ray        38 FALSE
21 Christchurch Autumn     ray       100  TRUE
11 Christchurch Spring     ray        52 FALSE
12 Christchurch Spring     ray        54 FALSE
13 Christchurch Spring     ray        46 FALSE
14 Christchurch Spring     ray        40 FALSE
15 Christchurch Spring     ray        38 FALSE

I would advise against removing these rows based on some arbitrary IQR rule. Instead, you can use this new column to inspect your data and figure out why these values are "outlying".
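If you do want the flag without ever splitting the data, and with the original row order preserved, base R's ave() can apply the same 1.5*IQR rule within each group. This is a minimal sketch, not a definitive implementation: the helper is_outlier() and the example data (including the added 100 mm ray) are assumptions made for illustration, and the column names should be adjusted to match yours.

```r
# Hypothetical example data: the question's sample plus one added 100 mm ray
df <- data.frame(
  Area      = "Christchurch",
  Season    = rep(c("Spring", "Autumn", "Spring", "Autumn"), times = c(5, 5, 5, 6)),
  Species   = rep(c("dolphin", "ray"), times = c(10, 11)),
  Length_mm = c(rep(c(52, 54, 46, 40, 38), 4), 100)
)

# Illustrative helper: TRUE where a value lies outside
# Q1 - 1.5*IQR .. Q3 + 1.5*IQR of the vector it is given
is_outlier <- function(x) {
  q   <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  x < q[[1]] - 1.5 * iqr | x > q[[2]] + 1.5 * iqr
}

# ave() applies the helper within each Area/Season/Species group and returns
# the results in the original row order (as 0/1, hence as.logical())
df$out <- as.logical(ave(df$Length_mm, df$Area, df$Season, df$Species, FUN = is_outlier))

# Optional: drop the flagged rows; rows whose flag is FALSE or NA are kept
df_clean <- df[!(df$out %in% TRUE), ]
```

Because everything stays in one data frame, there is no need to recombine the roughly 30 subsets afterwards, and df$out can be inspected before deciding whether any row should really be dropped.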
