Is there a way to get outliers out of a column in R?

Time:12-04

I'm trying to remove outliers from a column of my data set in R, but the code my professor gave me has been giving me issues. When I run it, it returns NA for every observation in every column.

Here is the line of code:

MainData <- MainData[MainData$GDP_2006 < mean(MainData$GDP_2006) + sd(MainData$GDP_2006)*2, ]

Any suggestions or solutions would be greatly appreciated!

CodePudding user response:

I strongly suspect the issue is caused by missing data. Run TRUE %in% is.na(MainData$GDP_2006). If the column contains missing values, it returns TRUE.
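As a rough sketch of that check, here it is on a small made-up vector (x is illustrative, not the asker's data). anyNA() and sum(is.na()) are base R alternatives that answer the same question:

```r
# Illustrative vector with one missing value (not from the original data)
x <- c(21.0, NA, 22.8, 21.4)

has_missing <- TRUE %in% is.na(x)   # the check suggested above
has_missing_short <- anyNA(x)       # equivalent base R shortcut
n_missing <- sum(is.na(x))          # how many values are missing

print(c(has_missing, has_missing_short))
print(n_missing)
```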

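There are two ways to deal with this: filter out the observations with missing data first, or add na.rm=TRUE to your mean() and sd() calls. With a missing value present, mean() and sd() both return NA, so the comparison is NA for every row and the subset comes back as rows of NA. This seems to recreate your problem: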

# Create demo data
df1 <- mtcars
df1[1, "mpg"] <- NA

# Problem:
df1[df1$mpg < mean(df1$mpg) + sd(df1$mpg) * 2, ]
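
To see why every row comes back as NA, it can help to split that one line into steps. This is only a sketch on a separate copy of mtcars; the names df_demo, cutoff, keep and result are introduced here for illustration:

```r
# Separate copy so the demo data above is left untouched
df_demo <- mtcars
df_demo[1, "mpg"] <- NA

cutoff <- mean(df_demo$mpg) + 2 * sd(df_demo$mpg)  # NA, because mpg contains an NA
keep   <- df_demo$mpg < cutoff                     # comparing anything with NA gives NA
result <- df_demo[keep, ]                          # an NA index yields a row of NAs

print(cutoff)
print(all(is.na(keep)))
print(all(is.na(result)))
```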

There are three general schools of thought on how to approach this task: base R, tidyverse and data.table. Here they are. My personal preference is data.table, but tidyverse is extremely popular.

# Base R way ===========================================================
# Solution 1 (use na.rm; which() drops the row whose comparison is still NA):
df1[which(df1$mpg < mean(df1$mpg, na.rm=TRUE) + sd(df1$mpg, na.rm=TRUE) * 2), ]

# Solution 2 (filter out NAs first):
df1 <- df1[!is.na(df1$mpg),]
df1[df1$mpg < mean(df1$mpg) + sd(df1$mpg) * 2, ]


# Tidyverse way ========================================================
# Set up:
library(dplyr)

# Solution 1 (use na.rm):
df1 %>% 
  filter(mpg < mean(mpg, na.rm = TRUE) + sd(mpg, na.rm = TRUE)*2)

# Solution 2 (filter out NAs first):
df1 %>% 
  filter(!is.na(mpg)) %>% 
  filter(mpg < mean(mpg) + sd(mpg)*2)


# Data.table way =======================================================
# Set up:
library(data.table)
setDT(df1, keep.rownames = TRUE)

# Solution 1 (use na.rm):
df1[mpg < mean(mpg, na.rm=TRUE) + sd(mpg, na.rm=TRUE) * 2]

# Solution 2 (filter out NAs first):
df1[!is.na(mpg)][mpg < mean(mpg) + sd(mpg) * 2]
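
Note that all of the filters above only drop unusually high values. If unusually low values should go as well, one possible variant, assuming the same two-standard-deviation rule, compares the absolute distance from the mean. drop_outliers is a hypothetical helper written for this sketch, not part of any package, and it expects a plain data.frame:

```r
# Hypothetical helper: keep rows whose value in `col` lies within
# `k` standard deviations of the mean; rows with NA in `col` are dropped.
drop_outliers <- function(data, col, k = 2) {
  x <- data[[col]]
  centre <- mean(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  keep <- !is.na(x) & abs(x - centre) <= k * spread
  data[keep, , drop = FALSE]
}

# Demo on a copy of mtcars with one missing value
demo <- mtcars
demo[1, "mpg"] <- NA
cleaned <- drop_outliers(demo, "mpg")
print(nrow(cleaned))
```

Because the NA check is part of the row selection, no which() or na.rm juggling is needed at the call site.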