Home > Mobile >  R: Operations on data.table by group, removing outliers
R: Operations on data.table by group, removing outliers

Time:04-27

I have a data.table in which I want to detect the presence of outliers (according to skewness and kurtosis) and, if found, correct them.

To this purpose, when an outlier is detected and var is the highest value, I want to set the highest value in var to be equal to the second highest. Below is a minimal (almost) working example of my code:

`%>%` <- fastpipe::`%>>%`

country <- rep(c("AA", "BB", "CC", "ZZ"), times = 4)
year <- rep(c("2014", "2015", "2016", "2017"), each = 4)
var <- c(NA, rnorm(8, 2, 4), NA, NA, 1, 25, 19, 2, 3)

melted_data <- data.table(country, year, var)

melted_data %>%
  .[, skew := e1071::skewness(var, na.rm = TRUE), by=year] %>%
  .[, kurt:= moments::kurtosis(var, na.rm = TRUE), by=year] %>%
  .[, outliers := kurt>1 || kurt>3.5 & abs(skew)>2, by=year] %>%

  # Ranks
  .[, rank_high_first := as.integer(frank(-var, na.last="keep", ties.method="min")), by=year] %>%
  .[, rank_low_first := as.integer(frank(var, na.last="keep", ties.method="min")), by=year]  %>%

  # Identify and correct outliers
  .[rank_high_first==1, highest1 := var, by=year] %>%
  .[rank_high_first==2, highest2 := var, by=year] %>%
  .[rank_low_first==1, lowest1 := var, by=year] %>%
  .[rank_low_first==2, lowest2 := var, by=year] %>%
  .[outliers==TRUE & skew>0 & var==highest1, var<-highest2, by=year]

What I am trying to achieve is all in the last row. However, this does not work because the values highest1 and highest2 do not span the whole year group (edit: see also screenshot below). I think the solution would be to modify the following lines

.[rank_high_first==1, highest1 := var, by=year] %>%
.[rank_high_first==2, highest2 := var, by=year] %>%

so that highest1 and highest2 are copied to all rows in that year. How can I achieve that?

EDIT: Solved the first part of my problem by rewriting the last lines this way:

      .[, highest1 := head(sort(var, decreasing=TRUE), 1), by=year] %>%
      .[, highest2 := head(unique(sort(var, decreasing=TRUE)), 2)[2], by=year] %>% # unique() is to account for ties
      .[, lowest1 := head(sort(var, decreasing=FALSE), 1), by=year] %>%
      .[, lowest2 := head(unique(sort(var, decreasing=FALSE)), 2)[2], by=year] %>%
      .[outliers==TRUE & skew>0 & var==highest1, var<-highest2, by=year] %>% # This row (and following one) do not give desired result
      .[outliers==TRUE & skew<0 & var==lowest1, var<-lowest2, by=year]

First four rows give me what I want, but the last two do not.

CodePudding user response:

In the identify and correct outliers section, I believe you can make the following change:

f <- function(v,r,i) v[r==i & !is.na(r)]
melted_data[, `:=`(
  highest1 = f(var,rank_high_first,1),highest2=f(var,rank_high_first,2),
  lowest1 = f(var,rank_low_first,1),lowest2=f(var,rank_low_first,2)
),by=year]

Also, I was wondering about your initial definition of outliers. Should the parentheses be added as below?:

  .[, outliers := kurt>1 | (kurt>3.5 & abs(skew)>2), by=year]

The problem you are seeing in the last two lines, I believe can be solved by:

  .[outliers==TRUE & skew>0 & var==highest1, var:=highest2] %>%
  .[outliers==TRUE & skew<0 & var==lowest1, var:=lowest2]

Note: You do not need by=year here, and you should use := instead of <-

  • Related