Error with group_by() %>% group_modify() when applying a custom function

Time:11-10

I'm beginning to use the tidyverse in my code, and I'm running into trouble when trying to use a custom function in a pipe.

I have a dataset of patient data at two different timepoints. Example data:

dataset <- data.frame(patient_id = rep(1:5, each=6), 
                  timepoint = rep(1:2, 15), 
                  Mean = c(sample(100:130, 25),25,315,46,223,67), 
                  Circ. = sample(40:99, 30)/100,
                  Perim. = sample(1000:2500, 30))

I want to group my data by patient_id and timepoint and then apply to each group a function that removes the rows with an outlier value in the Mean column. This is what I wrote:

dataset <- dataset %>% 
           group_by(patient_id, timepoint) %>% 
           group_modify(~rm.outliers(.x,"Mean")) %>% 
           ungroup()

The error I get when running this line is:

Error: Can't subset columns that don't exist. x Locations 41, 119, 124, 112, 130, etc. don't exist. ℹ There are only 1 column.

It makes me think it has something to do with keeping the grouping after removing the outliers, but I don't know how to approach it.

rm.outliers is a custom function that removes any row whose Mean value is more than 1.5 interquartile ranges (IQRs) below the first quartile or above the third quartile. It works well for a single data frame, but I'm not very used to writing functions, so there may be some mistakes here:

rm.outliers <- function(data, column){
  Q <- quantile(data[,c(column)],  probs=c(.25, .75), na.rm = FALSE)
  iqr <- IQR(data[,c(column)])
  up <-  Q[2] + 1.5*iqr # Upper Range  
  low<- Q[1]-1.5*iqr # Lower Range
  data <-  data[data[,c(column)] < up & data[,c(column)] > low, ]
  data
}

What am I doing wrong? Is there a better way of doing this using tidyverse?

Thanks for any help you can offer

CodePudding user response:

I would suggest returning a logical vector from rm.outliers (TRUE for values to keep) and using it inside filter(), which applies it within each group.

library(dplyr)

rm.outliers <- function(data){
  Q <- quantile(data,  probs=c(.25, .75), na.rm = FALSE)
  iqr <- IQR(data)
  up <-  Q[2] + 1.5*iqr # Upper Range  
  low<- Q[1]-1.5*iqr # Lower Range
  data < up & data > low
}

dataset %>% 
  group_by(patient_id, timepoint) %>% 
  filter(rm.outliers(Mean)) %>%
  ungroup()
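
If you would rather keep the group_modify() approach, the likely cause of the error is that group_modify() passes each group to your function as a tibble, and data[, column] on a tibble returns a one-column tibble rather than a plain vector, so quantile() ends up indexing columns that don't exist. A minimal sketch, assuming that is what is happening, pulls the column out with [[ ]] instead. The name rm.outliers.df and the toy tibble are only for illustration and are not from the question:

```r
library(dplyr)

# Hypothetical data-frame version of the same IQR rule.
# data[[column]] always returns a plain vector, for data frames and tibbles.
rm.outliers.df <- function(data, column) {
  x <- data[[column]]
  Q <- quantile(x, probs = c(.25, .75), na.rm = FALSE)
  iqr <- IQR(x)
  up <- Q[2] + 1.5 * iqr   # upper fence
  low <- Q[1] - 1.5 * iqr  # lower fence
  data[x < up & x > low, ] # keep only rows strictly inside the fences
}

# Toy data (an assumption for illustration): 100 is an obvious outlier.
toy <- tibble(grp = "a", Mean = c(10, 11, 12, 13, 100))

toy_clean <- toy %>%
  group_by(grp) %>%
  group_modify(~ rm.outliers.df(.x, "Mean")) %>%
  ungroup()
```

The same call, group_modify(~ rm.outliers.df(.x, "Mean")), should slot into the original pipeline on dataset. The filter() version above is still shorter, so this is mainly useful when the function needs the whole group rather than a single column.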