Subset the data (based on a categorical variable) in R, using %in%

Time:06-21

I made a minimal reproducible example:

modelcoef <- c( 'model1_1_ef','model1_1_ev1','model1_1_ev2','model2_1_ef','model2_1_ev1','model2_1_ev2')
id <- 1:6
value <- c(3,1,4,6,4,6)

data<-data.frame(modelcoef,id,value)
subset1<- data %>%
  subset(modelcoef %in% c('ev1','ev2'))

# 0 observations, so it failed.

I am trying to subset my data based on the categorical variable "modelcoef". However, the code above does not seem to work. I want to keep only the rows whose modelcoef value contains "ev1" or "ev2".

I can do it manually for this example, but my real data is huge, so I cannot do that by hand.

CodePudding user response:

library(dplyr)
library(stringr)
library(tidyr)

modelcoef <- c( 'model1_1_ef','model1_1_ev1','model1_1_ev2','model2_1_ef','model2_1_ev1','model2_1_ev2')
id <- 1:6
value <- c(3,1,4,6,4,6)

data <- data.frame(modelcoef,id,value)



# Regular expression solution
# You can add more levels with the | (or) separator
# This can be risky if "ev1" or "ev2" may also occur elsewhere in the modelcoef values
subset1 <- data %>% 
  filter(stringr::str_detect(modelcoef, "ev1|ev2"))

# A better solution with tidyr

data %>%
  tidyr::separate(modelcoef, into = c("model_id", "unknown_thing", "modelcoef"), sep = "_") %>%  
  filter(modelcoef %in% c("ev1", "ev2"))
#>   model_id unknown_thing modelcoef id value
#> 1   model1             1       ev1  2     1
#> 2   model1             1       ev2  3     4
#> 3   model2             1       ev1  5     4
#> 4   model2             1       ev2  6     6

Created on 2022-06-17 by the reprex package (v2.0.1)

The tidyr::separate() function takes a column as input and splits it into multiple columns. Your initial modelcoef column seems to follow the pattern (model_id)_(number)_(modelcoef), so if that holds for all your data, this solution should work. It simply splits the column into 3 separate columns.

Then you can use the newly created variable "modelcoef" to filter for ev1 and ev2.
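
The caution about the unanchored regular expression can be made concrete. Below is a minimal sketch, assuming a hypothetical extra level 'model3_1_ev10' that is not in the original data: the unanchored pattern also picks it up, while a pattern anchored to the last underscore-separated field does not. Base R's grepl() is used so the snippet runs without extra packages; the same pattern string should also work with stringr::str_detect().

```r
# Hypothetical data: the last value is an assumption added for illustration
modelcoef <- c('model1_1_ef', 'model1_1_ev1', 'model1_1_ev2', 'model3_1_ev10')
data2 <- data.frame(modelcoef, id = 1:4)

# Unanchored: also matches 'ev10', because it contains 'ev1'
loose <- data2[grepl("ev1|ev2", data2$modelcoef), ]

# Anchored: 'ev1' or 'ev2' must follow an underscore and end the string
strict <- data2[grepl("_(ev1|ev2)$", data2$modelcoef), ]

loose$id   # 2 3 4
strict$id  # 2 3
```

If every value really follows the three-part pattern, the anchored pattern and the separate() approach should select the same rows.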

CodePudding user response:

You need to use a regex pattern, ev(1|2) in this case:

library(dplyr)
library(stringr)
data %>%
    filter(str_detect(modelcoef, "ev(1|2)"))
     modelcoef id value
1 model1_1_ev1  2     1
2 model1_1_ev2  3     4
3 model2_1_ev1  5     4
4 model2_1_ev2  6     6
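
To see why the original attempt returned zero rows, it may help to compare the two tests side by side. This is a small base R sketch (grepl() is swapped in for str_detect() so that no packages are needed): %in% checks whether each whole string equals one of the listed values, whereas a regex test checks whether the pattern occurs anywhere inside the string.

```r
modelcoef <- c('model1_1_ef', 'model1_1_ev1', 'model1_1_ev2',
               'model2_1_ef', 'model2_1_ev1', 'model2_1_ev2')
data <- data.frame(modelcoef, id = 1:6, value = c(3, 1, 4, 6, 4, 6))

# %in% compares whole strings: no element is exactly 'ev1' or 'ev2'
exact <- data$modelcoef %in% c('ev1', 'ev2')

# grepl() tests whether the pattern occurs anywhere in each string
partial <- grepl("ev(1|2)", data$modelcoef)

sum(exact)    # 0
sum(partial)  # 4

# Keep only the rows where the pattern matched
result <- data[partial, ]
result$id     # 2 3 5 6
```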