Trying to sort character variable into new variable with new value based on conditions

Time:11-30

I want to sort a character variable into two categories in a new variable based on conditions. If the conditions are not met, I want it to return "other".

If variable x contains 4 character values "A", "B", "C" & "D", I want to sort them into 2 categories, 1 and 0, in a new variable y, creating a dummy variable.

Ideally I want it to look like this

df <- data.frame(x = c("A", "B", "C", "D"))

 y <- if x == "A" | "D" then assign 1 in y
 if x == "B" | "C" then assign 0 in y
 if x == other then assign NA in y

    x   y
  1 "A"  1
  2 "B"  0
  3 "C"  0
  4 "D"  1



This is what I tried:

 library(dplyr)
 df <- df %>% mutate ( y =case_when(
  (x %in% df == "A" | "D") ~ 1 , 
  (x %in% df == "B" | "C") ~ 1,
   x %in% df ==  ~ NA
 ))

I got this error message:

Error: replacement has 3 rows, data has 2

CodePudding user response:

Here's the proper case_when syntax.

df <- data.frame(x = c("A", "B", "C", "D"))
 
library(dplyr)

df <- df %>%
  mutate(y = case_when(x %in% c("A", "D") ~ 1,
                       x %in% c("B", "C") ~ 0,
                       TRUE ~ NA_real_))
df
#>   x y
#> 1 A 1
#> 2 B 0
#> 3 C 0
#> 4 D 1
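
The output above never reaches the fallback branch, because every value of x matches one of the first two conditions. As a rough cross-check, here is a minimal base R sketch of the same recoding. It uses nested ifelse() instead of case_when, so it is an alternative technique and not the method shown in this answer, and the extra value "E" is a hypothetical addition to exercise the NA case.

```r
# Example data; "E" is a hypothetical extra value, not part of the original question
df <- data.frame(x = c("A", "B", "C", "D", "E"))

# 1 for A/D, 0 for B/C, NA for anything else
df$y <- ifelse(df$x %in% c("A", "D"), 1,
        ifelse(df$x %in% c("B", "C"), 0, NA_real_))

print(df)
```

With this data, the row for "E" should come out as NA, which corresponds to the TRUE ~ NA_real_ line in the case_when call.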

CodePudding user response:

You're combining syntaxes in a way that makes sense in speech but not in code. Generally you can't use foo == "G" | "H". You need to use foo == "G" | foo == "H", or the handy shorthand foo %in% c("G", "H").
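
A small sketch of that point, using a made-up vector foo (the values are only for illustration). It compares the spelled-out form with the %in% shorthand, and captures the error from the spoken-style form instead of letting it stop the script.

```r
# Hypothetical example vector
foo <- c("G", "H", "J")

# Spelled-out comparison: each side of | is a complete logical test
long_form <- foo == "G" | foo == "H"

# Shorthand: is each element of foo one of the listed values?
short_form <- foo %in% c("G", "H")

print(long_form)
print(short_form)

# foo == "G" | "H" fails because "H" is a character string, not a logical value;
# tryCatch returns the error message as a string so the script keeps running
bad <- tryCatch(foo == "G" | "H", error = function(e) conditionMessage(e))
print(bad)
```

The two working forms agree here, but they differ on missing values: NA == "G" is NA, while NA %in% c("G", "H") is FALSE.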

Similarly, x %in% df == "A" doesn't make sense. x %in% df makes sense. df == "A" makes sense. Putting them together as x %in% df == ... does not make sense to R. (Okay, it does make sense to R, but not the same sense it does to you. R will use its Order of Operations, which evaluates x %in% df first and gets a result from that, and then checks whether that result == "A", which is not what you want.)
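
The evaluation order can be traced step by step. This is a minimal sketch that, as a simplifying assumption, uses the vector c("A", "D") on the right-hand side of %in% instead of the data frame df, so the intermediate values are easy to read.

```r
x <- c("A", "B", "C", "D")

# Step 1: %in% binds tighter than ==, so it runs first and returns logicals
step1 <- x %in% c("A", "D")
print(step1)

# Step 2: the logicals are then compared with "A"; TRUE is coerced to the
# string "TRUE", which never equals "A", so every element is FALSE
step2 <- step1 == "A"
print(step2)

# The unparenthesised expression performs the same two steps
chained <- x %in% c("A", "D") == "A"
print(identical(chained, step2))
```

The chained expression runs without an error, but it answers a different question from the one intended.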

Inside a dplyr function like mutate, you don't need to keep specifying df. You pipe in df, and from then on you just need to use the column x. x %in% df looks like you're testing whether the column x is in the data frame df, which you don't need to do. Instead, use x %in% c("A", "D"). Aron's answer shows the full correct syntax; I hope this answer helps you understand why.
