I want to sort a character variable into two categories in a new variable based on conditions; if the conditions are not met, I want it to return "other".
If variable x contains 4 character values, "A", "B", "C" and "D", I want to sort them into 2 categories, 1 and 0, in a new variable y, creating a dummy variable.
Ideally I want it to look like this:
df <- data.frame(x = c("A", "B", "C", "D"))
y <- if x == "A" | "D" then assign 1 in y
if x == "B" | "C" then assign 0 in y
if x == other then assign NA in y
x y
1 "A" 1
2 "B" 0
3 "C" 0
4 "D" 1
library(dplyr)
df <- df %>% mutate ( y =case_when(
(x %in% df == "A" | "D") ~ 1 ,
(x %in% df == "B" | "C") ~ 1,
x %in% df == ~ NA
))
I got this error message:
Error: replacement has 3 rows, data has 2
CodePudding user response:
Here's the proper case_when syntax.
df <- data.frame(x = c("A", "B", "C", "D"))
library(dplyr)
df <- df %>%
  mutate(y = case_when(x %in% c("A", "D") ~ 1,
                       x %in% c("B", "C") ~ 0,
                       TRUE ~ NA_real_))
df
#> x y
#> 1 A 1
#> 2 B 0
#> 3 C 0
#> 4 D 1
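The example data only contains the four listed values, so the TRUE ~ NA_real_ fallback never fires. Here is a small sketch of what happens when it does, assuming a fifth value "E" (made up for illustration, not from the question) and a data frame named df2 so it doesn't clash with the one above:

```r
library(dplyr)

# "E" is a hypothetical extra value that matches neither condition
df2 <- data.frame(x = c("A", "B", "C", "D", "E"))

df2 <- df2 %>%
  mutate(y = case_when(x %in% c("A", "D") ~ 1,
                       x %in% c("B", "C") ~ 0,
                       TRUE ~ NA_real_))  # everything else falls through to NA

df2
```

The row for "E" ends up with NA in y, while the other four rows get the same values as before.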
CodePudding user response:
You're combining syntaxes in a way that makes sense in speech but not in code.
Generally you can't use foo == "G" | "H". You need to use foo == "G" | foo == "H", or the handy shorthand foo %in% c("G", "H").
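To make that concrete, here is a quick base R sketch using a plain vector x with the values from the question. The names long_form, short_form and spoken_form are just for illustration:

```r
x <- c("A", "B", "C", "D")

# Spelling out both comparisons works
long_form <- x == "A" | x == "D"

# The %in% shorthand gives the same logical vector
short_form <- x %in% c("A", "D")

identical(long_form, short_form)

# The "spoken" form is an error, because | cannot combine a logical
# vector with the string "D"; tryCatch() just captures the message
spoken_form <- tryCatch(x == "A" | "D",
                        error = function(e) conditionMessage(e))
spoken_form
```

The first two forms agree, and the third one stops with an error instead of quietly doing the wrong thing.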
Similarly, x %in% df == "A" doesn't make sense. x %in% df makes sense. df == "A" makes sense. Putting them together, x %in% df == ... does not make sense to R. (Okay, it does make sense to R, but not the same sense it does to you. R will use its Order of Operations, which evaluates x %in% df first and gets a result from that, and then checks whether that result == "A", which is not what you want.)
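If you want to see how R groups an expression like that, one way (a sketch, not the only way) is to capture it with quote() and look at the pieces. Nothing is evaluated here, so x and df don't need to exist:

```r
# quote() captures the expression without running it
expr <- quote(x %in% df == "A")

# The outermost call is ==, so it is the last thing to run ...
outer_op <- as.character(expr[[1]])

# ... and its left-hand side is the whole x %in% df call
left_side <- deparse(expr[[2]])

print(outer_op)
print(left_side)
```

So R reads it as (x %in% df) == "A": the %in% part is evaluated first, and its TRUE/FALSE result is what gets compared to "A".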
Inside a dplyr function like mutate, you don't need to keep specifying df. You pipe in df and now you just need to use the column x. x %in% df looks like you're testing whether the column x is in the data frame df, which you don't need to do. Instead use x %in% c("A", "D"). Aron's answer shows the full correct syntax; I hope this answer helps you understand why.