Let's say I have a data frame df as follows:
df <- data.frame(type = c("A","B","AB","O","O","B","A"))
Obviously there are 4 kinds of type. However, in my actual data, I don't know how many kinds are in the column type. The number of dummy variables should be one less than the number of kinds in type. In this example, the number of dummy variables should be 3. My expected output looks like this:
df <- data.frame(type = c("A","B","AB","O","O","B","A"),
A = c(1,0,0,0,0,0,1),
B = c(0,1,0,0,0,1,0),
AB = c(0,0,1,0,0,0,0))
Here I used A, B and AB as dummy variables, but which values I choose from type doesn't matter. Even if I don't know the values of type or the number of kinds, I want to turn the column into dummy variables.
Answer:
> The number of dummy variables should be one less than the number of kinds in type.
> Here I used "A", "B" and "AB" as dummy variables, but whatever I choose from type doesn't matter.
> Even if I don't know the values in type and the number of kinds, I somehow want to make it as dummy variables.
This is treatment contrasts coding. First, you need a factor variable.
## option 1: if you care about the order of the dummy variables
## the 1st level is the baseline and gets no dummy variable
## I do this to match your example output with "A", "B" and "AB"
f <- factor(df$type, levels = c("O", "A", "B", "AB"))
## option 2: if you don't care, then let R automatically order levels
f <- factor(df$type)
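When the values of type are not known in advance, the levels can be inspected after the factor is built, and the baseline can be changed without retyping every level. A minimal sketch, assuming base R's relevel() is acceptable here; the names f_default and f_ref are only illustrative:

```r
df <- data.frame(type = c("A", "B", "AB", "O", "O", "B", "A"))

## let R discover the levels from the data
f_default <- factor(df$type)
levels(f_default)   # the first level is the baseline (it gets no dummy)

## move a chosen level to the front so it becomes the baseline;
## "O" is picked here only to match the example output
f_ref <- relevel(f_default, ref = "O")
levels(f_ref)
```

Note that relevel() moves only the reference level to the front and keeps the remaining levels in their existing order, so the dummy columns would come out as A, AB, B rather than A, B, AB.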
Now, apply treatment contrasts coding.
## option 1 (recommended): using contr.treatment()
m <- contr.treatment(nlevels(f))[f, ]
## option 2 (less efficient): using model.matrix()
m <- model.matrix(~ f)[, -1]
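As a quick sanity check, the two options can be compared on the example data. This is only a sketch; the names m1 and m2 are illustrative, and the comparison ignores row and column names because the two approaches label them differently:

```r
df <- data.frame(type = c("A", "B", "AB", "O", "O", "B", "A"))
f <- factor(df$type, levels = c("O", "A", "B", "AB"))

m1 <- contr.treatment(nlevels(f))[f, ]   # option 1
m2 <- model.matrix(~ f)[, -1]            # option 2

## compare the 0/1 values only, dropping dimnames
same_values <- all(unname(m1) == unname(m2))
same_values
```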
Finally you want to have nice row/column names for readability.
dimnames(m) <- list(1:length(f), levels(f)[-1])
The resulting m looks like:
#  A B AB
#1 1 0  0
#2 0 1  0
#3 0 0  1
#4 0 0  0
#5 0 0  0
#6 0 1  0
#7 1 0  0
This is a matrix. If you want a data frame, do data.frame(m).
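Putting the steps together, the dummy variables can be attached to the original data frame without knowing the values of type beforehand. This is a sketch under the assumption that the automatically chosen baseline (the first level in sorted order) is acceptable; the name out is illustrative, and drop = FALSE is added so the result stays a matrix even when there are only two levels:

```r
df <- data.frame(type = c("A", "B", "AB", "O", "O", "B", "A"))

f <- factor(df$type)                                  # levels come from the data
m <- contr.treatment(nlevels(f))[f, , drop = FALSE]   # one row per observation
dimnames(m) <- list(seq_along(f), levels(f)[-1])      # name columns after levels

## bind the dummy columns next to the original column
out <- cbind(df, as.data.frame(m))
out
```

With the default level order the baseline here is "A", so the dummy columns are AB, B and O rather than A, B and AB. A column with a single distinct value has no dummy variables to produce, and contr.treatment() should be expected to fail in that case.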