The title is a little garbled, but I'm not sure how else to describe it. I'm coming from Stata so still getting the hang of factors.
Basically, I want to be able to assign factor levels and labels, but have any values that I miss assigned to a default level/label.
Take the following:
library(dplyr)
dt <- as.data.frame(mtcars)  # load demo data
dt$carb[4:6] <- NA           # set some rows to NA for example
dt <- dt %>%
  mutate(
    carb_f = factor(carb,
                    levels = c(1, 2, 3, 4),
                    labels = c("One", "Two", "Three", "Four")
    )
  )
table(dt$carb, dt$carb_f, exclude = NULL)
which yields the following:
       One Two Three Four <NA>
  1      5   0     0    0    0
  2      0   9     0    0    0
  3      0   0     3    0    0
  4      0   0     0   10    0
  6      0   0     0    0    1
  8      0   0     0    0    1
  <NA>   0   0     0    0    3
The unstated 6 and 8 are set to NA in the resultant factor carb_f. Although this is expected behaviour, I want to be able to request something like this:
dt <- dt %>%
  mutate(
    carb_f = factor(carb,
                    levels = c(1, 2, 3, 4),
                    labels = c("One", "Two", "Three", "Four"),
                    non-na(10, "Unk")  # obvious pseudocode
    )
  )
to yield this:
       One Two Three Four Unk <NA>
  1      5   0     0    0   0    0
  2      0   9     0    0   0    0
  3      0   0     3    0   0    0
  4      0   0     0   10   0    0
  6      0   0     0    0   1    0
  8      0   0     0    0   1    0
  <NA>   0   0     0    0   0    3
...where the unstated 6 and 8 are assigned to a default level/label of 10 and Unk, but the true NA values remain NA.
Is there a way of handling this without explicitly referencing 6 and 8?
Answer:
Just use the same label multiple times.
dt <- transform(dt, carb_f=factor(carb, labels=c('one', 'two', 'three', 'four', 'unk', 'unk')))
table(dt$carb, dt$carb_f, useNA='ifany')
#        one two three four unk <NA>
#   1      5   0     0    0   0    0
#   2      0   9     0    0   0    0
#   3      0   0     3    0   0    0
#   4      0   0     0   10   0    0
#   6      0   0     0    0   1    0
#   8      0   0     0    0   1    0
#   <NA>   0   0     0    0   0    3
Note: I omitted the levels= argument since the default (the sorted unique values of carb) is sufficient. However, it can be very helpful if we want a different order, e.g. levels=c(2, 1, 3, 4, 6, 8) to use 2 as the first (hence reference) level; further note that levels and labels correspond in their positions.
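To make the positional matching concrete, here is a minimal sketch on a made-up toy vector (x and the lower-case labels are assumptions for illustration, not part of the original data). It reorders the levels so that 2 comes first and reorders the labels the same way:

```r
# Hypothetical toy vector, standing in for dt$carb
x <- c(1, 2, 2, 3, 4, 6, 8, NA)

# Put 2 first so it becomes the reference level;
# labels must follow the same order as levels
f <- factor(x,
            levels = c(2, 1, 3, 4, 6, 8),
            labels = c('two', 'one', 'three', 'four', 'unk', 'unk'))

levels(f)                      # the repeated 'unk' label becomes a single level
table(x, f, useNA = 'ifany')   # NA in x stays NA in f
```

If the labels were left in their original order while the levels were reordered, the values 1 and 2 would silently swap labels, so it is worth checking the cross-table after any reordering.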
To avoid typing the label multiple times, combine the respective levels into a new level that is higher than all others, e.g. Inf, and use factor in a second step. This can easily be done using within.
dt <- within(dt, {
  carb_f <- ifelse(carb %in% c(6, 8), Inf, carb)
  carb_f <- factor(carb_f, labels=c('one', 'two', 'three', 'four', 'unk'))
})
table(dt$carb, dt$carb_f, useNA='ifany')
#        one two three four unk <NA>
#   1      5   0     0    0   0    0
#   2      0   9     0    0   0    0
#   3      0   0     3    0   0    0
#   4      0   0     0   10   0    0
#   6      0   0     0    0   1    0
#   8      0   0     0    0   1    0
#   <NA>   0   0     0    0   0    3
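If the values to be lumped together are not known in advance, one possible variation of the same idea is to name only the levels you want to keep and send every other non-NA value to the catch-all. The following is a sketch under that assumption; the helper names known and labs are made up for this example:

```r
dt <- as.data.frame(mtcars)  # same demo data as in the question
dt$carb[4:6] <- NA

known <- c(1, 2, 3, 4)                      # levels to keep (assumed)
labs  <- c('One', 'Two', 'Three', 'Four')   # their labels, in the same order

dt <- within(dt, {
  # NA and known values pass through; everything else becomes Inf
  carb_f <- ifelse(is.na(carb) | carb %in% known, carb, Inf)
  # Inf is listed last, so 'Unk' ends up as the last level
  carb_f <- factor(carb_f, levels = c(known, Inf), labels = c(labs, 'Unk'))
})

table(dt$carb, dt$carb_f, useNA = 'ifany')
```

Here 6 and 8 are never mentioned explicitly, and any other unexpected value would land in Unk as well, while true NA values stay NA.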