Assign a default level/label for unstated non-NA values when creating a factor?

The title is a little garbled, but I'm not sure how else to describe it. I'm coming from Stata so still getting the hang of factors.

Basically, I want to assign factor levels and labels explicitly, and have any non-NA value that I don't list fall into a default level/label.

Take the following:

library(dplyr)
dt <- as.data.frame(mtcars)  # load demo data
dt$carb[4:6] <- NA           # set some rows to NA for example

dt <- dt%>%
  mutate(
    carb_f = factor(carb,
                    levels = c(1,2,3,4), 
                    labels = c("One","Two","Three","Four")
                    )
  )

table(dt$carb, dt$carb_f, exclude=NULL)

which yields the following:

       One Two Three Four <NA>
  1      5   0     0    0    0
  2      0   9     0    0    0
  3      0   0     3    0    0
  4      0   0     0   10    0
  6      0   0     0    0    1
  8      0   0     0    0    1
  <NA>   0   0     0    0    3

The unstated 6 and 8 are set to NA in the resultant factor carb_f. Although this is expected behaviour, I want to be able to request something like this:

dt <- dt%>%
  mutate(
    carb_f = factor(carb,
                    levels = c(1,2,3,4), 
                    labels = c("One","Two","Three","Four"),
                    non-na(10,"Unk")   # obvious pseudocode
                    )
  )

to yield this:

       One Two Three Four Unk <NA>
  1      5   0     0    0   0    0
  2      0   9     0    0   0    0
  3      0   0     3    0   0    0
  4      0   0     0   10   0    0
  6      0   0     0    0   1    0
  8      0   0     0    0   1    0
  <NA>   0   0     0    0   0    3

...where the unstated 6 and 8 are assigned to a default level/label of 10 and Unk, while the true NA values remain NA.

Is there a way of handling this without explicitly referencing 6 and 8?

Answer:

Just use the same label multiple times; levels that share a label are merged into a single level.

dt <- transform(dt, carb_f=factor(carb, labels=c('one', 'two', 'three', 'four', 'unk', 'unk')))
table(dt$carb, dt$carb_f, useNA='ifany')
#      one two three four unk <NA>
# 1      5   0     0    0   0    0
# 2      0   9     0    0   0    0
# 3      0   0     3    0   0    0
# 4      0   0     0   10   0    0
# 6      0   0     0    0   1    0
# 8      0   0     0    0   1    0
# <NA>   0   0     0    0   0    3

Note: I omitted the levels= argument since the default ordering (the sorted unique values of carb) is sufficient. It can be very helpful, however, if we want a different order, e.g. levels=c(2, 1, 3, 4, 6, 8) to make 2 the first (hence reference) level. Note also that levels and labels are matched by position, so the labels have to be given in the same order as the levels.
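To illustrate the positional matching, here is a minimal sketch on a toy vector (the values are made up and only stand in for dt$carb). The levels are reordered so that 2 comes first, and the labels are reordered to match:

```r
# Toy vector standing in for dt$carb (illustrative values only)
carb <- c(1, 2, 2, 3, 4, 6, 8, NA)

# levels and labels are matched by position: 2 -> 'two', 1 -> 'one', ...
f <- factor(carb,
            levels = c(2, 1, 3, 4, 6, 8),
            labels = c('two', 'one', 'three', 'four', 'unk', 'unk'))

levels(f)                  # 'two' is first, hence the reference level
table(f, useNA = 'ifany')  # 6 and 8 are both counted under 'unk'; NA stays NA
```

If the labels were left in their original order here, 2 would be labelled 'one' and 1 would be labelled 'two', so it is worth double-checking with a table afterwards.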

To avoid typing the label multiple times, first recode the respective values into a single new value that is higher than all the others, e.g. Inf, and then call factor in a second step. This is easily done using within.

dt <- within(dt, {
  carb_f <- ifelse(carb %in% c(6, 8), Inf, carb)
  carb_f <- factor(carb_f, labels=c('one', 'two', 'three', 'four', 'unk'))
})

table(dt$carb, dt$carb_f, useNA='ifany')
#      one two three four unk <NA>
# 1      5   0     0    0   0    0
# 2      0   9     0    0   0    0
# 3      0   0     3    0   0    0
# 4      0   0     0   10   0    0
# 6      0   0     0    0   1    0
# 8      0   0     0    0   1    0
# <NA>   0   0     0    0   0    3
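The within version still names 6 and 8. If the goal is to avoid that, one possible variation is to recode everything that is neither NA nor in the list of wanted values. This is only a sketch on a toy vector; the names known, labs and recoded are made up for the example, and Inf is just a placeholder value for the catch-all level:

```r
# Toy vector standing in for dt$carb (illustrative values only)
carb <- c(1, 2, 3, 4, 6, 8, NA)

known <- c(1, 2, 3, 4)                     # values we want to label explicitly
labs  <- c('one', 'two', 'three', 'four')  # their labels, in the same order

# Keep listed values and NA as they are; send every other value to Inf
recoded <- ifelse(is.na(carb) | carb %in% known, carb, Inf)

carb_f <- factor(recoded,
                 levels = c(known, Inf),
                 labels = c(labs, 'unk'))

table(carb, carb_f, useNA = 'ifany')
```

Because levels= is given explicitly, the 'unk' level exists even when the data contain no unlisted values; it is simply empty in that case.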