How to specify the integer value of factors in R

Time:08-25

Suppose I have data that comes encoded as integers, e.g.:

int_data = c(10,50,60)

And suppose I know that 10 denotes "Workers", 50 denotes "Farmers", and 60 denotes "Teachers". I now want a factor f with the following attributes:

  1. as.integer(f) is c(10,50,60)
  2. levels(f) is c("Workers","Farmers","Teachers") (in that order!)

With everything I have found so far, I can create a factor with the desired labels, but the internal integer codes become c(1,2,3), which I don't want.
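To illustrate, here is a minimal sketch using base R's factor() with the codes and labels from above. The labels come out as wanted, but the underlying integer codes are simply the positions of the levels:

```r
# Codes as received, and the labels they stand for
int_data <- c(10, 50, 60)

f <- factor(int_data,
            levels = c(10, 50, 60),
            labels = c("Workers", "Farmers", "Teachers"))

levels(f)      # "Workers" "Farmers" "Teachers"
as.integer(f)  # 1 2 3, not 10 50 60
```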

Any suggestions?

EDIT: My main goal is to have the data in a more reader-friendly form (as one would have with factors displaying levels rather than integers) without altering the integer encoding I receive. Since I am working with very large data sets, simply keeping a "clear text duplicate" character column is not feasible.

CodePudding user response:

Here's an example that uses a lookup table to add a column with the reader-friendly labels alongside your coded data.

my_data <- data.frame(int_data = c(10, 50, 60, 10, 60, 50))

lookup <- data.frame(int_data = c(10, 50, 60),
                     category = c("Workers", "Farmers", "Teachers"))


my_data_annotated <- dplyr::left_join(my_data, lookup, by = "int_data")

my_data_annotated
  int_data category
1       10  Workers
2       50  Farmers
3       60 Teachers
4       10  Workers
5       60 Teachers
6       50  Farmers
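
If you would rather not depend on dplyr, roughly the same lookup can be sketched in base R with match(). The object and column names are reused from above. Storing the labels as a factor rather than a character column is an assumption made here, so that each row holds a small integer code plus one shared copy of the labels; the original int_data column is left untouched.

```r
my_data <- data.frame(int_data = c(10, 50, 60, 10, 60, 50))

lookup <- data.frame(int_data = c(10, 50, 60),
                     category = c("Workers", "Farmers", "Teachers"))

# match() returns, for each row of my_data, the row of lookup holding its code
idx <- match(my_data$int_data, lookup$int_data)

# Add the labels as a factor column, with levels in lookup order
my_data$category <- factor(lookup$category[idx], levels = lookup$category)

my_data
```

Codes that do not appear in the lookup table would come out as NA in the new column.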

CodePudding user response:

If you really wanted to, you could define your own S3 class that meets your specification. Let's say it was called factoroid and behaved like this:

df <- data.frame(
  col = factoroid(c("Workers","Farmers","Teachers"), c(10, 50, 60))
  )

df
#>        col
#> 1  Workers
#> 2  Farmers
#> 3 Teachers

df$col[2:3]
#> [1] Farmers  Teachers

levels(df$col)
#> [1] "Workers"  "Farmers"  "Teachers"

as.integer(df$col)
#> [1] 10 50 60

A very basic implementation would be something like this:

factoroid <- function(labels, values = seq_along(unique(labels))) {
  lab_match <- match(labels, unique(labels))
  # Compare unique counts so that repeated labels/values are accepted
  if (length(unique(labels)) != length(unique(values))) {
    stop("Values must have same number of unique elements as Labels")
  }
  structure(unique(values)[lab_match], levels = unique(labels),
            map = lab_match, class = "factoroid")
}

You would also need, at a minimum, the following methods:

as.character.factoroid <- function(x, ...) {
  attr(x, "levels")[attr(x, "map")]
}

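# as.numeric() is the same function as as.double(), and S3 dispatch
# looks for as.double methods, so the method is defined under that name
as.double.factoroid <- function(x, ...) {
  as.double(unclass(x))
}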

format.factoroid <- function(x, ...) {
  as.character(x)
}

print.factoroid <- function(x, quote = FALSE, ...) {
  print(format(x), quote = quote, ...)
}

as.data.frame.factoroid <- function(x, ...) {
  structure(list(x), row.names = seq_along(x), class = "data.frame")
}

`[.factoroid` <- function(x, i, ...) {
  y <- unclass(x)[i]
  labs <- levels(x)[attr(x, "map")[i]]
  factoroid(labs, y)
}

Created on 2022-08-24 with reprex v2.0.2

CodePudding user response:

You can use a named vector to store this data.

int_data <- c(10,50,60)
names(int_data) <- c("Workers", "Farmers", "Teachers")

Which gives:

int_data
 Workers  Farmers Teachers 
      10       50       60 
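
To translate a vector of incoming codes into labels with such a named vector, one option is to look up the names by position with match(). This is a minimal sketch; the names codes, x and labels are made up for illustration and x stands in for your real data.

```r
codes <- c(Workers = 10, Farmers = 50, Teachers = 60)

# Hypothetical incoming data, encoded with the same integers
x <- c(10, 60, 50, 10)

# For each code in x, pick the name of the matching entry in codes
labels <- names(codes)[match(x, codes)]
labels
```

The original codes in x are left as they are, and the labels are only materialised when you ask for them.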