Suppose I have data that comes encoded as integers, e.g.
int_data = c(10,50,60)
And suppose I know that 10 denotes "Workers", 50 denotes "Farmers", and 60 denotes "Teachers".
I now want a factor f with the following attributes:
as.integer(f) is c(10, 50, 60)
levels(f) is c("Workers", "Farmers", "Teachers") (in that order!)
With everything I have found so far, I could maybe create a factor with the desired labels, but the internal integer codes would become c(1, 2, 3), which I don't want.
Any suggestions?
EDIT: My main goal is to have the data in a more reader-friendly form (as one would have with factors displaying levels rather than integers) whilst not altering the integer encoding I receive. Since I am working with very large data sets, simply keeping a "clear text duplicate" character column is not feasible.
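A minimal sketch of the behaviour described above, using only base R: a plain factor() call attaches the desired labels but renumbers the underlying codes by level position.

```r
int_data <- c(10, 50, 60)

# factor() maps each listed level to its position, so the codes become 1, 2, 3
f <- factor(int_data,
            levels = c(10, 50, 60),
            labels = c("Workers", "Farmers", "Teachers"))

levels(f)      # "Workers" "Farmers" "Teachers"
as.integer(f)  # 1 2 3, not 10 50 60
```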
CodePudding user response:
Here's an example of using a lookup table to add a column that holds the reader-friendly label for each code.
my_data <- data.frame(int_data = c(10, 50, 60, 10, 60, 50))
lookup <- data.frame(int_data = c(10, 50, 60),
                     category = c("Workers", "Farmers", "Teachers"))
# Name the join key explicitly so dplyr does not have to guess it
my_data_annotated <- dplyr::left_join(my_data, lookup, by = "int_data")
my_data_annotated
  int_data category
1       10  Workers
2       50  Farmers
3       60 Teachers
4       10  Workers
5       60 Teachers
6       50  Farmers
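If storing the extra text column is too costly for very large data, one possible variation is to keep only the small lookup table and translate codes on demand. The sketch below uses base R's match(); the helper name label_of is made up for illustration and is not part of the answer above.

```r
my_data <- data.frame(int_data = c(10, 50, 60, 10, 60, 50))
lookup  <- data.frame(int_data = c(10, 50, 60),
                      category = c("Workers", "Farmers", "Teachers"),
                      stringsAsFactors = FALSE)

# Hypothetical helper: translate integer codes to labels via the lookup table.
# Codes that are missing from the table come back as NA.
label_of <- function(codes, lookup) {
  lookup$category[match(codes, lookup$int_data)]
}

# Labels are produced only when needed, e.g. for the first three rows
label_of(head(my_data$int_data, 3), lookup)  # "Workers" "Farmers" "Teachers"
```

The data itself keeps only the integer column; the labels exist once, in the lookup table.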
CodePudding user response:
If you really wanted to, you could define your own S3 class that meets your specification. Let's say it was called factoroid and behaved like this:
df <- data.frame(
  col = factoroid(c("Workers", "Farmers", "Teachers"), c(10, 50, 60))
)
df
#> col
#> 1 Workers
#> 2 Farmers
#> 3 Teachers
df$col[2:3]
#> [1] Farmers Teachers
levels(df$col)
#> [1] "Workers" "Farmers" "Teachers"
as.integer(df$col)
#> [1] 10 50 60
A very basic implementation would be something like this:
factoroid <- function(labels, values = seq_along(unique(labels))) {
  # Position of each label within the distinct labels (order of first appearance)
  lab_match <- match(labels, unique(labels))
  # Each distinct label must pair with exactly one distinct value
  if (length(unique(labels)) != length(unique(values))) {
    stop("Values must have same number of unique elements as Labels")
  }
  structure(unique(values)[lab_match], levels = unique(labels),
            map = lab_match, class = "factoroid")
}
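To see how the constructor pairs labels with values, the two indexing steps can be run on their own. A small sketch with made-up input that contains a repeated label:

```r
labels <- c("Workers", "Farmers", "Workers", "Teachers")
values <- c(10, 50, 10, 60)

# Position of each label within the distinct labels, in order of first appearance
lab_match <- match(labels, unique(labels))
lab_match                   # 1 2 1 3

# Indexing the distinct values by those positions recovers the original codes
unique(values)[lab_match]   # 10 50 10 60
```

This only works when labels and values correspond one-to-one, which is what the length check in the constructor is meant to guard.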
You would also need, at a minimum, the following methods:
as.character.factoroid <- function(x, ...) {
  attr(x, "levels")[attr(x, "map")]
}
# as.numeric() dispatches to as.double methods, so the method is defined under that name
as.double.factoroid <- function(x, ...) {
  as.double(unclass(x))
}
format.factoroid <- function(x, ...) {
  as.character(x)
}
print.factoroid <- function(x, quote = FALSE, ...) {
  print(format(x), quote = quote, ...)
  invisible(x)
}
as.data.frame.factoroid <- function(x, ...) {
  structure(list(x), row.names = seq_along(x), class = "data.frame")
}
`[.factoroid` <- function(x, i, ...) {
  # Subset the raw values and the matching labels, then rebuild the object
  y <- unclass(x)[i]
  labs <- levels(x)[attr(x, "map")[i]]
  factoroid(labs, y)
}
Created on 2022-08-24 with reprex v2.0.2
CodePudding user response:
You can use a named vector to store this data.
int_data <- c(10,50,60)
names(int_data) <- c("Workers", "Farmers", "Teachers")
Which gives:
int_data
 Workers  Farmers Teachers 
      10       50       60 
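The named vector can then serve as a lookup in both directions. A short sketch in base R; the codes vector is made-up sample data:

```r
int_data <- c(Workers = 10, Farmers = 50, Teachers = 60)

codes <- c(10, 60, 60, 50)   # made-up incoming data

# code -> label: find each code's position in the vector, then take its name
names(int_data)[match(codes, int_data)]    # "Workers" "Teachers" "Teachers" "Farmers"

# label -> code: index by name; unname() drops the names from the result
unname(int_data[c("Farmers", "Workers")])  # 50 10
```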