Home > front end >  fast replacement of data.table values by labels stored in another data.table
fast replacement of data.table values by labels stored in another data.table

Time:03-13

It is related to this question and this other one, although to a larger scale. I have two data.tables:

  • The first one with market research data, containing answers stored as integers;
  • The second one being what can be called a dictionary, with category labels associated to the integers mentioned above.

See reproducible example : EDIT: Addition of a new variable to include the '0' case.

library(data.table)
library(magrittr)

# Table with survey data :
# - each observation contains the answers of a person
# - variables describe the sample population characteristics (gender, age...)
# - numeric variables (like age) are also stored as character vectors
repex_DT <- data.table (
  country = as.character(c(1,3,4,2,NA,1,2,2,2,4,NA,2,1,1,3,4,4,4,NA,1)),
  gender = as.character(c(NA,2,2,NA,1,1,1,2,2,1,NA,2,1,1,1,2,2,1,2,NA)),
  age = as.character(c(18,40,50,NA,NA,22,30,52,64,24,NA,38,16,20,30,40,41,33,59,NA)),
  age_group = as.character(c(1,2,2,NA,NA,1,1,2,2,1,NA,2,1,1,1,2,2,1,2,NA)),
  status = as.character(c(1,NA,2,9,2,1,9,2,2,1,9,2,1,1,NA,2,2,1,2,9)),
  children = as.character(c(0,2,3,1,6,1,4,2,4,NA,NA,2,1,1,NA,NA,3,5,2,1))
)

# Table of the labels associated to categorical variables, plus 'label_id' to match the values
labels_DT <- data.table (
  label_id = as.character(c(1:9)),
  country = as.character(c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4",NA,NA,NA,NA,NA)),
  gender = as.character(c("Male","Female",NA,NA,NA,NA,NA,NA,NA)),
  age_group = as.character(c("Less than 35","35 and more",NA,NA,NA,NA,NA,NA,NA)),
  status = as.character(c("Employed","Unemployed",NA,NA,NA,NA,NA,NA,"Do not want to say")),
  children = as.character(c("0","1","2","3","4","5 and more",NA,NA,NA))
)

# Identification of the variable nature (numeric or character)
var_type <- c("character","character","numeric","character","character","character")

# Identification of the categorical variable names
categorical_var <- names(repex_DT)[which(var_type == "character")]

You can see that the dictionary table is smaller to the survey data table, this is expected. Also, despite all variables being stored as character, some are true numeric variables like age, and consequently do not appear in the dictionary table. My objective is to replace the values of all variables of the first data.table with a matching name in the dictionary table by its corresponding label.

I have actually achieved it using a loop, like the one below:

result_DT1 <- copy(repex_DT) 
for (x in categorical_var){
  result_DT1[, (c(x)) := plyr::mapvalues(x=get(x), from=labels_DT$label_id, to=labels_DT[[x]], warn_missing = F)]
}

What I want is a faster method (the fastest if one exists), since I have thousands of variables to qualify for dozens of thousands of records. Any performance improvements would be more than welcome. I battled with stringi but could not have the function running without errors unless using hard-coded variable names. See example:

test_stringi <- copy(repex_DT) %>% 
  .[, (c("country")) := lapply(.SD, function(x) stringi::stri_replace_all_fixed(
    str=x, pattern=unique(labels_DT$label_id)[!is.na(labels_DT[["country"]])],
    replacement=unique(na.omit(labels_DT[["country"]])), vectorize_all=FALSE)),
    .SDcols = c("country")]

CodePudding user response:

Columns of your 2nd data.table are just look up vectors:

same_cols <- intersect(names(repex_DT), names(labels_DT))

repex_DT[
  , 
  (same_cols) := mapply(
    function(x, y) y[as.integer(x)], 
    repex_DT[, same_cols, with = FALSE], 
    labels_DT[, same_cols, with = FALSE],
    SIMPLIFY = FALSE
  )
]

edit

you can add NA on first position in columns of labels_DT (similar like you did for other missing values) or better yet you can keep labels in list:

labels_list <- list(
  country = c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4"),
  gender = c("Male","Female"),
  age_group = c("Less than 35","35 and more"),
  status = c("Employed","Unemployed","Do not want to say"),
  children = c("0","1","2","3","4","5 and more")
)

same_cols <- names(labels_list)

repex_DT[
  , 
  (same_cols) := mapply(
    function(x, y) y[factor(x)], 
    repex_DT[, same_cols, with = FALSE], 
    labels_list,
    SIMPLIFY = FALSE
  )
]

Notice that this way it is necessary to convert to factor first because values in repex_DT can be are not sequance 1, 2, 3...

CodePudding user response:

a very computationally effective way would be to melt your tables first, match them and cast again:

repex_DT[, idx:= .I] # Create an index used for melting
# Melt
repex_melt <- melt(repex_DT, id.vars = "idx")
labels_melt <- melt(labels_DT, id.vars = "label_id")
# Match variables and value/label_id
repex_melt[labels_melt, value2:= i.value, on= c("variable", "value==label_id")]
# Put the data back into its original shape
result <- dcast(repex_melt, idx~variable, value.var = "value2")
  • Related