Updating factor levels from another table with data.table

I want to update the factor levels of the non-numeric columns of one table using the levels found in another table.

Here is what I tried:

set.seed(1453)

library(data.table)

bigger_table <- data.table(region = paste0(rep('region_',50),sample(1:4,50,replace=T)),
                           factor_column = factor(sample(LETTERS[1:3],50,replace=T)),
                           numeric_column = rnorm(50,20,2))
                           
subset_table <- bigger_table[region=='region_1']

nonnumeric_column <- names(bigger_table)[sapply(bigger_table,function(x) !is.numeric(x))]
                                                
subset_table[,(nonnumeric_column) := lapply(.SD,function(x) factor(x,levels = unique(bigger_table[,x]))),.SDcols=nonnumeric_column]

But it fails with an error.

In my desired output, the region column of the subset table should be a factor with the levels region_1, region_2, region_3, region_4, which are derived from bigger_table.

Thanks in advance.

CodePudding user response：

Wouldn't factor() with an explicit levels argument offer a simple solution?

subset_table$region <- factor(subset_table$region, levels = unique(bigger_table$region))

If the issue is doing it on multiple columns, then a dplyr solution would be:

library(dplyr)

subset_table <- subset_table |>
  mutate(across(all_of(nonnumeric_column), ~ factor(.x, levels = unique(bigger_table$region))))

Note that your nonnumeric_column includes "factor_column", whose values are not region names, so with this code that column is changed to all <NA>.
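
One way to avoid the all-<NA> result is to look up the levels per column instead of always using region. The following is only a sketch under assumptions: it uses small deterministic data frames (bigger_df and subset_df are hypothetical stand-ins for the tables in the question) and dplyr's cur_column(), which returns the name of the column currently being processed inside across().

```r
library(dplyr)

# Hypothetical toy data standing in for bigger_table (deterministic, no sampling)
bigger_df <- data.frame(
  region         = rep(paste0("region_", 1:4), times = 3),
  factor_column  = factor(rep(c("A", "B", "C"), times = 4)),
  numeric_column = seq(18, 23.5, by = 0.5)
)

subset_df <- bigger_df[bigger_df$region == "region_1", ]
nonnumeric_column <- names(bigger_df)[!sapply(bigger_df, is.numeric)]

# For each non-numeric column, take the levels from the column of the
# same name in bigger_df rather than from bigger_df$region
subset_df <- subset_df |>
  mutate(across(all_of(nonnumeric_column),
                ~ factor(.x, levels = unique(bigger_df[[cur_column()]]))))

levels(subset_df$region)
levels(subset_df$factor_column)
```

Because unique() keeps values in order of first appearance, the level order follows the row order of the reference table; wrap it in sort() if alphabetical levels are wanted.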

CodePudding user response：

In your MWE, the easiest solution is to create the factor in the bigger table; every subset of your data then retains its levels.

# Easiest solution: Create a factor in the original table and the subset retains the levels
library(data.table)
set.seed(1453)
bigger_table <- data.table(region = factor(paste0(rep('region_',50),sample(1:4,50,replace=T))),
                           factor_column = factor(sample(LETTERS[1:3],50,replace=T)),
                           numeric_column = rnorm(50,20,2))

subset_table <- bigger_table[region=='region_1']
subset_table$region
#>  [1] region_1 region_1 region_1 region_1 region_1 region_1 region_1 region_1
#>  [9] region_1 region_1 region_1 region_1 region_1 region_1 region_1
#> Levels: region_1 region_2 region_3 region_4

Created on 2021-11-11 by the reprex package (v2.0.1)

If you have to update them from another table, you can use the following code. Your attempt raised an error because, in unique(bigger_table[,x]), x is not the column name but the content of that column.

# Update a table with the factor levels of another table
library(data.table)
set.seed(1453)
bigger_table <- data.table(region = paste0(rep('region_',50),sample(1:4,50,replace=T)),
                           factor_column = factor(sample(LETTERS[1:3],50,replace=T)),
                           numeric_column = rnorm(50,20,2))

subset_table <- bigger_table[region=='region_1']
nonnumeric_column <- names(bigger_table)[sapply(bigger_table,function(x) !is.numeric(x))]

for(col in nonnumeric_column) {
  set(subset_table, j = col, value = factor(subset_table[, get(col)], levels = bigger_table[, unique(get(col))]))
}

subset_table$region
#>  [1] region_1 region_1 region_1 region_1 region_1 region_1 region_1 region_1
#>  [9] region_1 region_1 region_1 region_1 region_1 region_1 region_1
#> Levels: region_1 region_3 region_2 region_4

Created on 2021-11-11 by the reprex package (v2.0.1)
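
For readers who prefer the `:=` / .SDcols form from the question, here is a possible sketch of the same idea. It rests on assumptions: the toy table is deterministic rather than the sampled one above, and ref_levels and level_list are hypothetical helper names, not part of data.table. The levels are computed from bigger_table first and then paired with the columns of .SD via Map(), so no column has to be looked up by name inside the function.

```r
library(data.table)

# Deterministic toy data (hypothetical, not the random table from the question)
bigger_table <- data.table(
  region         = rep(paste0("region_", c(3, 1, 4, 2)), times = 3),
  factor_column  = factor(rep(c("A", "B", "C"), times = 4)),
  numeric_column = seq(18, 23.5, by = 0.5)
)

subset_table <- bigger_table[region == "region_1"]
cols <- names(bigger_table)[!sapply(bigger_table, is.numeric)]

# Hypothetical helper: keep existing factor levels, otherwise use sorted unique values
ref_levels <- function(ref) if (is.factor(ref)) levels(ref) else sort(unique(ref))

# Named list holding one vector of levels per non-numeric column
level_list <- lapply(bigger_table[, cols, with = FALSE], ref_levels)

# Pair each column of the subset (.SD) with its reference levels
subset_table[, (cols) := Map(function(x, lv) factor(x, levels = lv), .SD, level_list),
             .SDcols = cols]

levels(subset_table$region)
```

Sorting the unique values gives the level order region_1, region_2, region_3, region_4 asked for in the question, whereas a bare unique() returns the levels in order of first appearance, as in the output above.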