Updating factor levels from another table by data.table

Time:11-11

I want to update the factor levels of the non-numeric columns of one table, using the levels found in another table.

Here is what I tried:

set.seed(1453)

library(data.table)

bigger_table <- data.table(region = paste0(rep('region_',50),sample(1:4,50,replace=T)),
                           factor_column = factor(sample(LETTERS[1:3],50,replace=T)),
                           numeric_column = rnorm(50,20,2))
                           
subset_table <- bigger_table[region=='region_1']

nonnumeric_column <- names(bigger_table)[sapply(bigger_table,function(x) !is.numeric(x))]
                                                
subset_table[,(nonnumeric_column) := lapply(.SD,function(x) factor(x,levels = unique(bigger_table[,x]))),.SDcols=nonnumeric_column]

but it fails with an error.

In my desired output, the region column of the subset table should be a factor with the levels region_1, region_2, region_3 and region_4, which are derived from bigger_table.

Thanks in advance.

CodePudding user response:

Wouldn't factor() with an explicit levels argument offer a simple solution?

subset_table$region <- factor(subset_table$region, levels = unique(bigger_table$region))

If the issue is doing it on multiple columns, then a dplyr solution would be:

library(dplyr)

subset_table <- subset_table |>
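  mutate(across(all_of(nonnumeric_column), ~ factor(.x, levels = unique(bigger_table[[cur_column()]]))))

Note that your nonnumeric_column includes "factor_column", so each column has to take its levels from the column of the same name in bigger_table, which is what cur_column() does here. Using unique(bigger_table$region) for every column would turn factor_column into all <NA>, because its values do not match any region.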
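To illustrate the <NA> behaviour, here is a small base R sketch. The toy vectors are made up for this example and are not taken from the tables above. Values that are missing from levels become NA, while unused levels are kept.

```r
# Hypothetical toy vectors, for illustration only
region_levels <- c("region_1", "region_2", "region_3", "region_4")
letters_col   <- c("A", "B", "C")

# Values found in `levels` are kept, and the unused levels are retained as well
ok <- factor(c("region_1", "region_1"), levels = region_levels)
levels(ok)

# None of these values occur in `levels`, so every element becomes NA
bad <- factor(letters_col, levels = region_levels)
all(is.na(bad))
```

So a column only converts cleanly when it is matched against levels taken from its own counterpart.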

CodePudding user response:

In your MWE, the easiest solution is to create the factor in the bigger table. Every subset of the data then retains the full set of factor levels.

# Easiest solution: Create a factor in the original table and the subset retains the levels
library(data.table)
set.seed(1453)
bigger_table <- data.table(region = factor(paste0(rep('region_',50),sample(1:4,50,replace=T))),
                           factor_column = factor(sample(LETTERS[1:3],50,replace=T)),
                           numeric_column = rnorm(50,20,2))

subset_table <- bigger_table[region=='region_1']
subset_table$region
#>  [1] region_1 region_1 region_1 region_1 region_1 region_1 region_1 region_1
#>  [9] region_1 region_1 region_1 region_1 region_1 region_1 region_1
#> Levels: region_1 region_2 region_3 region_4

Created on 2021-11-11 by the reprex package (v2.0.1)
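The same behaviour can be checked on a plain factor without data.table. This is a minimal base R sketch with a made-up vector: subsetting keeps all levels, and droplevels() is what you would call if you wanted to discard the unused ones.

```r
# Toy factor, made up for illustration
region <- factor(c("region_1", "region_3", "region_1", "region_2", "region_4"))

# Logical subsetting keeps the full set of levels
only_r1 <- region[region == "region_1"]
levels(only_r1)

# droplevels() discards the levels that no longer occur in the data
levels(droplevels(only_r1))
```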

If you have to update the levels from another table, you can use the following code. Your attempt failed because, inside lapply(.SD, function(x) ...), x holds the values of the column and not its name, so unique(bigger_table[, x]) does not select the matching column of bigger_table.

# Update a table with the factor levels of another table
library(data.table)
set.seed(1453)
bigger_table <- data.table(region = paste0(rep('region_',50),sample(1:4,50,replace=T)),
                           factor_column = factor(sample(LETTERS[1:3],50,replace=T)),
                           numeric_column = rnorm(50,20,2))

subset_table <- bigger_table[region=='region_1']
nonnumeric_column <- names(bigger_table)[sapply(bigger_table,function(x) !is.numeric(x))]

for(col in nonnumeric_column) {
  set(subset_table, j = col, value = factor(subset_table[, get(col)], levels = bigger_table[, unique(get(col))]))
}

subset_table$region
#>  [1] region_1 region_1 region_1 region_1 region_1 region_1 region_1 region_1
#>  [9] region_1 region_1 region_1 region_1 region_1 region_1 region_1
#> Levels: region_1 region_3 region_2 region_4

Created on 2021-11-11 by the reprex package (v2.0.1)
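If you would rather keep the := form of your original attempt, one possible variant is to pair each column of .SD with its name via Map(), so that the lookup in bigger_table happens by name. This is only a sketch under the same setup as above. The variable name cols is my own, and I have not checked it against every data.table version.

```r
library(data.table)

set.seed(1453)
bigger_table <- data.table(
  region = paste0("region_", sample(1:4, 50, replace = TRUE)),
  factor_column = factor(sample(LETTERS[1:3], 50, replace = TRUE)),
  numeric_column = rnorm(50, 20, 2)
)
subset_table <- bigger_table[region == "region_1"]

# Names of the non-numeric columns (hypothetical helper variable)
cols <- names(bigger_table)[!sapply(bigger_table, is.numeric)]

# Map() walks over the columns of .SD and their names in parallel,
# so each column gets the levels of the same-named column in bigger_table
subset_table[, (cols) := Map(
  function(values, col_name) {
    factor(values, levels = unique(bigger_table[[col_name]]))
  },
  .SD, cols
), .SDcols = cols]

levels(subset_table$region)
```

As with the set() loop, the levels appear in the order in which unique() first encounters them in bigger_table, so wrap unique() in sort() if you need them sorted.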
