Sequentially Replacing Factor Variables with Numerical Values

Time:07-03

I have this dataset:

col_1 = as.factor(c("a", "a", "b", "c", "b", "a"))
col_2 = c(15, 346, 3564, 99, 10, 2)
col_3 = as.factor(c("bb", "a", "g", "f", "bb", "a"))
index = 1:6

sample_data = data.frame(index, col_1, col_2, col_3)

  index col_1 col_2 col_3
     1     a    15    bb
     2     a   346     a
     3     b  3564     g
     4     c    99     f
     5     b    10    bb
     6     a     2     a

For each factor variable, I want to replace each value with a unique number. The end goal would look something like this:

col_1 = as.factor(c("1", "1", "2", "3", "2", "1"))
col_2 = c(15, 346, 3564, 99, 10, 2)
col_3 = as.factor(c("4", "5", "6", "7", "4", "5"))
index = 1:6

  index col_1 col_2 col_3
     1     1    15     4
     2     1   346     5
     3     2  3564     6
     4     3    99     7
     5     2    10     4
     6     1     2     5

I have been thinking of "creative ways" to do this. For example:

# lookup table for col_1: one integer per unique value
col_1 = unique(sample_data$col_1)
index_col_1 = 1:length(col_1)
col_1_data = data.frame(col_1, index_col_1)

# lookup table for col_3, numbered from where col_1 left off
max_1 = max(index_col_1)
col_3 = unique(sample_data$col_3)
max_3 = length(col_3)
index_col_3 = seq(max_1 + 1, max_3 + max_1, by = 1)
col_2_data = data.frame(col_3, index_col_3)

merge_1 = merge(x = sample_data, y = col_1_data, by = "col_1")
merge_2 = merge(x = merge_1, y = col_2_data, by = "col_3")

# final result
output = merge_2[, c(3, 5, 4, 6)]

  index index_col_1 col_2 index_col_3
1     2           1   346           5
2     6           1     2           5
3     1           1    15           4
4     5           2    10           4
5     4           3    99           7
6     3           2  3564           6

In the end, this is a very slow way to do things (e.g. imagine if there were many columns). Can someone please suggest some ways to speed this up?

Thank you!

Answer:

In base R:

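# flag the factor columns
indx <- vapply(sample_data, is.factor, logical(1))
# stack them into one long (value, column) table and combine each pair into a single key
vec <- interaction(stack(type.convert(sample_data[,indx], as.is = TRUE)))
# number the keys by order of first appearance and write them back
sample_data[indx] <- match(vec, unique(vec))
sample_data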

  index col_1 col_2 col_3
1     1     1    15     4
2     2     1   346     5
3     3     2  3564     6
4     4     3    99     7
5     5     2    10     4
6     6     1     2     5
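
If the one-liner above feels dense, the same numbering can be sketched as an explicit loop over the factor columns, with a running offset so that each column's codes start where the previous column's ended. This is only an illustrative alternative, not the code from the answer; the names `result`, `offset`, `values` and `codes` are introduced here for the example.

```r
# Rebuild the example data from the question
sample_data <- data.frame(
  index = 1:6,
  col_1 = as.factor(c("a", "a", "b", "c", "b", "a")),
  col_2 = c(15, 346, 3564, 99, 10, 2),
  col_3 = as.factor(c("bb", "a", "g", "f", "bb", "a"))
)

result <- sample_data
offset <- 0L  # highest code handed out so far

# visit only the factor columns, left to right
for (nm in names(result)[vapply(result, is.factor, logical(1))]) {
  values <- as.character(result[[nm]])
  # number values by order of first appearance, shifted by the running offset
  codes <- match(values, unique(values)) + offset
  result[[nm]] <- codes
  offset <- max(codes)
}

result
```

The replaced columns come back as integers. If factors are needed, as in the target shown in the question, wrapping `codes` in `as.factor()` before assigning should do it.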