I have this dataset:
col_1 = as.factor(c("a", "a", "b", "c", "b", "a"))
col_2 = c(15, 346, 3564, 99, 10, 2)
col_3 = as.factor(c("bb", "a", "g", "f", "bb", "a"))
index = 1:6
sample_data = data.frame(index, col_1, col_2, col_3)
index col_1 col_2 col_3
1 a 15 bb
2 a 346 a
3 b 3564 g
4 c 99 f
5 b 10 bb
6 a 2 a
For each factor variable, I want to replace each value with a unique number. The end goal would look something like this:
col_1 = as.factor(c("1", "1", "2", "3", "2", "1"))
col_2 = c(15, 346, 3564, 99, 10, 2)
col_3 = as.factor(c("4", "5", "6", "7", "4", "5"))
index = 1:6
index col_1 col_2 col_3
1 1 15 4
2 1 346 5
3 2 3564 6
4 3 99 7
5 2 10 4
6 1 2 5
I have been thinking of "creative ways" to do this. For example:
col_1 = unique(sample_data$col_1)
index_col_1 = 1:length(col_1)
col_1_data = data.frame(col_1, index_col_1)
max_1 = max(index_col_1)
col_3 = unique(sample_data$col_3)
max_3 = length(col_3)
index_col_3 = seq(max_1 + 1, max_3 + max_1, by = 1)
col_2_data = data.frame(col_3, index_col_3)
merge_1 = merge(x = sample_data, y = col_1_data, by = "col_1")
merge_2 = merge(x = merge_1, y = col_2_data, by = "col_3")
# final result
output = merge_2[,c(3,5,4,6)]
index index_col_1 col_2 index_col_3
1 2 1 346 5
2 6 1 2 5
3 1 1 15 4
4 5 2 10 4
5 4 3 99 7
6 3 2 3564 6
In the end, this is a very slow way to do things (imagine if there were many columns). Can someone please suggest a faster approach?
Thank you!
CodePudding user response:
In base R:
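# Flag the factor columns.
indx <- vapply(sample_data, is.factor, logical(1))
# Convert them to character, stack them into long form, and build one key per (value, column) pair.
vec <- interaction(stack(type.convert(sample_data[,indx], as.is = TRUE)))
# Number the keys in order of first appearance and write them back into the factor columns.
sample_data[indx] <- match(vec, unique(vec))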
sample_data
index col_1 col_2 col_3
1 1 1 15 4
2 2 1 346 5
3 3 2 3564 6
4 4 3 99 7
5 5 2 10 4
6 6 1 2 5
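For anyone who finds the one-liner above hard to follow, here is a rough sketch of the same idea written as an explicit loop. The helper name recode_factors is hypothetical (it does not come from the posts above), and it assumes the numbering should follow the order of first appearance within each column, continuing from where the previous factor column stopped.

```r
# Rebuild the example data from the question.
sample_data <- data.frame(
  index = 1:6,
  col_1 = factor(c("a", "a", "b", "c", "b", "a")),
  col_2 = c(15, 346, 3564, 99, 10, 2),
  col_3 = factor(c("bb", "a", "g", "f", "bb", "a"))
)

# Hypothetical helper: number the values of each factor column in order of
# first appearance, continuing the count from one factor column to the next.
recode_factors <- function(df) {
  offset <- 0L
  factor_cols <- names(df)[vapply(df, is.factor, logical(1))]
  for (nm in factor_cols) {
    values <- as.character(df[[nm]])
    codes <- match(values, unique(values))  # 1, 2, ... by first appearance
    df[[nm]] <- codes + offset              # shift past the numbers already used
    offset <- offset + length(unique(values))
  }
  df
}

result <- recode_factors(sample_data)
result
```

As in the answer above, the recoded columns come back as integers rather than factors; wrapping the assignment in factor() should restore the factor type if that matters. The loop only touches factor columns, so col_2 and index are left alone.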