Convert letters in a column of strings to numbers in R

I am trying to solve the following problem. Here is the top of my df:

  Col1    Col2
1 Basic    ABC
2     B   ABCD          
3     B    abc
4     B   ab c
5     B   AB12

Col2 is a string column. I now want to convert the strings to unique numbers, based on the specific characters in each string.

Like this:

  Col1    Col2       Col3
1 Basic    ABC        123
2     B   ABCD       1234          
3     B    abc     272829
4     B   ab c    2728029
5     B   AB12       1212
...

As you can see, the strings can contain capital letters, lowercase letters, numbers, and spaces, all of which need to be converted to a specific numeric value. It doesn't matter which numbers are generated; they only need to be unique.

The difficult part is that I need static numeric IDs, but my df is dynamic. That is, strings can be added or removed over time, but if, e.g., the string "dog" is added, it gets an ID (e.g. "789") that has never been and will never be used by another string. So the generated IDs must not depend on the size of Col2, the position of a string in that column, or any ordering, only on the content of the string itself.

Help is much appreciated

CodePudding user response：

If you are just mapping characters within some master vector, then perhaps this:

chrs <- c(LETTERS, letters, 0:9)
quux$Col3 <- sapply(strsplit(quux$Col2, ""), function(z) paste(match(z, chrs, nomatch = 0L), collapse = ""))
quux
#    Col1 Col2    Col3
# 1 Basic  ABC     123
# 2     B ABCD    1234
# 3     B  abc  272829
# 4     B ab c 2728029
# 5     B AB12  125455

or a dplyr variant, if you're already using it (the code barely changes):

library(dplyr)
quux %>%
  mutate(Col3 = sapply(strsplit(Col2, ""), function(z) paste(match(z, chrs, nomatch = 0L), collapse = "")))
#    Col1 Col2    Col3
# 1 Basic  ABC     123
# 2     B ABCD    1234
# 3     B  abc  272829
# 4     B ab c 2728029
# 5     B AB12  125455
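
One caveat with the mapping above: because the indices are pasted together without padding, different strings can produce the same ID. For instance, "AB" (1, 2) and "L" (12) both give "12". A minimal sketch of one way around this, assuming every character index fits in two digits, is to zero-pad each index to a fixed width. The helper name fixed_width_id is hypothetical and not part of the original answer.

```r
# Same lookup vector as in the answer above.
chrs <- c(LETTERS, letters, 0:9)

# Hypothetical helper: encode every character as a two-digit, zero-padded index.
# Characters not in `chrs` (such as a space) become "00", so strings that differ
# only in such characters would still collide; extend `chrs` if that matters.
fixed_width_id <- function(x, chrs) {
  vapply(strsplit(x, ""), function(z) {
    idx <- match(z, chrs, nomatch = 0L)
    paste(sprintf("%02d", idx), collapse = "")
  }, character(1))
}

fixed_width_id(c("AB", "L", "ab c"), chrs)
# "AB" -> "0102", "L" -> "12", "ab c" -> "27280029"
```

The result is best kept as a character column: longer strings yield more digits than R's numeric types can represent exactly.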

However, as MrFlick suggested, perhaps what you really need is a hashing function?

sapply(quux$Col2, digest::digest, algo = "sha256")
#                                                                ABC 
# "8fe32130ce14a3fb071473f9b718e403752f56c0f13081943d126ffb28a7b923" 
#                                                               ABCD 
# "942a0d444d8cf73354e5316517909d5f34b17963214a8f5b271375fe1da43013" 
#                                                                abc 
# "9f7b8da9f3abe2caaf5212f6b224448706de57b3c7b5dda916ee8d6005d9f24b" 
#                                                               ab c 
# "a3f1c49979af0fffa22f68028a42d302e0a675798ac4ac8a76bed392880af8f2" 
#                                                               AB12 
# "9ffbe9825833ab3c6b183f9986ab194a7aefcc06f5c940549a2c799dd4cd15b1"
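
Note that digest() serializes the R object before hashing by default, so the values above are hashes of the serialized object rather than of the raw text. If the IDs should match what other tools compute for the same text, a sketch along these lines, using digest's serialize = FALSE argument, may be closer to what is needed. The wrapper name hash_id is hypothetical.

```r
library(digest)  # same package as used in the answer above

# Hypothetical wrapper: SHA-256 of the raw characters of each string.
hash_id <- function(x) {
  vapply(x, digest, character(1),
         algo = "sha256", serialize = FALSE, USE.NAMES = FALSE)
}

hash_id("abc")
# "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
```

Either way, the hash depends only on the content of the string, which is the property the question asks for. The result is a hexadecimal string rather than a number.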

Data

quux <- structure(list(Col1 = c("Basic", "B", "B", "B", "B"), Col2 = c("ABC", "ABCD", "abc", "ab c", "AB12")), row.names = c("1", "2", "3", "4", "5"), class = "data.frame")

CodePudding user response：

Ah, r2 beat me to it, but this is the same concept:

dd <- read.table(header = TRUE, text = "a Col1    Col2
1 Basic    ABC
2     B   ABCD          
3     B    abc
4     B   'ab c'
5     B   AB12")

dd$Col2


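f <- function(x) {
  # Split each string into single characters
  x <- strsplit(x, '')
  sapply(x, function(y)
    # Relabel each character with its code, then glue the codes together
    factor(y, c(' ', LETTERS, letters, 0:9), c(0, 1:26, 27:52, 0:9)) |>
      as.character() |> paste0(collapse = ''))
}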

f(dd$Col2)
# [1] "123"     "1234"    "272829"  "2728029" "1212"