I am trying to solve the following problem. Here is the top of my data frame (df):
Col1 Col2
1 Basic ABC
2 B ABCD
3 B abc
4 B ab c
5 B AB12
Col2 is a string column. I want to convert each string to a unique number based on its content, like this:
Col1 Col2 Col3
1 Basic ABC 123
2 B ABCD 1234
3 B abc 272829
4 B ab c 2728029
5 B AB12 1212
...
As you can see, the strings can contain capital letters, lowercase letters, numbers, and spaces, each of which needs to be converted to a specific numeric value. It doesn't matter which numbers are generated; they only need to be unique.
The difficult part is that I need static numeric IDs, but my df is dynamic. That is, strings can be added or removed over time, but if, e.g., the string "dog" is added, it gets an ID (e.g., "789") that has never been and never will be used by another string. So the generated IDs must not depend on the size of Col2, the position of a string in that column, or any ordering, only on the content of the string itself.
Help is much appreciated.
CodePudding user response:
If you are just mapping characters within some master vector, then perhaps this:
chrs <- c(LETTERS, letters, 0:9)
quux$Col3 <- sapply(strsplit(quux$Col2, ""), function(z) paste(match(z, chrs, nomatch = 0L), collapse = ""))
quux
# Col1 Col2 Col3
# 1 Basic ABC 123
# 2 B ABCD 1234
# 3 B abc 272829
# 4 B ab c 2728029
# 5 B AB12 125455
or a dplyr variant, if you're already using it (but this varies very little):
library(dplyr)
quux %>%
  mutate(Col3 = sapply(strsplit(Col2, ""), function(z) paste(match(z, chrs, nomatch = 0L), collapse = "")))
# Col1 Col2 Col3
# 1 Basic ABC 123
# 2 B ABCD 1234
# 3 B abc 272829
# 4 B ab c 2728029
# 5 B AB12 125455
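One caveat with pasting variable-width positions together: different strings can end up with the same ID, e.g. "AB" (1, 2) and "L" (12) both become "12". A minimal sketch of one way around that, assuming it is acceptable to zero-pad every position to two digits (the helper name encode_fixed is made up for illustration and is not part of the approach above):

```r
chrs <- c(LETTERS, letters, 0:9)

# Hypothetical helper: encode each character as a fixed-width, two-digit
# position in `chrs` (at most 62 entries, so two digits are enough).
# Characters not in `chrs` (spaces, punctuation, ...) all become "00",
# so strings that differ only in such characters would still collide.
encode_fixed <- function(x, chrs) {
  vapply(strsplit(x, ""), function(z) {
    paste(sprintf("%02d", match(z, chrs, nomatch = 0L)), collapse = "")
  }, character(1))
}

encode_fixed(c("AB", "L", "ab c"), chrs)
# "0102" "12" "27280029"
```

The IDs are kept as character strings on purpose: converting them to numeric would drop leading zeros and lose precision for longer strings.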
However, as MrFlick suggested, perhaps what you really need is a hashing function?
sapply(quux$Col2, digest::digest, algo = "sha256")
# ABC
# "8fe32130ce14a3fb071473f9b718e403752f56c0f13081943d126ffb28a7b923"
# ABCD
# "942a0d444d8cf73354e5316517909d5f34b17963214a8f5b271375fe1da43013"
# abc
# "9f7b8da9f3abe2caaf5212f6b224448706de57b3c7b5dda916ee8d6005d9f24b"
# ab c
# "a3f1c49979af0fffa22f68028a42d302e0a675798ac4ac8a76bed392880af8f2"
# AB12
# "9ffbe9825833ab3c6b183f9986ab194a7aefcc06f5c940549a2c799dd4cd15b1"
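To convince yourself that a hash depends only on the content of the string and not on its position or on what else is in the column, here is a small sketch. It assumes the digest package is installed; hash_id is a hypothetical helper name, and serialize = FALSE is my choice here so that the string's bytes are hashed directly rather than the serialized R object, which means the values will generally differ from the ones printed above.

```r
library(digest)

# Hypothetical helper: one SHA-256 hex digest per string, computed from the
# string's bytes only (serialize = FALSE), returned as an unnamed vector.
hash_id <- function(x) {
  vapply(x, digest, character(1), algo = "sha256", serialize = FALSE,
         USE.NAMES = FALSE)
}

ids1 <- hash_id(c("ABC", "abc", "dog"))
ids2 <- hash_id(c("dog", "ABC"))  # different order, "abc" removed

identical(ids1[3], ids2[1])  # "dog" gets the same ID in both
identical(ids1[1], ids2[2])  # and so does "ABC"
```

Note that the result is a hexadecimal string rather than a number, so this only fits if a character ID is acceptable.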
Data
quux <- structure(list(Col1 = c("Basic", "B", "B", "B", "B"), Col2 = c("ABC", "ABCD", "abc", "ab c", "AB12")), row.names = c("1", "2", "3", "4", "5"), class = "data.frame")
CodePudding user response:
Ah, r2 beat me to it, but this is the same concept:
dd <- read.table(header = TRUE, text = "a Col1 Col2
1 Basic ABC
2 B ABCD
3 B abc
4 B 'ab c'
5 B AB12")
dd$Col2
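f <- function(x) {
  # split each string into single characters
  x <- strsplit(x, '')
  sapply(x, function(y)
    # map space to 0, A-Z to 1-26, a-z to 27-52, and digits to themselves
    factor(y, c(' ', LETTERS, letters, 0:9), c(0, 1:26, 27:52, 0:9)) |>
      as.character() |> paste0(collapse = ''))
}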
f(dd$Col2)
# [1] "123" "1234" "272829" "2728029" "1212"
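One thing to keep in mind with this mapping: it reproduces the example output from the question, but it is not guaranteed to be unique, because digits map to themselves and the codes have different widths. A quick sketch to illustrate, rewriting the same mapping as a named lookup vector (the names codes and encode are only for illustration):

```r
# Same mapping as above, as a named lookup vector:
# space -> 0, A-Z -> 1-26, a-z -> 27-52, digits -> themselves
codes <- setNames(c(0, 1:26, 27:52, 0:9), c(" ", LETTERS, letters, 0:9))

# Hypothetical helper: look up each character and paste the codes together.
# Characters outside the lookup would come out as "NA".
encode <- function(x) {
  vapply(strsplit(x, ""), function(y) paste0(codes[y], collapse = ""),
         character(1))
}

encode(c("AB", "12", "L"))
# "12" "12" "12"
```

So if uniqueness really matters, a fixed-width code per character or a hash is probably the safer route.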