I have 100,000 individuals. Using a combination of upper-case letters, lower-case letters and numbers, I want to create a five-character ID for each individual, with no duplicates. How can I do this? I have tried the code below, but I get 4 duplicates.
Also, what is the number of possible unique combinations for a 5-character ID built from "letters", "LETTERS" and "0:9"?
set.seed(0)
mydata<-data.frame(
ID=rep(NA,10^5),
Poids=rnorm(n=10^5,mean = 65,sd=5)
)
for (i in 1:nrow(mydata)){
mydata$ID[i]<-c(
paste(sample(c(0:9,LETTERS,letters),replace = F,size = 1),
sample(c(0:9,LETTERS,letters),replace = F,size = 1),
sample(c(0:9,LETTERS,letters),replace = F,size = 1),
sample(c(0:9,LETTERS,letters),replace = F,size = 1),
sample(c(0:9,LETTERS,letters),replace = F,size = 1),sep = "")
)
}
table(duplicated(mydata$ID))
FALSE TRUE
99996 4
CodePudding user response:
(length(letters) + length(LETTERS) + length(0:9))^5
is 916,132,832, so there is plenty of space to avoid clashes.
In fact, we can use this number to help generate our sample. We draw 100,000 integers out of 916,132,832 without replacement and interpret each number as its unique string of characters using a bit of modular arithmetic and indexing. This can all be done in a single pass:
space <- c(LETTERS, letters, 0:9)
set.seed(0)
samps <- sample(length(space)^5, 10^5)
m <- matrix("", nrow = 10^5, ncol = 5)
for(i in seq(ncol(m))) {
m[,i] <- space[(samps %% length(space)) + 1]
samps <- samps %/% length(space)
}
ID <- apply(m, 1, paste, collapse = "")
We can see this fulfils our requirements:
head(ID)
#> [1] "vpdnq" "rK0ej" "ofE9t" "PqLIr" "6G6tu" "Vhc7R"
length(ID)
#> [1] 100000
length(unique(ID))
#> [1] 100000
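To see why distinct integers always give distinct strings, here is a minimal sketch of the same decoding step on a tiny alphabet. The helper name int_to_id is hypothetical (it is not part of the code above), and the three-letter alphabet is only for illustration:

```r
# Hypothetical helper: decode integers into fixed-width strings by reading
# off their digits in base length(alphabet), least significant digit first.
int_to_id <- function(x, alphabet, width) {
  base <- length(alphabet)
  out <- character(length(x))
  for (i in seq_len(width)) {
    out <- paste0(out, alphabet[(x %% base) + 1])  # current digit -> character
    x <- x %/% base                                # shift to the next digit
  }
  out
}

small <- c("a", "b", "c")
ids <- int_to_id(0:8, small, width = 2)  # all 3^2 integers
ids
anyDuplicated(ids)  # 0 means every integer got its own string
```

Each integer below length(alphabet)^width has its own digit sequence, so sampling the integers without replacement is enough to guarantee unique IDs.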
The whole thing takes less than a second on my modest machine:
user system elapsed
0.72 0.00 0.74
Created on 2022-05-15 by the reprex package (v2.0.1)
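As for why the loop in the question produced a few duplicates despite the large space: each ID there is drawn independently, so collisions are possible. A rough estimate, assuming the standard birthday-problem approximation N(N-1)/(2M) for the expected number of colliding pairs:

```r
M <- 62^5   # number of possible 5-character IDs
N <- 1e5    # number of IDs drawn independently

# Approximate expected number of pairs of draws that land on the same ID
expected_pairs <- N * (N - 1) / (2 * M)
expected_pairs
```

This comes out at roughly 5, which is the same order of magnitude as the 4 duplicates observed in the question.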
CodePudding user response:
You can try the code below (given N <- 1e5 and k <- 5):
n <- ceiling(N^(1 / k))
S <- sample(c(LETTERS, letters, 0:9), n)
ID <- head(do.call(paste0, expand.grid(rep(list(S), k))),N)
where
n is the size of a subset of the whole space that supports all unique combinations up to the given number N (e.g., N <- 100000),
S denotes the sub-space from which we draw the letters or digits, and
expand.grid gives all combinations.
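A quick check of this approach, as a sketch: the set.seed call and the last two checks are my additions, not part of the code above.

```r
N <- 1e5
k <- 5
set.seed(0)  # added only to make the draw of S reproducible

n <- ceiling(N^(1 / k))
S <- sample(c(LETTERS, letters, 0:9), n)
ID <- head(do.call(paste0, expand.grid(rep(list(S), k))), N)

length(ID)                             # N IDs
anyDuplicated(ID)                      # 0 means no duplicates
length(unique(unlist(strsplit(ID, ""))))  # distinct characters actually used
```

Note the trade-off: the IDs are unique, but they only use the n characters in S and come out in grid order, so they look much less random than IDs sampled from the full space.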
CodePudding user response:
If you don't need randomness, the highly performant arrangements package can help by iterating over the permutations in order, not generating any more than are needed:
library(arrangements)
x = c(letters, LETTERS, 0:9)
ix = ipermutations(x = x, k = 5)
ind = ix$getnext(d = nrow(mydata))
mydata$ID = apply(ind, MAR = 1, FUN = \(i) paste(x[i], collapse = ""))
rbind(head(mydata), tail(mydata))
# ID Poids
# 1 abcde 64.46278
# 2 abcdf 62.00053
# 3 abcdg 75.71787
# 4 abcdh 67.73765
# 5 abcdi 66.45402
# 6 abcdj 66.85561
# 99995 abFpe 56.20545
# 99996 abFpf 64.14443
# 99997 abFpg 70.70191
# 99998 abFph 66.83226
# 99999 abFpi 65.22835
# 100000 abFpj 56.28880
This is quite fast:
user system elapsed
0.194 0.001 0.203