How to create a unique identifier for 100000 with 5 characters?

I have 100,000 individuals. Using a combination of uppercase letters, lowercase letters and digits, I want to create a five-character ID for each individual, with no duplicates. How can I do this? I tried the code below, but it produces 4 duplicates.

Also, how many unique 5-character IDs are possible using letters, LETTERS and 0:9?

    set.seed(0)
    
    mydata<-data.frame(
      ID=rep(NA,10^5),
      Poids=rnorm(n=10^5,mean = 65,sd=5)
    )
    
    
    for (i in 1:nrow(mydata)){
      
      mydata$ID[i]<-c(
        paste(sample(c(0:9,LETTERS,letters),replace = F,size = 1),             
              sample(c(0:9,LETTERS,letters),replace = F,size = 1),  
              sample(c(0:9,LETTERS,letters),replace = F,size = 1),               
              sample(c(0:9,LETTERS,letters),replace = F,size = 1),
              sample(c(0:9,LETTERS,letters),replace = F,size = 1),sep = "")
      )       
    }
    
    
    table(duplicated(mydata$ID))

FALSE  TRUE 
99996     4

CodePudding user response：

(length(letters) + length(LETTERS) + length(0:9))^5 is 916,132,832, so there is plenty of space to avoid clashes.

In fact, we can use this number to help generate our sample. We draw 100,000 integers out of 916,132,832 without replacement and interpret each number as its own unique string of characters, using a bit of modular arithmetic and indexing. This can all be done in a single pass:
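The size of the space also explains why the loop in the question produced a few duplicates: drawing 100,000 IDs independently, with replacement, is a birthday problem. A rough check, assuming the usual pair-counting approximation (the variable names are illustrative and not from the original code):

```r
alphabet <- c(LETTERS, letters, 0:9)

n_ids <- 1e5
space_size <- length(alphabet)^5          # 62^5 = 916132832

# Expected number of colliding pairs when n_ids IDs are drawn
# independently and uniformly from space_size possibilities.
expected_pairs <- n_ids * (n_ids - 1) / (2 * space_size)

space_size
expected_pairs                            # roughly 5.5
```

So a handful of clashes, such as the 4 seen in the question, is what independent draws should be expected to give. Sampling without replacement avoids them entirely.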

space <- c(LETTERS, letters, 0:9)

set.seed(0)

samps <- sample(length(space)^5, 10^5)

m <- matrix("", nrow = 10^5, ncol = 5)

for(i in seq(ncol(m))) {
  m[,i] <- space[(samps %% length(space)) + 1]
  samps <- samps %/% length(space)
}

ID <- apply(m, 1, paste, collapse = "")

We can see this fulfils our requirements:

head(ID)
#> [1] "vpdnq" "rK0ej" "ofE9t" "PqLIr" "6G6tu" "Vhc7R"

length(ID)
#> [1] 100000

length(unique(ID))
#> [1] 100000

The whole thing takes less than a second on my modest machine:

   user  system elapsed 
   0.72    0.00    0.74

Created on 2022-05-15 by the reprex package (v2.0.1)
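If the same idea is needed for other sizes or alphabets, it could be wrapped in a small helper. This is only a sketch: make_ids and its arguments are hypothetical names, and it builds the strings with paste0 instead of a matrix, but it follows the same steps of drawing distinct integers and reading each one as base-62 digits.

```r
# Hypothetical helper: n unique random IDs of k characters each.
make_ids <- function(n, k = 5, alphabet = c(LETTERS, letters, 0:9)) {
  base <- length(alphabet)
  stopifnot(n <= base^k)                  # cannot have more IDs than combinations

  # n distinct integers in 0 .. base^k - 1
  codes <- sample.int(base^k, n) - 1

  ids <- character(n)
  for (pos in seq_len(k)) {
    # take the lowest base-`base` digit, prepend its character, then shift
    ids <- paste0(alphabet[codes %% base + 1], ids)
    codes <- codes %/% base
  }
  ids
}

set.seed(0)
ids <- make_ids(1e5)
anyDuplicated(ids)                        # 0 means no duplicates
```

Because distinct integers below base^k have distinct k-digit representations, uniqueness follows from sampling without replacement and needs no retry loop.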

CodePudding user response：

You can try the code below (given N <- 1e5 and k <- 5):

n <- ceiling(N^(1 / k))
S <- sample(c(LETTERS, letters, 0:9), n)
ID <- head(do.call(paste0, expand.grid(rep(list(S), k))),N)

where

n is the size of a character subset large enough that n^k >= N, so it can supply N unique combinations (e.g., for N <- 100000)
S is the random subset of n letters or digits from which the IDs are built
expand.grid generates all combinations of the characters in S, and head keeps the first N of them
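To see what this approach produces, here is a self-contained run with a quick check. The seed, the explicit stringsAsFactors = FALSE and the chars_used variable are additions for illustration, not part of the answer above.

```r
set.seed(0)
N <- 1e5
k <- 5

n <- ceiling(N^(1 / k))                   # subset size, so that n^k >= N
S <- sample(c(LETTERS, letters, 0:9), n)  # random subset of the 62 characters

# All k-length combinations of S, pasted row-wise, truncated to N IDs
grid <- expand.grid(rep(list(S), k), stringsAsFactors = FALSE)
ID <- head(do.call(paste0, grid), N)

length(unique(ID)) == N                   # TRUE: all IDs are distinct
chars_used <- unique(unlist(strsplit(ID, "")))
length(chars_used) <= n                   # TRUE: only characters from S appear
```

The IDs are unique, but they come from a small subset of the alphabet and appear in expand.grid order, so this suits cases where the IDs do not need to look random.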

CodePudding user response：

If you don't need randomness, the highly performant arrangements package can help by iterating over the permutations in order, not generating any more than are needed:

library(arrangements)

x = c(letters, LETTERS, 0:9)
ix = ipermutations(x = x, k = 5)

ind = ix$getnext(d = nrow(mydata))
mydata$ID = apply(ind, MAR = 1, FUN = \(i) paste(x[i], collapse = ""))

rbind(head(mydata), tail(mydata))
#           ID    Poids
# 1      abcde 64.46278
# 2      abcdf 62.00053
# 3      abcdg 75.71787
# 4      abcdh 67.73765
# 5      abcdi 66.45402
# 6      abcdj 66.85561
# 99995  abFpe 56.20545
# 99996  abFpf 64.14443
# 99997  abFpg 70.70191
# 99998  abFph 66.83226
# 99999  abFpi 65.22835
# 100000 abFpj 56.28880

This is quite fast:

  user  system elapsed 
  0.194   0.001   0.203