R - map vector of unique values to dataframe column with duplicates

Time:12-04

I have a column in a dataframe that is a character vector. I would like to add to my dataframe a column containing unique ID values/codes corresponding to each unique value in said column. Here is some toy data:

fnames <- c("joey", "joey", "joey", "jimmy", "jimmy", "tommy", "michael", "michael", "michael", "michael", "michael", "kevin", "kevin", "christopher", "aaron", "joshua", "joshua", "joshua", "arvid", "aiden", "kentavious", "lawrence", "xavier")

names <- as.data.frame(fnames)

To get the number of unique values of fnames I run:

unique_fnames <- length(unique(names$fnames))

To generate unique IDs for each unique name, I found the following function:

create_unique_ids <- function(n, seed_no = 16169, char_len = 6){
  set.seed(seed_no)
  pool <- c(letters, LETTERS, 0:9)
  
  res <- character(n)
  for(i in seq(n)){
    this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    while(this_res %in% res){
      this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    }
    res[i] <- this_res
  }
  res
}

Applying create_unique_ids to unique_fnames I get the desired number of ID codes:

unique_fname_id <- create_unique_ids(unique_fnames)

My question is this:

How do I add the vector of unique_fname_id to my dataframe names? The desired result is a dataframe names with a unique_fname_id column that looks something like this:

unique_fname_id <- c("VvWMKt", "VvWMKt", "VvWMKt", "yEbpFq", "yEbpFq", "Z3xCdO"...)

where "VvWMKt" corresponds to "joey", "yEbpFq" corresponds to "jimmy" and so on. The dataframe names would be the same length as the original, just with this added column.

Is there a way to do this? All suggestions are welcome and appreciated. Thanks!

Edit: I need to keep the set.seed call in the create_unique_ids function so that the generated IDs are reproducible from run to run.

CodePudding user response:

A crude approach is to build a lookup table with one row per unique name and its ID, then left join it back onto the original dataframe:

library(tidyverse)

fnames <- c("joey", "joey", "joey", "jimmy", "jimmy", "tommy", "michael", "michael", "michael", "michael", "michael", "kevin", "kevin", "christopher", "aaron", "joshua", "joshua", "joshua", "arvid", "aiden", "kentavious", "lawrence", "xavier")

names <- as.data.frame(fnames)


unique_names <- names |> distinct()

unique_fnames <- length(unique(names$fnames))

create_unique_ids <- function(n, seed_no = 16169, char_len = 6){
  set.seed(seed_no)
  pool <- c(letters, LETTERS, 0:9)
  
  res <- character(n)
  for(i in seq(n)){
    this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    while(this_res %in% res){
      this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    }
    res[i] <- this_res
  }
  res
}

unique_fname_id <- create_unique_ids(unique_fnames)


df_ids <- tibble(fnames = unique_names |> pull(fnames),unique_fname_id = unique_fname_id)


names |> 
  left_join(df_ids)
#> Joining, by = "fnames"
#>         fnames unique_fname_id
#> 1         joey          VvWMKt
#> 2         joey          VvWMKt
#> 3         joey          VvWMKt
#> 4        jimmy          yEbpFq
#> 5        jimmy          yEbpFq
#> 6        tommy          Z3xCdO
#> 7      michael          ef8YkZ
#> 8      michael          ef8YkZ
#> 9      michael          ef8YkZ
#> 10     michael          ef8YkZ
#> 11     michael          ef8YkZ
#> 12       kevin          kDBFAq
#> 13       kevin          kDBFAq
#> 14 christopher          xR77mJ
#> 15       aaron          gaaI1C
#> 16      joshua          KM4dD9
#> 17      joshua          KM4dD9
#> 18      joshua          KM4dD9
#> 19       arvid          oTLl7g
#> 20       aiden          b63PnV
#> 21  kentavious          csnWuE
#> 22    lawrence          Ihi5VM
#> 23      xavier          HfM0mX

Created on 2021-12-03 by the reprex package (v2.0.1)
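
For comparison, the same lookup can be sketched in base R without a join, using match() to index the ID vector. This is only a minimal sketch on a shortened version of the toy data; it assumes the IDs are generated in the order returned by unique(), and the generator is the one from the question, copied unchanged.

```r
# Shortened toy data (a subset of the names in the question)
fnames <- c("joey", "joey", "jimmy", "tommy", "jimmy")
names <- data.frame(fnames = fnames)

# Seeded generator from the question, unchanged
create_unique_ids <- function(n, seed_no = 16169, char_len = 6){
  set.seed(seed_no)
  pool <- c(letters, LETTERS, 0:9)
  res <- character(n)
  for(i in seq(n)){
    this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    while(this_res %in% res){
      this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    }
    res[i] <- this_res
  }
  res
}

unique_names <- unique(names$fnames)            # names in order of first appearance
ids <- create_unique_ids(length(unique_names))  # one ID per unique name

# match() gives, for every row, the position of its name in unique_names,
# so indexing ids with it repeats each ID for every duplicate of the name
names$unique_fname_id <- ids[match(names$fnames, unique_names)]
names
```

Because the seed is kept and the IDs are created in a single call, rerunning this on the same data in the same R version should give the same IDs, and the row order of names is left untouched.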

CodePudding user response:

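If you want to use your function as is and keep the seed, generate one ID per distinct name and then join the result back to the full dataframe:

library(dplyr)

names %>% 
  distinct(fnames) %>% 
  # one ID per unique name (13 for the toy data), all created in a single seeded call
  bind_cols(unique_ID = create_unique_ids(n_distinct(names$fnames))) %>% 
  left_join(names, by = "fnames")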

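You can also remove the seed (the set.seed(seed_no) line and the seed_no parameter) from your function and have a simpler solution. Note that the IDs are then no longer reproducible between runs, and because each group calls the function independently, collisions between names are very unlikely but not ruled out: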

names %>% 
  group_by(fnames) %>% 
  mutate(unique_ID = create_unique_ids(1))

   fnames  unique_ID
   <chr>   <chr>    
 1 joey    ea10KC   
 2 joey    ea10KC   
 3 joey    ea10KC   
 4 jimmy   MD5W4d   
 5 jimmy   MD5W4d   
 6 tommy   xR7ozW   
 7 michael uuGn3h   
 8 michael uuGn3h   
 9 michael uuGn3h   
10 michael uuGn3h   
# ... with 13 more rows
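
Why the seed has to be removed for the grouped approach can be illustrated with a small base R sketch: as long as set.seed() sits inside the function, every separate call with n = 1 restarts the random number generator at the same point and returns the same string. The generator is the one from the question, copied unchanged; the three names are just a subset of the toy data.

```r
# Seeded generator from the question, unchanged
create_unique_ids <- function(n, seed_no = 16169, char_len = 6){
  set.seed(seed_no)
  pool <- c(letters, LETTERS, 0:9)
  res <- character(n)
  for(i in seq(n)){
    this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    while(this_res %in% res){
      this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    }
    res[i] <- this_res
  }
  res
}

# One call per name, which is what a grouped mutate() would do
ids_per_name <- vapply(
  c("joey", "jimmy", "tommy"),
  function(nm) create_unique_ids(1),
  character(1)
)

# Every call reset the seed, so all three names received the same ID
length(unique(ids_per_name))  # 1
```

If reproducibility matters, creating all IDs in one seeded call and mapping them back to the rows avoids this problem.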

You can also use an existing function such as stringi::stri_rand_strings, which creates random alphanumeric strings of a fixed length:

library(stringi); library(dplyr)

names %>% 
  group_by(fnames) %>% 
  mutate(unique_ID = stri_rand_strings(1, 6))
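
Since the per-group variants draw each ID independently, it may be worth checking afterwards that names and IDs really map one to one. The following is only a sketch in base R; the dataframe df and its IDs are made up for illustration and stand in for the result of any of the approaches above.

```r
# Hypothetical result: the IDs here are invented, not generated
df <- data.frame(
  fnames    = c("joey", "joey", "jimmy", "tommy"),
  unique_ID = c("aaa111", "aaa111", "bbb222", "ccc333")
)

# Reduce to the distinct (name, ID) pairs
pairs <- unique(df[c("fnames", "unique_ID")])

# anyDuplicated() returns 0 when there are no duplicates
one_id_per_name <- !anyDuplicated(pairs$fnames)     # no name has two IDs
one_name_per_id <- !anyDuplicated(pairs$unique_ID)  # no ID is shared by two names

stopifnot(one_id_per_name, one_name_per_id)
```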