I have a column in a dataframe that is a character vector. I would like to add to my dataframe a column containing unique ID values/codes corresponding to each unique value in said column. Here is some toy data:
fnames <- c("joey", "joey", "joey", "jimmy", "jimmy", "tommy", "michael", "michael", "michael", "michael", "michael", "kevin", "kevin", "christopher", "aaron", "joshua", "joshua", "joshua", "arvid", "aiden", "kentavious", "lawrence", "xavier")
names <- as.data.frame(fnames)
To get the number of unique values of fnames, I run:
unique_fnames <- length(unique(names$fnames))
To generate unique IDs for each unique name, I found the following function:
create_unique_ids <- function(n, seed_no = 16169, char_len = 6){
  set.seed(seed_no)
  pool <- c(letters, LETTERS, 0:9)
  res <- character(n)
  for(i in seq(n)){
    this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    while(this_res %in% res){
      this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    }
    res[i] <- this_res
  }
  res
}
Applying create_unique_ids to unique_fnames, I get the desired number of ID codes:
unique_fname_id <- create_unique_ids(unique_fnames)
My question is this: how do I add the vector unique_fname_id to my dataframe names? The desired result is a dataframe names with a unique_fname_id column that looks something like this:
unique_fname_id <- c("VvWMKt", "VvWMKt", "VvWMKt", "yEbpFq", "yEbpFq", "Z3xCdO"...)
where "VvWMKt" corresponds to "joey", "yEbpFq" corresponds to "jimmy", and so on. The dataframe names would have the same number of rows as the original, just with this added column.
Is there a way to do this? All suggestions are welcome and appreciated. Thanks!
Edit: I need to keep the set.seed call in the create_unique_ids function so that the generated IDs are reproducible from run to run.
CodePudding user response:
A crude approach is to build a lookup table with one row per unique name and its ID, then left join it back onto the original dataframe:
library(tidyverse)
fnames <- c("joey", "joey", "joey", "jimmy", "jimmy", "tommy", "michael", "michael", "michael", "michael", "michael", "kevin", "kevin", "christopher", "aaron", "joshua", "joshua", "joshua", "arvid", "aiden", "kentavious", "lawrence", "xavier")
names <- as.data.frame(fnames)
unique_names <- names |> distinct()
unique_fnames <- length(unique(names$fnames))
create_unique_ids <- function(n, seed_no = 16169, char_len = 6){
  set.seed(seed_no)
  pool <- c(letters, LETTERS, 0:9)
  res <- character(n)
  for(i in seq(n)){
    this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    while(this_res %in% res){
      this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    }
    res[i] <- this_res
  }
  res
}
unique_fname_id <- create_unique_ids(unique_fnames)
df_ids <- tibble(fnames = unique_names |> pull(fnames), unique_fname_id = unique_fname_id)
names |>
  left_join(df_ids)
#> Joining, by = "fnames"
#>         fnames unique_fname_id
#> 1         joey          VvWMKt
#> 2         joey          VvWMKt
#> 3         joey          VvWMKt
#> 4        jimmy          yEbpFq
#> 5        jimmy          yEbpFq
#> 6        tommy          Z3xCdO
#> 7      michael          ef8YkZ
#> 8      michael          ef8YkZ
#> 9      michael          ef8YkZ
#> 10     michael          ef8YkZ
#> 11     michael          ef8YkZ
#> 12       kevin          kDBFAq
#> 13       kevin          kDBFAq
#> 14 christopher          xR77mJ
#> 15       aaron          gaaI1C
#> 16      joshua          KM4dD9
#> 17      joshua          KM4dD9
#> 18      joshua          KM4dD9
#> 19       arvid          oTLl7g
#> 20       aiden          b63PnV
#> 21  kentavious          csnWuE
#> 22    lawrence          Ihi5VM
#> 23      xavier          HfM0mX
Created on 2021-12-03 by the reprex package (v2.0.1)
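The same lookup can also be written without a join. The following is a minimal base R sketch, assuming create_unique_ids from the question is used unchanged; it relies on match() to find, for each row, the position of its name among the unique names. The toy vector is shortened here for brevity.

```r
# Function from the question, unchanged (seed kept for reproducibility)
create_unique_ids <- function(n, seed_no = 16169, char_len = 6) {
  set.seed(seed_no)
  pool <- c(letters, LETTERS, 0:9)
  res <- character(n)
  for (i in seq(n)) {
    this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    while (this_res %in% res) {
      this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    }
    res[i] <- this_res
  }
  res
}

# Shortened toy data (assumption: same structure as the question's data)
fnames <- c("joey", "joey", "joey", "jimmy", "jimmy", "tommy", "michael", "michael")
names <- data.frame(fnames)

# One ID per unique name, in order of first appearance
unique_names <- unique(names$fnames)
unique_fname_id <- create_unique_ids(length(unique_names))

# match() returns, for every row, the index of its name in unique_names,
# which is then used to pick the corresponding ID
names$unique_fname_id <- unique_fname_id[match(names$fnames, unique_names)]
```

Because the IDs are generated in a single call, the seed inside the function is applied once and the duplicate check covers all names together.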
CodePudding user response:
If you want to use your function and keep the seed, you can do:
names %>%
  distinct(fnames) %>%
  # n() is the number of distinct names, so the count is not hard-coded
  mutate(unique_ID = create_unique_ids(n())) %>%
  left_join(names, by = "fnames")
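You can also remove the seed (the set.seed(seed_no) line and the seed_no parameter) from your function and have a simpler solution. With the seed left in, every group would get the same ID; without it, the output is only reproducible if you call set.seed() once before the pipeline: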
names %>%
  group_by(fnames) %>%
  mutate(unique_ID = create_unique_ids(1))
   fnames  unique_ID
   <chr>   <chr>
 1 joey    ea10KC
 2 joey    ea10KC
 3 joey    ea10KC
 4 jimmy   MD5W4d
 5 jimmy   MD5W4d
 6 tommy   xR7ozW
 7 michael uuGn3h
 8 michael uuGn3h
 9 michael uuGn3h
10 michael uuGn3h
# ... with 13 more rows
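To illustrate why the seed has to come out of the function for the per-group version, here is a small base R sketch. The helpers id_with_seed and id_no_seed are hypothetical names introduced for this illustration; they reduce the question's function to a single draw, with and without the set.seed line.

```r
# Single draw with the seed reset inside the function (as in the question)
id_with_seed <- function(seed_no = 16169, char_len = 6) {
  set.seed(seed_no)
  paste0(sample(c(letters, LETTERS, 0:9), char_len, replace = TRUE), collapse = "")
}

# Same draw with the set.seed line removed (hypothetical helper)
id_no_seed <- function(char_len = 6) {
  paste0(sample(c(letters, LETTERS, 0:9), char_len, replace = TRUE), collapse = "")
}

groups <- c("joey", "jimmy", "tommy")

# Seed reset on every call: each group receives the identical ID
seeded <- vapply(groups, function(g) id_with_seed(), character(1))
length(unique(seeded))  # 1

# Seed set once, outside the function: draws are independent per group,
# and repeating the whole run reproduces the same IDs
set.seed(16169)
run1 <- vapply(groups, function(g) id_no_seed(), character(1))
set.seed(16169)
run2 <- vapply(groups, function(g) id_no_seed(), character(1))
identical(run1, run2)  # TRUE
```

Note that independent per-group draws are not checked against each other, so a collision between two groups is unlikely with 6 characters but not ruled out.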
You can also use a ready-made function like stringi::stri_rand_strings, which creates random alphanumeric strings with a fixed number of characters:
library(stringi); library(dplyr)
names %>%
  group_by(fnames) %>%
  mutate(unique_ID = stri_rand_strings(1, 6))
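Calling stri_rand_strings(1, 6) once per group does not check for collisions between groups. One possible way around that, sketched below under the assumption that stringi is installed, is to draw all IDs in a single call, redraw if any repeat, and map them back with match(). The names keys and ids are illustrative only.

```r
library(stringi)

# Shortened toy data (assumption: same structure as the question's data)
fnames <- c("joey", "joey", "jimmy", "tommy", "tommy", "michael")
names <- data.frame(fnames)

keys <- unique(names$fnames)

# Draw one 6-character alphanumeric string per unique name;
# redraw the whole batch in the unlikely event of a duplicate
repeat {
  ids <- stri_rand_strings(length(keys), 6)
  if (anyDuplicated(ids) == 0) break
}

# Look up each row's ID by the position of its name in keys
names$unique_ID <- ids[match(names$fnames, keys)]
```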