How to create a unique observation ID using hash functions?

I have received a data frame for analysis. Each observation is a row, with 120 variables. Unfortunately, I have not received an observation ID variable that uniquely identifies each observation. I was thinking I could concatenate all columns into a string and hash this string to obtain a unique ID. How can I do this without specifying every variable, as I would have to with paste()? Or is there another solution?

The data can contain NA values.

Here is a sample dataset:

structure(list(Class = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), levels = c("1st", "2nd", 
"3rd", "Crew"), class = "factor"), Sex = structure(c(1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), levels = c("Male", 
"Female"), class = "factor"), Age = structure(c(1L, NA, 1L, NA, 
1L, NA, 1L, 1L, 2L, 2L, NA, 2L, 2L, 2L, 2L, NA, 1L, 1L, 1L, NA, 
NA, 1L, 1L, 1L, NA, 2L, 2L, 2L, 2L, 2L, 2L, NA), levels = c("Child", 
"Adult"), class = "factor"), Survived = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), levels = c("No", 
"Yes"), class = "factor"), Freq = c(0, 0, 35, 0, 0, 0, 17, 0, 
118, 154, 387, 670, 4, 13, 89, 3, 5, 11, 13, 0, 1, 13, 14, 0, 
57, 14, 75, 192, 140, 80, 76, 20)), row.names = c(NA, -32L), class = "data.frame")

CodePudding user response：

You could use the unique_identifier function from the udpipe package. Its documentation describes it as follows:

Create a unique identifier for each combination of fields in a data frame. This unique identifier is unique for each combination of the elements of the fields. The generated identifier is like a primary key or a secondary key on a table. This is just a small wrapper around frank

Here is a reproducible example:

df <- structure(list(Class = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), levels = c("1st", "2nd",
"3rd", "Crew"), class = "factor"), Sex = structure(c(1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), levels = c("Male",
"Female"), class = "factor"), Age = structure(c(1L, NA, 1L, NA,
1L, NA, 1L, 1L, 2L, 2L, NA, 2L, 2L, 2L, 2L, NA, 1L, 1L, 1L, NA,
NA, 1L, 1L, 1L, NA, 2L, 2L, 2L, 2L, 2L, 2L, NA), levels = c("Child",
"Adult"), class = "factor"), Survived = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), levels = c("No",
"Yes"), class = "factor"), Freq = c(0, 0, 35, 0, 0, 0, 17, 0,
118, 154, 387, 670, 4, 13, 89, 3, 5, 11, 13, 0, 1, 13, 14, 0,
57, 14, 75, 192, 140, 80, 76, 20)), row.names = c(NA, -32L), class = "data.frame")


library(udpipe)
#> Warning: package 'udpipe' was built under R version 4.1.2
df$ID <- unique_identifier(df, fields = colnames(df))
df
#>    Class    Sex   Age Survived Freq ID
#> 1    1st   Male Child       No    0  1
#> 2    2nd   Male  <NA>       No    0 12
#> 3    3rd   Male Child       No   35 17
#> 4   Crew   Male  <NA>       No    0 27
#> 5    1st Female Child       No    0  5
#> 6    2nd Female  <NA>       No    0 16
#> 7    3rd Female Child       No   17 21
#> 8   Crew Female Child       No    0 29
#> 9    1st   Male Adult       No  118  3
#> 10   2nd   Male Adult       No  154 10
#> 11   3rd   Male  <NA>       No  387 20
#> 12  Crew   Male Adult       No  670 25
#> 13   1st Female Adult       No    4  6
#> 14   2nd Female Adult       No   13 14
#> 15   3rd Female Adult       No   89 23
#> 16  Crew Female  <NA>       No    3 31
#> 17   1st   Male Child      Yes    5  2
#> 18   2nd   Male Child      Yes   11  9
#> 19   3rd   Male Child      Yes   13 18
#> 20  Crew   Male  <NA>      Yes    0 28
#> 21   1st Female  <NA>      Yes    1  8
#> 22   2nd Female Child      Yes   13 13
#> 23   3rd Female Child      Yes   14 22
#> 24  Crew Female Child      Yes    0 30
#> 25   1st   Male  <NA>      Yes   57  4
#> 26   2nd   Male Adult      Yes   14 11
#> 27   3rd   Male Adult      Yes   75 19
#> 28  Crew   Male Adult      Yes  192 26
#> 29   1st Female Adult      Yes  140  7
#> 30   2nd Female Adult      Yes   80 15
#> 31   3rd Female Adult      Yes   76 24
#> 32  Crew Female  <NA>      Yes   20 32

Created on 2022-07-24 by the reprex package (v2.0.1)
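
The call above does not hash anything. If a hash-based ID is preferred, as the question suggests, the following is a minimal sketch that assumes the digest package (not part of base R) is installed. The data frame toy and the column name hash_id are made up for illustration, and "\r" is an arbitrary separator that is assumed not to occur in the data.

```r
# Small stand-in for the real data (hypothetical values), including an NA.
toy <- data.frame(
  Class = factor(c("1st", "2nd", "1st")),
  Age   = factor(c("Child", NA, "Child")),
  Freq  = c(0, 0, 0)
)

# Build one string per row from all columns without naming any of them.
# NA values become the string "NA"; the separator keeps column boundaries
# visible so that "ab" + "c" and "a" + "bc" do not collapse to the same key.
row_key <- do.call(paste, c(toy, sep = "\r"))

# Hash each row string if digest is available (assumption: it is installed).
# serialize = FALSE hashes the characters themselves, not the R object.
if (requireNamespace("digest", quietly = TRUE)) {
  toy$hash_id <- vapply(row_key, digest::digest, character(1),
                        algo = "md5", serialize = FALSE, USE.NAMES = FALSE)
}
```

Rows 1 and 3 of toy are identical, so they receive the same hash. Like unique_identifier, this identifies combinations of values, not individual rows.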

CodePudding user response：

Another option is to paste all columns together with Reduce, convert the resulting strings to a factor, and call unclass on it, which returns the factor's underlying integer codes. Rows with identical values get the same ID.

df$ID <- c(unclass(as.factor(Reduce(paste, df))))

Output

   Class    Sex   Age Survived Freq ID
1    1st   Male Child       No    0  6
2    2nd   Male  <NA>       No    0 16
3    3rd   Male Child       No   35 22
4   Crew   Male  <NA>       No    0 31
5    1st Female Child       No    0  3
6    2nd Female  <NA>       No    0 12
7    3rd Female Child       No   17 19
8   Crew Female Child       No    0 25
9    1st   Male Adult       No  118  5
10   2nd   Male Adult       No  154 13
11   3rd   Male  <NA>       No  387 24
12  Crew   Male Adult       No  670 29
13   1st Female Adult       No    4  1
14   2nd Female Adult       No   13  9
15   3rd Female Adult       No   89 17
16  Crew Female  <NA>       No    3 27
17   1st   Male Child      Yes    5  7
18   2nd   Male Child      Yes   11 15
19   3rd   Male Child      Yes   13 23
20  Crew   Male  <NA>      Yes    0 32
21   1st Female  <NA>      Yes    1  4
22   2nd Female Child      Yes   13 11
23   3rd Female Child      Yes   14 20
24  Crew Female Child      Yes    0 26
25   1st   Male  <NA>      Yes   57  8
26   2nd   Male Adult      Yes   14 14
27   3rd   Male Adult      Yes   75 21
28  Crew   Male Adult      Yes  192 30
29   1st Female Adult      Yes  140  2
30   2nd Female Adult      Yes   80 10
31   3rd Female Adult      Yes   76 18
32  Crew Female  <NA>      Yes   20 28
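
The same idea can be written without Reduce or factors. The following is a sketch in base R on a small made-up data frame (toy, not the data above). It numbers the IDs by first appearance instead of by sorted key, checks whether any rows are duplicated, and shows a plain row counter for the case where every row needs its own ID. The separator "\r" is an assumption: any string that does not occur in the data would do.

```r
# Hypothetical stand-in data; rows 1 and 3 are identical on purpose.
toy <- data.frame(
  Class = factor(c("Crew", "1st", "Crew")),
  Age   = factor(c("Adult", NA, "Adult")),
  Freq  = c(3, 0, 3)
)

# One key string per row, built from all columns without naming them.
key <- do.call(paste, c(toy, sep = "\r"))

# 0 if no row is duplicated, otherwise the index of the first duplicated row.
first_dup <- anyDuplicated(toy)

# ID per distinct combination of values, numbered in order of first appearance.
toy$ID <- match(key, unique(key))

# If identical rows must still be told apart, a simple counter is enough.
toy$row_id <- seq_len(nrow(toy))
```

Here toy$ID is 1, 2, 1 and first_dup is 3, which signals that a combination-based ID cannot distinguish every row of toy.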