I have received a data frame for analysis. Each observation is a row, and there are 120 variables. Unfortunately, I have not received an observation ID variable that uniquely identifies each observation.
I was thinking I could concatenate all columns into a string and hash that string to obtain a unique ID.
How can I do this without specifying every variable, as I would have to with paste()? Or is there another solution?
The data can contain NA.
Here is the sample dataset:
structure(list(Class = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), levels = c("1st", "2nd",
"3rd", "Crew"), class = "factor"), Sex = structure(c(1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), levels = c("Male",
"Female"), class = "factor"), Age = structure(c(1L, NA, 1L, NA,
1L, NA, 1L, 1L, 2L, 2L, NA, 2L, 2L, 2L, 2L, NA, 1L, 1L, 1L, NA,
NA, 1L, 1L, 1L, NA, 2L, 2L, 2L, 2L, 2L, 2L, NA), levels = c("Child",
"Adult"), class = "factor"), Survived = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), levels = c("No",
"Yes"), class = "factor"), Freq = c(0, 0, 35, 0, 0, 0, 17, 0,
118, 154, 387, 670, 4, 13, 89, 3, 5, 11, 13, 0, 1, 13, 14, 0,
57, 14, 75, 192, 140, 80, 76, 20)), row.names = c(NA, -32L), class = "data.frame")
CodePudding user response:
Maybe you want to use the unique_identifier function from the udpipe package, which, according to its documentation, does the following:
Create a unique identifier for each combination of fields in a data frame. This unique identifier is unique for each combination of the elements of the fields. The generated identifier is like a primary key or a secondary key on a table. This is just a small wrapper around frank.
Here is a reproducible example:
df <- structure(list(Class = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), levels = c("1st", "2nd",
"3rd", "Crew"), class = "factor"), Sex = structure(c(1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), levels = c("Male",
"Female"), class = "factor"), Age = structure(c(1L, NA, 1L, NA,
1L, NA, 1L, 1L, 2L, 2L, NA, 2L, 2L, 2L, 2L, NA, 1L, 1L, 1L, NA,
NA, 1L, 1L, 1L, NA, 2L, 2L, 2L, 2L, 2L, 2L, NA), levels = c("Child",
"Adult"), class = "factor"), Survived = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), levels = c("No",
"Yes"), class = "factor"), Freq = c(0, 0, 35, 0, 0, 0, 17, 0,
118, 154, 387, 670, 4, 13, 89, 3, 5, 11, 13, 0, 1, 13, 14, 0,
57, 14, 75, 192, 140, 80, 76, 20)), row.names = c(NA, -32L), class = "data.frame")
library(udpipe)
#> Warning: package 'udpipe' was built under R version 4.1.2
df$ID <- unique_identifier(df, fields = colnames(df))
df
#>    Class    Sex   Age Survived Freq ID
#> 1    1st   Male Child       No    0  1
#> 2    2nd   Male  <NA>       No    0 12
#> 3    3rd   Male Child       No   35 17
#> 4   Crew   Male  <NA>       No    0 27
#> 5    1st Female Child       No    0  5
#> 6    2nd Female  <NA>       No    0 16
#> 7    3rd Female Child       No   17 21
#> 8   Crew Female Child       No    0 29
#> 9    1st   Male Adult       No  118  3
#> 10   2nd   Male Adult       No  154 10
#> 11   3rd   Male  <NA>       No  387 20
#> 12  Crew   Male Adult       No  670 25
#> 13   1st Female Adult       No    4  6
#> 14   2nd Female Adult       No   13 14
#> 15   3rd Female Adult       No   89 23
#> 16  Crew Female  <NA>       No    3 31
#> 17   1st   Male Child      Yes    5  2
#> 18   2nd   Male Child      Yes   11  9
#> 19   3rd   Male Child      Yes   13 18
#> 20  Crew   Male  <NA>      Yes    0 28
#> 21   1st Female  <NA>      Yes    1  8
#> 22   2nd Female Child      Yes   13 13
#> 23   3rd Female Child      Yes   14 22
#> 24  Crew Female Child      Yes    0 30
#> 25   1st   Male  <NA>      Yes   57  4
#> 26   2nd   Male Adult      Yes   14 11
#> 27   3rd   Male Adult      Yes   75 19
#> 28  Crew   Male Adult      Yes  192 26
#> 29   1st Female Adult      Yes  140  7
#> 30   2nd Female Adult      Yes   80 15
#> 31   3rd Female Adult      Yes   76 24
#> 32  Crew Female  <NA>      Yes   20 32
Created on 2022-07-24 by the reprex package (v2.0.1)
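One caveat that may be worth checking: since the identifier is described as unique per combination of field values, rows that are identical in every column should end up sharing an ID. Below is a minimal base-R sketch for detecting fully duplicated rows and, if a strictly per-row ID is needed, falling back to a plain row counter. The data frame toy and the column name row_id are made up for illustration and are not part of the original data.

```r
# Hypothetical toy data: rows 1 and 3 are identical in every column
toy <- data.frame(
  Class = c("1st", "2nd", "1st"),
  Age   = c("Child", NA, "Child"),
  Freq  = c(0, 5, 0)
)

# anyDuplicated() returns the index of the first repeated row, or 0 if none
has_dups <- anyDuplicated(toy) > 0

# A content-independent ID: simply number the rows
toy$row_id <- seq_len(nrow(toy))
```

If has_dups is FALSE for the real data, a combination-based ID and a row counter both identify each observation uniquely; if it is TRUE, only the row counter does.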
CodePudding user response:
Another option is to paste all columns together with Reduce, convert the result to a factor, and then call unclass on it, which turns the factor into its underlying integer codes.
df$ID <- c(unclass(as.factor(Reduce(paste, df))))
Output
   Class    Sex   Age Survived Freq ID
1    1st   Male Child       No    0  6
2    2nd   Male  <NA>       No    0 16
3    3rd   Male Child       No   35 22
4   Crew   Male  <NA>       No    0 31
5    1st Female Child       No    0  3
6    2nd Female  <NA>       No    0 12
7    3rd Female Child       No   17 19
8   Crew Female Child       No    0 25
9    1st   Male Adult       No  118  5
10   2nd   Male Adult       No  154 13
11   3rd   Male  <NA>       No  387 24
12  Crew   Male Adult       No  670 29
13   1st Female Adult       No    4  1
14   2nd Female Adult       No   13  9
15   3rd Female Adult       No   89 17
16  Crew Female  <NA>       No    3 27
17   1st   Male Child      Yes    5  7
18   2nd   Male Child      Yes   11 15
19   3rd   Male Child      Yes   13 23
20  Crew   Male  <NA>      Yes    0 32
21   1st Female  <NA>      Yes    1  4
22   2nd Female Child      Yes   13 11
23   3rd Female Child      Yes   14 20
24  Crew Female Child      Yes    0 26
25   1st   Male  <NA>      Yes   57  8
26   2nd   Male Adult      Yes   14 14
27   3rd   Male Adult      Yes   75 19
28  Crew   Male Adult      Yes  192 30
29   1st Female Adult      Yes  140  2
30   2nd Female Adult      Yes   80 10
31   3rd Female Adult      Yes   76 18
32  Crew Female  <NA>      Yes   20 28
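A possible pitfall with pasting: paste separates values with a space by default, so two different rows can collapse to the same string when the values themselves contain spaces. The sketch below, on a made-up data frame toy, shows the collision and one way around it, assuming the control character "\x1f" never occurs in the data. It also numbers the keys in order of first appearance with match instead of going through a factor. Note that NA is still pasted as the text "NA", so a missing value and a literal "NA" string would not be distinguished.

```r
# Hypothetical toy data: rows 1 and 3 are equal, row 2 differs
toy <- data.frame(
  x = c("a b", "a", "a b"),
  y = c("c", "b c", "c"),
  stringsAsFactors = FALSE
)

# Default space separator: all three rows become "a b c"
naive_key <- Reduce(paste, toy)

# Paste every column without naming it, using a separator that is
# assumed not to appear in the data (ASCII unit separator)
safe_key <- do.call(paste, c(unname(as.list(toy)), sep = "\x1f"))

# Integer ID per distinct key, numbered by first appearance
toy$ID <- match(safe_key, unique(safe_key))
```

If a hash string is preferred over an integer, each element of safe_key could be passed through a hashing function, for example digest from the digest package; that would be an extra dependency and is not shown here.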