How to generate IDs for each participant in R

Time:10-14

I have medical data for approximately 10,000 patients. I want to replace their IDs/Social Security Numbers (Patient_SSN) with a unique ID for each patient. Please note that some rows have the same patient SSN because the data is stored at visit level. In other words, each visit is stored in a new row (i.e. with a different date), as with the 'Mary' and 'John' rows below.

Patient_Name = c("Alex", "Mary", "Sarah", "John", "Susan", "Jessica", "Sarah", "Karen", "Mary", "John")
Patient_SSN  =  c(1234,    43251,    9320,    2901,  3229,     4291,     9320,    9218988,    43251 ,  2901)
Visit_Date   =  c('10_21', '10_21',  '10_25', '10_25','10_26','10_27','10_28','10_28','10_28' ,'10_29')
BMI = runif(10, min = 12, max = 25)

data_hospital = data.frame(Patient_Name, Patient_SSN, BMI, Visit_Date)

My question is: how can I replace each SSN with a new ID for participant privacy, keeping in mind that some rows have the same SSN? The new SSNs/IDs should have the same number of characters as the original Patient_SSN values. Thank you in advance for your assistance.

CodePudding user response:

dplyr has a function for that! Check out ?group_data:

library(dplyr)
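# Group first: passing columns directly to group_indices() is deprecated since dplyr 1.0.0
data_hospital$newid <- data_hospital %>% group_by(Patient_SSN) %>% group_indices()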

   Patient_Name Patient_SSN      BMI Visit_Date newid
1          Alex        1234 21.70192      10_21     1
2          Mary       43251 18.75820      10_21     6
3         Sarah        9320 22.84921      10_25     5
4          John        2901 19.94831      10_25     2
5         Susan        3229 20.27007      10_26     3
6       Jessica        4291 14.39934      10_27     4
7         Sarah        9320 16.65728      10_28     5
8         Karen     9218988 17.99142      10_28     7
9          Mary       43251 20.71236      10_28     6
10         John        2901 12.67764      10_29     2
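
If dplyr is not available, the same kind of per-patient index can be sketched in base R. This is only an illustrative alternative, not part of the answer above; the names newid and newid_first_seen are made up for the example. Note that neither version preserves the digit count of the original SSN.

```r
# SSNs from the question; repeat visits share an SSN
Patient_SSN <- c(1234, 43251, 9320, 2901, 3229, 4291, 9320, 9218988, 43251, 2901)

# factor() sorts the distinct SSNs; as.integer() returns each row's level index,
# so rows with the same SSN get the same number
newid <- as.integer(factor(Patient_SSN))

# Alternative: number patients in order of first appearance instead
newid_first_seen <- match(Patient_SSN, unique(Patient_SSN))
```

The first form numbers patients by sorted SSN, which matches the newid column shown in the table; the second avoids any link between the size of the SSN and the new number.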

CodePudding user response:

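One way to do it, if you want the length of Patient_SSN to be kept, is to draw one random number per distinct SSN from the range 10^(length - 1) to 10^length, so that it has the same number of digits as the SSN, and then join it back to the visit rows so that repeat visits share an ID.

This doesn't guarantee that the IDs are unique, so you would need to check for duplicates and generate new numbers if there are any. Collisions are unlikely when the IDs are long, but become more likely with short IDs or many patients.

library(dplyr)
# Draw one ID per distinct SSN (not per row) so repeat visits share an ID
id_map <- data_hospital %>% distinct(Patient_SSN) %>% mutate(id_length = nchar(Patient_SSN))
# Sample from [10^(n-1), 10^n) so the new ID has exactly id_length digits
id_map$new_id <- floor(runif(n = nrow(id_map), min = 10^(id_map$id_length - 1), max = 10^id_map$id_length))
data_hospital <- data_hospital %>% left_join(id_map, by = "Patient_SSN")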
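
The duplicate check mentioned above could look roughly like the following base R sketch. make_ids() is a hypothetical helper written for this example, not a function from any package. It assumes the SSNs are stored as numbers (as in the question) and that there are enough free values of each length to give every patient a distinct ID; otherwise the loop would never finish.

```r
# SSNs from the question; repeat visits share an SSN
Patient_SSN <- c(1234, 43251, 9320, 2901, 3229, 4291, 9320, 9218988, 43251, 2901)

# make_ids() is a hypothetical helper: it draws one random ID per distinct SSN,
# with the same number of digits as that SSN, and redraws any collisions
make_ids <- function(ssn) {
  key <- unique(ssn)
  # sprintf("%.0f") avoids scientific notation when counting digits
  n_digits <- nchar(sprintf("%.0f", key))
  lo <- 10^(n_digits - 1)
  hi <- 10^n_digits
  ids <- floor(runif(length(key), min = lo, max = hi))
  # Redraw only the colliding IDs until all of them are distinct
  while (anyDuplicated(ids) > 0) {
    dup <- duplicated(ids)
    ids[dup] <- floor(runif(sum(dup), min = lo[dup], max = hi[dup]))
  }
  # Map the per-patient IDs back onto the visit rows
  ids[match(ssn, key)]
}

set.seed(1)  # only to make the sketch reproducible
new_id <- make_ids(Patient_SSN)
```

Because the IDs are drawn per distinct SSN and then mapped back with match(), rows for the same patient always receive the same ID. It is probably worth keeping the SSN-to-ID lookup in a separate, secured file if the mapping ever needs to be reversed.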