I have a medical dataset of approximately 10,000 patients. I want to replace their IDs/Social Security Numbers (Patient_SSN) with a unique ID for each patient. Please note that some rows have the same Patient_SSN because the data is stored at visit level. In other words, each visit is stored in a new row (i.e. with a different date), such as the rows for 'Mary' and 'John'.
Patient_Name = c("Alex", "Mary", "Sarah", "John", "Susan", "Jessica", "Sarah", "Karen", "Mary", "John")
Patient_SSN = c(1234, 43251, 9320, 2901, 3229, 4291, 9320, 9218988, 43251, 2901)
Visit_Date = c('10_21', '10_21', '10_25', '10_25', '10_26', '10_27', '10_28', '10_28', '10_28', '10_29')
BMI = runif(10, min = 12, max = 25)
data_hospital = data.frame(Patient_Name, Patient_SSN, BMI, Visit_Date)
My question is: how can I replace each SSN with a new ID for participant privacy, keeping in mind that some rows have the same SSN? The new IDs should have the same number of characters as the original Patient_SSN values. Thank you in advance for your assistance.
CodePudding user response:
dplyr has a function for that! Check out ?group_data:
library(dplyr)
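# Group first: passing columns directly to group_indices() is deprecated since dplyr 1.0.0
data_hospital$newid <- data_hospital %>% group_by(Patient_SSN) %>% group_indices()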
   Patient_Name Patient_SSN      BMI Visit_Date newid
1          Alex        1234 21.70192      10_21     1
2          Mary       43251 18.75820      10_21     6
3         Sarah        9320 22.84921      10_25     5
4          John        2901 19.94831      10_25     2
5         Susan        3229 20.27007      10_26     3
6       Jessica        4291 14.39934      10_27     4
7         Sarah        9320 16.65728      10_28     5
8         Karen     9218988 17.99142      10_28     7
9          Mary       43251 20.71236      10_28     6
10         John        2901 12.67764      10_29     2
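For readers who want to see what the grouping index amounts to, here is a minimal base R sketch that reproduces the newid column above without dplyr. It assumes the index is simply each SSN's position among the sorted distinct SSNs, which matches the output shown for this example data.

```r
Patient_SSN <- c(1234, 43251, 9320, 2901, 3229, 4291, 9320, 9218988, 43251, 2901)

# Position of each SSN among the sorted distinct SSNs;
# rows that share an SSN therefore share an index.
newid <- match(Patient_SSN, sort(unique(Patient_SSN)))
newid
# 1 6 5 2 3 4 5 7 6 2
```

Because the index follows the sort order of the SSNs, it reveals their relative ordering and does not preserve their length, so it may not be enough on its own if the IDs must look like the originals.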
CodePudding user response:
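One way to do it, if you want the length of Patient_SSN to be kept, would be to generate a random number between 0 and 1 and multiply it by 10^(length_of_number). Draw one number per distinct SSN rather than per row, so that repeat visits by the same patient get the same new ID.
This won't guarantee unique IDs, so you would need to check for duplicates and generate new numbers if there are any. Collisions are unlikely for long IDs but become more likely for short ones. Note also that a small draw rounds to a number with fewer digits, so the length is only matched approximately.
library(dplyr)
# One random draw per distinct SSN, so every visit by the same patient gets the same new_id
id_map <- data_hospital %>%
  distinct(Patient_SSN) %>%
  mutate(id_length = nchar(Patient_SSN),
         random_number = runif(n = n(), min = 0, max = 1),
         new_id = round(random_number * 10^id_length))
data_hospital <- data_hospital %>% left_join(id_map, by = "Patient_SSN")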
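If uniqueness and an exact digit count both matter, one possible alternative is to sample without replacement from the range of numbers that have the required number of digits. The following is a minimal base R sketch under that assumption, not a definitive implementation; `make_id_map` is a hypothetical helper name, and the SSNs are assumed to be positive whole numbers as in the example data.

```r
# Example data from the question (BMI and Visit_Date omitted; they play no role here)
data_hospital <- data.frame(
  Patient_Name = c("Alex", "Mary", "Sarah", "John", "Susan",
                   "Jessica", "Sarah", "Karen", "Mary", "John"),
  Patient_SSN  = c(1234, 43251, 9320, 2901, 3229, 4291, 9320, 9218988, 43251, 2901)
)

# Hypothetical helper (not from any package): one random ID per distinct SSN,
# with the same number of digits as the SSN it replaces.
make_id_map <- function(ssn) {
  ssn_unique <- unique(ssn)
  n_digits <- nchar(format(ssn_unique, scientific = FALSE, trim = TRUE))
  new_id <- numeric(length(ssn_unique))
  for (d in unique(n_digits)) {
    idx <- which(n_digits == d)
    # d-digit numbers run from 10^(d-1) to 10^d - 1: 9 * 10^(d-1) candidates.
    # sample.int() draws without replacement, so IDs of equal length never repeat.
    new_id[idx] <- sample.int(9 * 10^(d - 1), length(idx)) + 10^(d - 1) - 1
  }
  data.frame(Patient_SSN = ssn_unique, new_id = new_id)
}

set.seed(42)  # only to make the sketch reproducible
id_map <- make_id_map(data_hospital$Patient_SSN)

# Look up each row's new ID, then drop the real SSN
data_hospital$new_id <- id_map$new_id[match(data_hospital$Patient_SSN, id_map$Patient_SSN)]
data_hospital$Patient_SSN <- NULL
```

IDs of different lengths cannot collide with each other, so sampling within each length group is enough to keep them unique overall. `sample.int()` stops with an error if there are more patients of a given length than available numbers. The `id_map` table links new IDs back to real SSNs, so it would need to be stored securely or discarded, depending on whether re-identification is ever required.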