What syntax to check if values in one data is also listed in another data frame?-CodePudding

I have 2 dataframes: wd_patients and wd_appointments.

How do I check if all patients in wd_patients are also listed in wd_appointments and vice versa. And then how can I update these two files such that every patient in one dataset also exists in the other one as well?

Please see the structures of my dataframes below:

 str(wd_patients)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   990 obs. of  5 variables:
 $ ID                : chr  "QJSkgvEFvP" "UwSKF6tmaG" "TgO3Lo9cR6" "DFDaFnBxU9" ...
 $ Age               : num  63 64 60 76 66 68 69 87 74 85 ...
 $ Gender            : chr  "Male" "Female" "Female" "Female" ...
 $ AppointmentCount  : num  3 3 3 3 5 2 2 2 5 2 ...
 $ TreatmentCompleted: chr  "Yes" "No" "No" "Yes" ...

 str(wd_appointments)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   3473 obs. of  3 variables:
 $ patientID : chr  "IEBQggWteS" "nCOytLO2tK" "Kc7D1GZhqr" "GTyawkFnQn" ...
 $ apptDate  : POSIXct, format: "2022-06-01" "2022-06-01" "2022-06-01" "2022-06-01" ...
 $ apptStatus: chr  "No Show" "No Show" "Cancelled" "Cancelled" ...

CodePudding user response：

You can start by finding patient ids that are in wd_patients and not wd_appointments and vice versa...

ids1 <- setdiff(wd_patients$ID, wd_appointments$patientID) # ids in wd_patients but not wd_appointments
ids2 <- setdiff(wd_appointments$patientID, wd_patients$ID) # ids in wd_appointments but not wd_patients

After that you can decide how you want to synchronize the two data.frames

CodePudding user response：

Using the base R you can use 'merge' function and you obtain only those observations which are existing in both datasets.

mergedData <- merge(wd_patients, wd_appointments , by.x=c('ID'), by.y=c('patientID'))

If you don't want join other variables from the second data set, then you can use the following code:

mergedData <- merge(wd_patients, wd_appointments[,c('patientID')] , by.x=c('ID'), by.y=c('patientID'))

CodePudding user response：

You can do a left_join:

library(dplyr)
left_join(wd_patients, wd_appointments, by = c("ID", "patientID"))

This will join wd_appointments to wd_patients where there is a shared ID.

EDIT:

If you want to retain only the rows with matches in both data frames, use inner_join.