Home > database >  What syntax to check if values in one data is also listed in another data frame?
What syntax to check if values in one data is also listed in another data frame?

Time:10-22

I have 2 dataframes: wd_patients and wd_appointments.

How do I check if all patients in wd_patients are also listed in wd_appointments and vice versa. And then how can I update these two files such that every patient in one dataset also exists in the other one as well?

Please see the structures of my dataframes below:

 str(wd_patients)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   990 obs. of  5 variables:
 $ ID                : chr  "QJSkgvEFvP" "UwSKF6tmaG" "TgO3Lo9cR6" "DFDaFnBxU9" ...
 $ Age               : num  63 64 60 76 66 68 69 87 74 85 ...
 $ Gender            : chr  "Male" "Female" "Female" "Female" ...
 $ AppointmentCount  : num  3 3 3 3 5 2 2 2 5 2 ...
 $ TreatmentCompleted: chr  "Yes" "No" "No" "Yes" ...

 str(wd_appointments)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   3473 obs. of  3 variables:
 $ patientID : chr  "IEBQggWteS" "nCOytLO2tK" "Kc7D1GZhqr" "GTyawkFnQn" ...
 $ apptDate  : POSIXct, format: "2022-06-01" "2022-06-01" "2022-06-01" "2022-06-01" ...
 $ apptStatus: chr  "No Show" "No Show" "Cancelled" "Cancelled" ...

CodePudding user response:

You can start by finding patient ids that are in wd_patients and not wd_appointments and vice versa...

ids1 <- setdiff(wd_patients$ID, wd_appointments$patientID) # ids in wd_patients but not wd_appointments
ids2 <- setdiff(wd_appointments$patientID, wd_patients$ID) # ids in wd_appointments but not wd_patients

After that you can decide how you want to synchronize the two data.frames

CodePudding user response:

Using the base R you can use 'merge' function and you obtain only those observations which are existing in both datasets.

mergedData <- merge(wd_patients, wd_appointments , by.x=c('ID'), by.y=c('patientID'))

If you don't want join other variables from the second data set, then you can use the following code:

mergedData <- merge(wd_patients, wd_appointments[,c('patientID')] , by.x=c('ID'), by.y=c('patientID'))

CodePudding user response:

You can do a left_join:

library(dplyr)
left_join(wd_patients, wd_appointments, by = c("ID", "patientID"))

This will join wd_appointments to wd_patients where there is a shared ID.

EDIT:

If you want to retain only the rows with matches in both data frames, use inner_join.

  • Related