Comparing two datasets: show the number of complete rows and, if they differ, show the subject_ids

I have two datasets in which the column edta_complete is 0 for incomplete and 1 for complete. I am trying to compare this column between df and df1: 1) I need to compare the counts of subject_ids with complete edta in both datasets. 2) If one dataset has more complete entries than the other, I need to show the subject_ids that differ. Please see the example below:

df:

df <- structure(list(subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-2365"),
                     edta_complete = c("1", "0", "1", "1", "1", "0")),
                class = "data.frame", row.names = c(NA, -6L))

df1:

df1 <- structure(list(subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-2365"),
                      edta_complete = c("1", "1", "1", "1", "1", "1")),
                 class = "data.frame", row.names = c(NA, -6L))

Counts of edta_complete = 1

library(dplyr)
df %>% filter(edta_complete == 1) %>% nrow()
[1] 4

df1 %>% filter(edta_complete == 1) %>% nrow()
[1] 6
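
The two counts can also be computed side by side without dplyr. This is only a minimal base R sketch: it rebuilds the example data with data.frame() so that it runs on its own, and the name complete_counts is made up for illustration.

```r
# Example data from the question, rebuilt here so the snippet is self-contained
df  <- data.frame(subject_id = c("191-5467", "191-6784", "191-3457",
                                 "191-0987", "191-1245", "191-2365"),
                  edta_complete = c("1", "0", "1", "1", "1", "0"))
df1 <- data.frame(subject_id = df$subject_id,
                  edta_complete = rep("1", 6))

# Count the complete entries in each data frame
# (edta_complete is stored as character, so compare against "1")
complete_counts <- sapply(list(df = df, df1 = df1),
                          function(d) sum(d$edta_complete == "1"))
complete_counts
```

This should give a named vector with 4 for df and 6 for df1, matching the two nrow() results above.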

I need code that shows that 191-6784 and 191-2365 in df1 differ from df. I hope this makes sense.

CodePudding user response：

We can use setdiff to find the subject_id values that are complete in df1 but not in df:

setdiff(with(df1, subject_id[edta_complete == 1]), 
      with(df, subject_id[edta_complete == 1]))
[1] "191-6784" "191-2365"
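
Note that setdiff is directional: it only returns values from its first argument. If it is not known in advance which dataset has more complete entries, both directions may need checking. A minimal sketch, assuming the example data from the question (rebuilt here so it runs on its own); complete_ids, only_in_df1 and only_in_df are hypothetical names chosen for illustration:

```r
# Example data from the question, rebuilt here so the snippet is self-contained
df  <- data.frame(subject_id = c("191-5467", "191-6784", "191-3457",
                                 "191-0987", "191-1245", "191-2365"),
                  edta_complete = c("1", "0", "1", "1", "1", "0"))
df1 <- data.frame(subject_id = df$subject_id,
                  edta_complete = rep("1", 6))

# Hypothetical helper: subject_ids whose edta_complete is "1"
complete_ids <- function(d) d$subject_id[d$edta_complete == "1"]

# Complete in df1 but not in df
only_in_df1 <- setdiff(complete_ids(df1), complete_ids(df))
# Complete in df but not in df1 (empty for this example)
only_in_df  <- setdiff(complete_ids(df), complete_ids(df1))

only_in_df1
only_in_df
```

For this data only_in_df1 holds the two differing ids and only_in_df is empty, because every subject that is complete in df is also complete in df1.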

Or use anti_join

library(dplyr)
df1 %>% 
  filter(edta_complete == 1) %>% 
  anti_join(df %>%
      filter(edta_complete == 1), by = 'subject_id') %>% 
  pull(subject_id)
[1] "191-6784" "191-2365"

CodePudding user response：

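Another option is bind_cols(). Because it binds columns by position, it assumes both data frames contain the same subjects in the same row order: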

library(dplyr)

bind_cols(df, df1) %>% 
  filter(edta_complete...2 != edta_complete...4) %>% 
  pull(subject_id...1)

[1] "191-6784" "191-2365"
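
If the rows might not be in the same order in both data frames, matching on subject_id is safer than binding by position. A minimal base R sketch using merge(), assuming the example data from the question (df1 is deliberately reversed here to show that row order does not matter); the names merged and differing and the suffixes are chosen only for illustration:

```r
# Example data from the question, rebuilt here so the snippet is self-contained
df  <- data.frame(subject_id = c("191-5467", "191-6784", "191-3457",
                                 "191-0987", "191-1245", "191-2365"),
                  edta_complete = c("1", "0", "1", "1", "1", "0"))
df1 <- data.frame(subject_id = df$subject_id,
                  edta_complete = rep("1", 6))

# Reverse df1 to simulate a different row order
df1 <- df1[rev(seq_len(nrow(df1))), ]

# Match rows by subject_id; the suffixes label which data frame each column came from
merged <- merge(df, df1, by = "subject_id", suffixes = c("_df", "_df1"))

# Keep the subject_ids whose completion status differs between the two
differing <- merged$subject_id[merged$edta_complete_df != merged$edta_complete_df1]
differing
```

This should return the same two ids, 191-2365 and 191-6784. They may come back in a different order than above, since merge() sorts the result by the key column by default.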