I have two datasets, df and df1, each with a column edta_complete in which 0 means incomplete and 1 means complete. I am trying to compare this column between df and df1: 1) I need to compare the counts of subject_ids with complete EDTA in both datasets. 2) If one dataset has more complete entries than the other, I want to show the subject_ids that differ. Please see the example below:
df:
df <- structure (list(subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-2365"), edta_complete = c("1","0","1","1","1","0")), class = "data.frame", row.names = c (NA, -6L))
df1:
df1 <- structure (list(subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-2365"), edta_complete = c("1","1","1","1","1","1")), class = "data.frame", row.names = c (NA, -6L))
Counts of rows where edta_complete is 1:
df %>% filter(edta_complete == 1) %>% nrow()
[1] 4
df1 %>% filter(edta_complete == 1) %>% nrow()
[1] 6
I need code that will show me that 191-6784 and 191-2365 in df1 differ from df.
Hope this makes sense.
CodePudding user response:
We can use setdiff to find the subject_ids that are complete in df1 but not complete in df:
setdiff(with(df1, subject_id[edta_complete == 1]),
with(df, subject_id[edta_complete == 1]))
[1] "191-6784" "191-2365"
Or use anti_join:
library(dplyr)
df1 %>%
  filter(edta_complete == 1) %>%
  anti_join(df %>%
              filter(edta_complete == 1), by = 'subject_id') %>%
  pull(subject_id)
[1] "191-6784" "191-2365"
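Both snippets above check one direction only (complete in df1 but not in df). If it is not known in advance which dataset has more complete entries, a minimal base R sketch along these lines could report the counts and the differences in both directions. The variable names complete_df, complete_df1, only_in_df1 and only_in_df are made up for illustration, and the data is the example from the question:

```r
# Example data from the question
df <- data.frame(
  subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-2365"),
  edta_complete = c("1", "0", "1", "1", "1", "0")
)
df1 <- data.frame(
  subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-2365"),
  edta_complete = c("1", "1", "1", "1", "1", "1")
)

# subject_ids marked complete in each dataset (hypothetical helper names)
complete_df  <- df$subject_id[df$edta_complete == 1]
complete_df1 <- df1$subject_id[df1$edta_complete == 1]

# 1) Counts of complete entries per dataset
c(df = length(complete_df), df1 = length(complete_df1))
#  df df1
#   4   6

# 2) subject_ids that are complete in one dataset but not in the other
only_in_df1 <- setdiff(complete_df1, complete_df)
only_in_df  <- setdiff(complete_df, complete_df1)

only_in_df1
# [1] "191-6784" "191-2365"
only_in_df
# character(0)
```

Since setdiff is not symmetric, running it both ways covers the case where df, rather than df1, has the extra complete entries.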
CodePudding user response:
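Another option is bind_cols(), which renames the duplicated columns with positional suffixes (subject_id...1, edta_complete...2, subject_id...3, edta_complete...4):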
library(dplyr)
bind_cols(df, df1) %>%
filter(edta_complete...2 != edta_complete...4) %>%
pull(subject_id...1)
[1] "191-6784" "191-2365"
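Note that bind_cols() pairs rows purely by position, so the result is only meaningful when both data frames list the subjects in the same order. Where that cannot be guaranteed, one possible variant is to match on subject_id with inner_join() instead. This is a sketch using the question's example data; the suffixes "_df" and "_df1" and the name differing are arbitrary choices for illustration:

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-2365"),
  edta_complete = c("1", "0", "1", "1", "1", "0")
)
df1 <- data.frame(
  subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-2365"),
  edta_complete = c("1", "1", "1", "1", "1", "1")
)

# Match rows by subject_id rather than by position, then keep the
# subjects whose edta_complete value differs between the two datasets
differing <- inner_join(df, df1, by = "subject_id", suffix = c("_df", "_df1")) %>%
  filter(edta_complete_df != edta_complete_df1) %>%
  pull(subject_id)

differing
# [1] "191-6784" "191-2365"
```

Unlike the setdiff and anti_join approaches, this flags any mismatch in either direction, and inner_join() silently drops subjects that appear in only one of the two data frames.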