I have a df1 with 820 rows:
ID College Score Score3
EI01 0 1 2
EI01 0 1 6
EI08,EI07 1 4 4
EI08,EI07 1 4 8
EI02 0 0 9
EI05 1 2 2
EI06 1 10 12
I have a df2 with 713 rows:
ID Points
EI01 20
EI08,EI07 12
EI02 30
EI04 10
I have tried merging these two df by "ID" with all=FALSE but my new merged df is 864 rows, when I would only like it to maximally be 820 rows given the # in df1. I would like to know what I am doing wrong. Additionally, I would like to make a df3 that includes all of the IDs in df1 that are not present in df2.
CodePudding user response:
If you want to join data and have an automatic QC check, you could consider using the tidylog
package, which replaces many dplyr
functions with more verbose ones.
Here's an example of tidylog::left_join()
with your sample data:
> left_join(df1,df2)
Joining, by = "ID"
left_join: added one column (Points)
> rows only in x 2
> rows only in y (1)
> matched rows 5
> ===
> rows total 7
ID College Score Score3 Points
1: EI01 0 1 2 20
2: EI01 0 1 6 20
3: EI08,EI07 1 4 4 12
4: EI08,EI07 1 4 8 12
5: EI02 0 0 9 30
6: EI05 1 2 2 NA
7: EI06 1 10 12 NA
To find IDs that are in df1
and not in df2
you can use:
df1 %>%
filter(!ID %in% df2$ID) %>%
pull(ID)
[1] "EI05" "EI06"