How to quality check this data in R?-CodePudding

I have a df1 with 820 rows:

ID        College   Score    Score3
EI01      0         1        2
EI01      0         1        6
EI08,EI07 1         4        4
EI08,EI07 1         4        8
EI02      0         0        9
EI05      1         2        2
EI06      1         10       12

I have a df2 with 713 rows:

ID          Points
EI01        20
EI08,EI07   12
EI02        30
EI04        10

I have tried merging these two df by "ID" with all=FALSE but my new merged df is 864 rows, when I would only like it to maximally be 820 rows given the # in df1. I would like to know what I am doing wrong. Additionally, I would like to make a df3 that includes all of the IDs in df1 that are not present in df2.

CodePudding user response：

If you want to join data and have an automatic QC check, you could consider using the tidylog package, which replaces many dplyr functions with more verbose ones.

Here's an example of tidylog::left_join() with your sample data:

> left_join(df1,df2)
Joining, by = "ID"
left_join: added one column (Points)
           > rows only in x   2
           > rows only in y  (1)
           > matched rows     5
           >                 ===
           > rows total       7
          ID College Score Score3 Points
1:      EI01       0     1      2     20
2:      EI01       0     1      6     20
3: EI08,EI07       1     4      4     12
4: EI08,EI07       1     4      8     12
5:      EI02       0     0      9     30
6:      EI05       1     2      2     NA
7:      EI06       1    10     12     NA

To find IDs that are in df1 and not in df2 you can use:

df1 %>% 
  filter(!ID %in% df2$ID) %>% 
  pull(ID)


[1] "EI05" "EI06"