Check whether two columns in one dataframe have the same values [aside from NAs] in R

I have 1 dataframe, j:

Chr|Pos|A0|A1|rsID|Beta-A1|P|info|maf|se|rsid  
1|16021|C|T|NA|0.410|0.26|0.842|0.01|NA|rs1163602158   
1|17483|C|T|rs845637483|-0.356|0.32|0.856|0.01|NA|rs845637483    
1|19250|T|C|NA|-0.255|0.54|0.812|0.01|NA|rs7465843777   
1|39402|T|TCAA|NA|-0.873|0.37|0.821|0.01|NA|rs2746475333   
1|39883|G|C|NA|0.195|0.59|0.808|0.01|NA|rs2726463882

I want to check whether the values in rsID and rsid are the same in each row, ASIDE from the NAs in the former column.

So I can do

table(ifelse(j$rsID==j$rsid,"Yes","No"))

No      Yes
701232 18207968

And I can do

table(is.na(j$rsID))

FALSE     TRUE
18909200  2550533

table(is.na(j$rsid))

   FALSE
21459733

So I can see that there are 701232 instances where they don't match, but these cannot ALL be due to NAs, because there are MORE NAs (2550533) than mismatches?

Is there a cleaner way of doing this, so I can get a better picture of where the two columns differ?

Thanks

CodePudding user response：

You could remove the NAs, then filter for the rows where the two columns are not equal:

library(dplyr)
library(tidyr)
j %>%
  drop_na(rsID, rsid) %>%
  filter(rsID != rsid) # Or == instead of != to keep where they are equal
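
For readers without dplyr, roughly the same idea can be sketched in base R. which() keeps only the positions where the comparison is TRUE, so rows with NA in rsID are skipped without a separate NA-removal step. The toy j below reuses the ID columns from the sample rows; the last row (Pos 40000, rs111 vs rs222) is a hypothetical mismatch added only so there is something to find.

```r
# Toy data: ID columns from the sample rows, plus one hypothetical
# mismatching row (Pos 40000, rs111 vs rs222) added for illustration.
j <- data.frame(
  Pos  = c(16021, 17483, 19250, 39402, 39883, 40000),
  rsID = c(NA, "rs845637483", NA, NA, NA, "rs111"),
  rsid = c("rs1163602158", "rs845637483", "rs7465843777",
           "rs2746475333", "rs2726463882", "rs222"),
  stringsAsFactors = FALSE
)

# j$rsID != j$rsid is NA wherever rsID is NA; which() ignores those
# positions and returns only the indices where the comparison is TRUE.
mismatches <- j[which(j$rsID != j$rsid), ]
print(mismatches)
```

Indexing with the bare logical vector instead of which() would return extra all-NA rows for the NA comparisons, which is why which() is used here.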

CodePudding user response：

Another dplyr option:

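library(dplyr)
j %>%
  rowwise() %>%
  # anyDuplicated() gives 2 when rsid and rsID are equal, 0 otherwise (NAs are dropped first)
  mutate(duplicate = anyDuplicated(na.omit(c(rsid, rsID)))) %>%
  mutate(duplicate = ifelse(duplicate > 1, "Yes", "No")) %>%
  count(duplicate)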

Output

# A tibble: 2 x 2
# Rowwise: 
  duplicate     n
  <chr>     <int>
1 No            4
2 Yes           1
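
Because rowwise() evaluates the expression once per row, it may be slow on a data frame of about 21 million rows like the one in the question. A vectorized sketch in base R that should give the same Yes/No split on the sample rows (the toy j here contains only the two ID columns):

```r
# Toy data: the two ID columns from the five sample rows in the question.
j <- data.frame(
  rsID = c(NA, "rs845637483", NA, NA, NA),
  rsid = c("rs1163602158", "rs845637483", "rs7465843777",
           "rs2746475333", "rs2726463882"),
  stringsAsFactors = FALSE
)

# TRUE only when both values are present and equal; the is.na() guards
# turn the NA comparisons into FALSE instead of NA.
dup <- !is.na(j$rsID) & !is.na(j$rsid) & j$rsID == j$rsid

# Label and count in one pass over the whole column.
result <- table(ifelse(dup, "Yes", "No"))
print(result)
```

As in the rowwise version, rows where rsID is NA end up under "No" rather than in a category of their own.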

CodePudding user response：

We can use base R

with(na.omit(j[c("rsID", "rsid")]), table(ifelse(rsID == rsid, "Yes", "No")))
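
The counts in the question are consistent with the NAs already being excluded from the first table: 701232 + 18207968 = 18909200, which is exactly the number of non-NA rsID values. This is because == returns NA when either side is NA, and table() drops NA by default. A minimal sketch that makes the third category visible with useNA, on toy data (the rs111/rs222 row is a hypothetical mismatch added for illustration):

```r
# Toy data: ID columns from the sample rows, plus one hypothetical
# mismatching row (rs111 vs rs222) so that all three outcomes occur.
j <- data.frame(
  rsID = c(NA, "rs845637483", NA, NA, NA, "rs111"),
  rsid = c("rs1163602158", "rs845637483", "rs7465843777",
           "rs2746475333", "rs2726463882", "rs222"),
  stringsAsFactors = FALSE
)

# The comparison has three states: TRUE (match), FALSE (mismatch),
# and NA (at least one of the two values is missing).
same <- j$rsID == j$rsid

# useNA = "ifany" reports the NA comparisons as their own column
# instead of silently dropping them.
counts <- table(same, useNA = "ifany")
print(counts)

# Sanity check on the figures quoted in the question.
stopifnot(701232 + 18207968 == 18909200)
```

With all three counts side by side, the FALSE column can be read directly as the number of genuine mismatches among rows where both IDs are present.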

CodePudding user response：

# Load dplyr library
library(dplyr, warn.conflicts = FALSE, quietly = TRUE)

# you already have j defined so this step is only for this demo
j <- tibble(Chr = c(1, 1, 1, 1, 1), 
            Pos = c(16021, 17483, 19250, 39402, 39883), 
            A0 = c("C", "C", "T", "T", "G"), 
            A1 = c("T", "T", "C", "TCAA", "C"), 
            rsID = c(NA, "rs845637483", NA, NA, NA), 
            `Beta-A1` = c(0.41, -0.356, -0.255, -0.873, 0.195), 
            P = c(0.26, 0.32, 0.54, 0.37, 0.59), 
            info = c(0.842, 0.856, 0.812, 0.821, 0.808), 
            maf = c(0.01, 0.01, 0.01, 0.01, 0.01), 
            se = c(NA, NA, NA, NA, NA), 
            rsid = c("rs1163602158", "rs845637483", "rs7465843777", "rs2746475333", "rs2726463882"))

# create a column is_same and use count();
# rows where the comparison is NA (rsID missing) are labelled "No" via `missing`
j %>% 
  mutate(is_same = if_else(rsid == rsID, "Yes", "No", missing = "No")) %>% 
  count(is_same)
#> # A tibble: 2 x 2
#>   is_same     n
#>   <chr>   <int>
#> 1 No          4
#> 2 Yes         1