I have a df like this:
testdf <- structure(list(POS = c(37, 44, 50, 83), Col1 = c("A", "C", NA,
"G"), Col2 = c("A", NA, "T", "C")), class = "data.frame", row.names = c(NA,
-4L))
which looks like this:
POS Col1 Col2
[1,] "37" "A" "A"
[2,] "44" "C" NA
[3,] "50" NA "T"
[4,] "83" "G" "C"
I would like to exclude all rows where Col1 and Col2 hold the same value (here, only row 1). Unfortunately, I do not know how to deal with the NAs. When I try
testdf[testdf$Col1 != testdf$Col2,]
it does not treat NA as a value of its own: any comparison involving NA evaluates to NA instead of TRUE or FALSE.
The expected output should be:
POS Col1 Col2
[1,] "44" "C" NA
[2,] "50" NA "T"
[3,] "83" "G" "C"
I would rather not transform the NAs into something else. The following dplyr attempt does not work correctly either:
testdf %>%
rowwise %>%
filter(Col1 != Col2)
CodePudding user response:
You can add is.na() to your filter condition. You should also handle the case where both columns are NA; I added a row like this to your example data. If you want to keep these rows, then:
library(dplyr)
testdf %>%
filter(is.na(Col1) | is.na(Col2) | Col1 != Col2)
POS Col1 Col2
1 44 C <NA>
2 50 <NA> T
3 83 G C
4 99 <NA> <NA>
If you want to remove them, combine the two is.na() checks with xor() instead of |:
testdf %>%
filter(xor(is.na(Col1), is.na(Col2)) | Col1 != Col2)
POS Col1 Col2
1 44 C <NA>
2 50 <NA> T
3 83 G C
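The extra row is not shown in the answer, so here is a minimal base R sketch of both conditions, assuming the added row is POS = 99 with NA in both columns (which matches the printed output). The names testdf2, keep_with_both_na and keep_without_both_na are made up for illustration.

```r
# Question's data plus one assumed row where both columns are NA
testdf2 <- data.frame(
  POS  = c(37, 44, 50, 83, 99),
  Col1 = c("A", "C", NA, "G", NA),
  Col2 = c("A", NA, "T", "C", NA)
)

# Variant 1: keep a row if either value is missing or the values differ
keep_with_both_na <- with(testdf2, is.na(Col1) | is.na(Col2) | Col1 != Col2)

# Variant 2: keep a row if exactly one value is missing or the values differ
keep_without_both_na <- with(testdf2, xor(is.na(Col1), is.na(Col2)) | Col1 != Col2)

# For the both-NA row, variant 2 evaluates to NA (FALSE | NA).
# dplyr::filter() drops rows whose condition is NA; with base indexing,
# which() does the same job by keeping only the TRUE positions.
testdf2[keep_with_both_na, ]
testdf2[which(keep_without_both_na), ]
```

Note that indexing directly with keep_without_both_na (without which()) would produce a row of NAs for the last entry, which is why the xor() version is easiest to use inside filter().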
CodePudding user response:
NA == NA returns NA, but NA %in% NA returns TRUE. So you can use that in a mapply call to do a row-wise comparison:
testdf[!mapply(`%in%`, testdf$Col1, testdf$Col2),]
POS Col1 Col2
2 44 C <NA>
3 50 <NA> T
4 83 G C
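To see why this works, it may help to look at the intermediate logical vector. This is only a small sketch on the question's data; the variable name same is made up, and USE.NAMES = FALSE is added just to keep the result unnamed.

```r
# Data from the question
testdf <- data.frame(
  POS  = c(37, 44, 50, 83),
  Col1 = c("A", "C", NA, "G"),
  Col2 = c("A", NA, "T", "C")
)

na_equal   <- NA == NA     # NA: == propagates missing values
na_matched <- NA %in% NA   # TRUE: %in% treats NA as an ordinary value to match

# Element-wise %in%: TRUE only where Col1 and Col2 hold the same value
same <- mapply(`%in%`, testdf$Col1, testdf$Col2, USE.NAMES = FALSE)

# Negate to keep the rows that differ
testdf[!same, ]
```

Because %in% matches NA to NA, a row with NA in both columns would count as "the same" and be dropped by this approach.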
CodePudding user response:
testdf[testdf$Col1 != testdf$Col2 | is.na(testdf$Col1 != testdf$Col2), ]
# Or more concisely
testdf[with(testdf, Col1 != Col2 | is.na(Col1 != Col2)), ]
# POS Col1 Col2
# 2 44 C <NA>
# 3 50 <NA> T
# 4 83 G C