I am looking for a way to remove any rows where the values across each variable don't match, unless the lack of a match is due to one of the values (or more) being NA.
df <- data.frame(uid = c(1, 2, 3, 4, 5, 6, 7),
x1 = c('a', NA, NA, 'a', 'a', 'c', 'b'),
x2 = c(NA, 'b', NA, NA, NA, NA, 'b'),
x3 = c(NA, NA, 'c', 'a', 'b', NA, 'b'),
x4 = c(NA, NA, NA, NA, NA, NA, 'b'),
x5 = c(NA, NA, NA, NA, NA, 'b', 'b'))
df
uid x1 x2 x3 x4 x5
1 1 a <NA> <NA> <NA> <NA>
2 2 <NA> b <NA> <NA> <NA>
3 3 <NA> <NA> c <NA> <NA>
4 4 a <NA> a <NA> <NA>
5 5 a <NA> b <NA> <NA>
6 6 c <NA> <NA> <NA> b
7 7 b b b b b
What I am looking for is a method that would remove rows 5 and 6. I tried several methods using subset and was able to get it to work if I did:
subset(df, (x1 == x2 | is.na(x1) | is.na(x2)) & ...
(x4 == x5 | is.na(x4) | is.na(x5)))
However, I couldn't find a method that didn't involve typing out every combination, which is already impractical with just 5 columns, but not at all practical given the actual dataset.
CodePudding user response:
Not the cleanest solution, but I made a function that achieved the result.
Libraries
library(dplyr)
library(stringr)
Data
df <- data.frame(uid = c(1, 2, 3, 4, 5, 6, 7),
x1 = c('a', NA, NA, 'a', 'a', 'c', 'b'),
x2 = c(NA, 'b', NA, NA, NA, NA, 'b'),
x3 = c(NA, NA, 'c', 'a', 'b', NA, 'b'),
x4 = c(NA, NA, NA, NA, NA, NA, 'b'),
x5 = c(NA, NA, NA, NA, NA, 'b', 'b'))
Code
remove_rows <- function(string){
if(sum(is.na(string)) >= (length(string) -1)){
TRUE
}else{
string %>%
na.omit() %>%
str_c(.,collapse = "") %>%
str_count(string) %>%
max(na.rm = TRUE) > 1
}
}
df %>%
rowwise() %>%
filter(remove_rows(c_across(x1:x5)))
Output
# A tibble: 5 x 6
# Rowwise:
uid x1 x2 x3 x4 x5
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1 a NA NA NA NA
2 2 NA b NA NA NA
3 3 NA NA c NA NA
4 4 a NA a NA NA
5 7 b b b b b
CodePudding user response:
Remove the lines where the number of unique, non-NA values in columns of interest exceeds 1:
df[ ! apply(df[-1], 1, function(x) length(unique(na.omit(x))) ) >1, ]
#--------------------
uid x1 x2 x3 x4 x5
1 1 a <NA> <NA> <NA> <NA>
2 2 <NA> b <NA> <NA> <NA>
3 3 <NA> <NA> c <NA> <NA>
4 4 a <NA> a <NA> <NA>
7 7 b b b b b