I have the following dataframe
color sex id var value
1 blue 0 1 dog_clm NA
2 blue 0 1 dog_clm NA
3 blue 0 1 dog_clm 1
4 blue 1 2 dog_clm NA
5 blue 1 2 dog_clm 2
6 blue 1 2 dog_clm NA
7 blue 1 3 dog_clm NA
8 blue 1 3 dog_clm 6
9 blue 1 3 dog_clm NA
10 blue 1 4 dog_clm 7
11 blue 1 4 dog_clm NA
12 blue 1 4 dog_clm 1
I want to remove the ids where there are multiple values. How can I do this? Expected output would remove id 4 because it has values of 7 and 1.
Expected output
color sex id var value
1 blue 0 1 dog_clm NA
2 blue 0 1 dog_clm NA
3 blue 0 1 dog_clm 1
4 blue 1 2 dog_clm NA
5 blue 1 2 dog_clm 2
6 blue 1 2 dog_clm NA
7 blue 1 3 dog_clm NA
8 blue 1 3 dog_clm 6
9 blue 1 3 dog_clm NA
CodePudding user response:
Not sure why CIAndrews deleted their answer, but using their sample data with the extra all-NA group added:
df %>%
group_by(id) %>%
filter(n_distinct(value, na.rm = TRUE) <= 1)
# # A tibble: 12 × 5
# # Groups: id [4]
# color sex id var value
# <chr> <dbl> <int> <chr> <dbl>
# 1 blue 0 1 dog_clm NA
# 2 blue 0 1 dog_clm NA
# 3 blue 0 1 dog_clm 1
# 4 blue 1 2 dog_clm NA
# 5 blue 1 2 dog_clm 2
# 6 blue 1 2 dog_clm NA
# 7 blue 1 3 dog_clm NA
# 8 blue 1 3 dog_clm 6
# 9 blue 1 3 dog_clm NA
# 10 blue 1 5 dog_clm NA
# 11 blue 1 5 dog_clm NA
# 12 blue 1 5 dog_clm NA
Sample Data:
df <- structure(list(color = c("blue", "blue", "blue", "blue", "blue",
"blue", "blue", "blue", "blue", "blue", "blue", "blue", "blue",
"blue", "blue"), sex = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1), id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L,
4L, 5L, 5L, 5L), var = c("dog_clm", "dog_clm", "dog_clm", "dog_clm",
"dog_clm", "dog_clm", "dog_clm", "dog_clm", "dog_clm", "dog_clm",
"dog_clm", "dog_clm", "dog_clm", "dog_clm", "dog_clm"), value = c(NA,
NA, 1, NA, 2, NA, NA, 6, NA, 7, NA, 1, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-15L))
CodePudding user response:
Here's what I'd do:
- Group by id as the other comments have suggested.
- Find all the duplicate values, including the 1st instance. The base R function
duplicated
will find any instances where some value already exists but it won't tag the 1st unless you also include the argumentfromLast = TRUE
, in which case it will ignore the last duplicate. You can use both, though, together in an "or" statement.
Here's how I'd do this using a tidyverse approach:
MyDataFrame <- MyDataFrame %>%
group_by(id) %>%
filter(which(duplicated(value) | duplicated(value, fromLast = TRUE))
That should include NA values the way you want.