My dataframe has columns and rows like this
Id Date Col1 Col2 Col3 X1
1 1/1/22 NA 1 0
1 1/1/22 0 0 1 6
2 5/7/21 0 1 0
2 5/7/21 0 2 0
I like to drop rows where the duplicate row (same Id, same date) where values for column X1 is missing or empty. If both the rows are missing X1 for that ID and date then dont drop. Only when one is missing and other is not missing then drop the missing row.
Expected output
Id Date Col1 Col2 Col3 X1
1 1/1/22 0 0 1 6
2 5/7/21 0 1 0
2 5/7/21 0 2 0
I tried this
library(tidyr)
df %>%
group_by(Id, Date) %>%
drop_na(X1)
This drops all rows with NA or missing and I am just left with one row, which is not what I want. Any suggestions much apricated. Thanks.
CodePudding user response:
We can create a condition in filter
to return all
the rows if there are only missing values in 'X1' or just remove the missing rows
library(dplyr)
df %>%
group_by(Id, Date) %>%
filter(if(all(is.na(X1))) TRUE else complete.cases(X1)) %>%
ungroup
-output
# A tibble: 3 × 6
Id Date Col1 Col2 Col3 X1
<int> <chr> <int> <int> <int> <int>
1 1 1/1/22 0 0 1 6
2 2 5/7/21 0 1 0 NA
3 2 5/7/21 0 2 0 NA
Or without the if/else
, use |
with &
condition
df %>%
group_by(Id, Date) %>%
filter(any(complete.cases(X1)) & complete.cases(X1) |
all(is.na(X1))) %>%
ungroup
data
df <- structure(list(Id = c(1L, 1L, 2L, 2L), Date = c("1/1/22", "1/1/22",
"5/7/21", "5/7/21"), Col1 = c(NA, 0L, 0L, 0L), Col2 = c(1L, 0L,
1L, 2L), Col3 = c(0L, 1L, 0L, 0L), X1 = c(NA, 6L, NA, NA)),
class = "data.frame", row.names = c(NA,
-4L))