Home > OS >  r retain rows only if non missing in specific column
r retain rows only if non missing in specific column

Time:04-03

My dataframe has columns and rows like this

   Id    Date    Col1    Col2    Col3    X1
   1     1/1/22  NA      1       0       
   1     1/1/22  0       0       1       6
   2     5/7/21  0       1       0       
   2     5/7/21  0       2       0      

I like to drop rows where the duplicate row (same Id, same date) where values for column X1 is missing or empty. If both the rows are missing X1 for that ID and date then dont drop. Only when one is missing and other is not missing then drop the missing row.

Expected output

  Id    Date    Col1    Col2    Col3    X1
   1     1/1/22  0       0       1       6
   2     5/7/21  0       1       0       
   2     5/7/21  0       2       0     

I tried this

  library(tidyr)
  df %>%
  group_by(Id, Date) %>%
  drop_na(X1)

This drops all rows with NA or missing and I am just left with one row, which is not what I want. Any suggestions much apricated. Thanks.

CodePudding user response:

We can create a condition in filter to return all the rows if there are only missing values in 'X1' or just remove the missing rows

library(dplyr)
df %>% 
   group_by(Id, Date) %>% 
   filter(if(all(is.na(X1))) TRUE else complete.cases(X1)) %>%
   ungroup

-output

# A tibble: 3 × 6
     Id Date    Col1  Col2  Col3    X1
  <int> <chr>  <int> <int> <int> <int>
1     1 1/1/22     0     0     1     6
2     2 5/7/21     0     1     0    NA
3     2 5/7/21     0     2     0    NA

Or without the if/else, use | with & condition

df %>% 
  group_by(Id, Date) %>%
  filter(any(complete.cases(X1)) & complete.cases(X1) | 
       all(is.na(X1))) %>% 
  ungroup

data

df <- structure(list(Id = c(1L, 1L, 2L, 2L), Date = c("1/1/22", "1/1/22", 
"5/7/21", "5/7/21"), Col1 = c(NA, 0L, 0L, 0L), Col2 = c(1L, 0L, 
1L, 2L), Col3 = c(0L, 1L, 0L, 0L), X1 = c(NA, 6L, NA, NA)), 
class = "data.frame", row.names = c(NA, 
-4L))
  • Related