Home > Software engineering >  Filter for NAs only if group has more than 1 observation in dplyr
Filter for NAs only if group has more than 1 observation in dplyr

Time:11-19

I would like to be able to keep NAs for only groups that have more than two entries and just want to leave alone any groups that have 1 entry (regardless if they have NAs or not). That is, if the group has two elements, keep only the NA. If it has one then just take whatever is there. Here is a reprex of the type of data I have:

library(dplyr)

data <- data.frame(
  x = c(1, NA_real_, 3, NA_real_),
  y = c("grp1", "grp2", "grp2", "grp3")
)
data
#>    x    y
#> 1  1 grp1
#> 2 NA grp2
#> 3  3 grp2
#> 4 NA grp3

Then here is the fairly ugly way I have achieved what I want:

raw <- data %>%
  group_by(y) %>%
  mutate(n = n())

results <- bind_rows(
  raw %>%
    filter(n == 2) %>%
    filter(is.na(x)),
  raw %>%
    filter(n == 1)
) %>%
  ungroup() %>%
  select(-n)

results
#> # A tibble: 3 × 2
#>       x y    
#>   <dbl> <chr>
#> 1    NA grp2 
#> 2     1 grp1 
#> 3    NA grp3

CodePudding user response:

We could define filter with the max(row_number(): update: instead of max(row_number() we could use n() (many thanks to @Juan C)

library(dplyr)
data %>% 
  group_by(y) %>% 
   filter(!(is.na(x) & n() == 1))
 # filter(!(is.na(x) & max(row_number()) == 1))
     x y    
  <dbl> <chr>
1     1 grp1 
2    NA grp2 
3     3 grp2 

CodePudding user response:

Here's another approach that works

data %>% 
  group_by(y) %>%  
  filter(!(!is.na(x) & n() > 1))
# A tibble: 3 × 2
# Groups:   y [3]
      x y    
  <dbl> <chr>
1     1 grp1 
2    NA grp2 
3    NA grp3

CodePudding user response:

You specifically asked to do using dplyr, however, you can do this using data.table. It is more intuitive to follow:

library(data.table)
setDT(data)
data <- data[, N := .N, by = y][N <= 1 | !is.na(x)][, N := NULL]
data

Output:

   x    y
1:  1 grp1
2:  3 grp2
3: NA grp3

If the objective is to retain NAs cases if group has more than one observations, then follow:

data <- data[, N := .N, by = y][N <= 1 | is.na(x)][, N := NULL]
data

Output:

    x    y
1:  1 grp1
2: NA grp2
3: NA grp3
  • Related