Home > Software design >  Filter rows at first occurence of specific value by group
Filter rows at first occurence of specific value by group

Time:09-02

I have data with a grouping variable "id" and some values in "x". If a group has more than one NA in 'x', e.g. id 2, I need to filter rows at the index of the first NA.

data <- data.frame(id = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
                   x = c('no','no','yes','yes','yes','no',NA,NA,NA,'yes','no','no',NA,NA,'yes','yes','yes', 'yes', 'yes'))
 data
   id    x
1   1   no
2   1   no
3   1  yes
4   1  yes
5   2  yes
6   2   no
7   2 <NA> # slice id 2 at this position
8   2 <NA>
9   2 <NA>
10  3  yes
11  3   no
12  3   no
13  3 <NA>  # slice id 3 at this position 
14  3 <NA>
15  4  yes
16  4  yes
17  4  yes
18  4  yes
19  4  yes

Desired outcome

 data
   id    x
1   1   no
2   1   no
3   1  yes
4   1  yes
5   2  yes
6   2   no
7   2 <NA>
8   3  yes
9   3   no
10  3   no
11  3 <NA>
12  4  yes
13  4  yes
14  4  yes
15  4  yes
19  4  yes

I try it using

library(tidyverse)
Sliced <- data %>% group_by(id) %>%  
    slice(seq_len(min(which(x==NA)[1], n())))

CodePudding user response:

How about this:

data %>%
  group_by(id) %>%
  mutate(nas = cumsum(is.na(x))) %>%  # counts cumulative NAs in x
  filter(nas <= 1) %>%                # remove if we've seen more than 1 NA for the id
  select(-i)                          # drop temp variable

Note that you didn't specify what assumptions can be made about the incoming data. For example:

  • is it always in some order such that the NAs/non-NA values don't get jumbled:
  • will non-NA values followed by NA values ever return to non-NA again?
  • do you still want to slice if you've seen just one NA (might be implied, given the previous answer)?
  • what if there's just 1 NA and no non-NA values preceding?

CodePudding user response:

Various dplyr-options:

A filter and a lag? (Depending on data structure you might want to group_by(id))

library(dplyr)

data |> 
  filter(!is.na(x) | (is.na(x) & !is.na(lag(x))))

Or with slice as you suggest yourself:

library(dplyr)

data |>
  group_by(id) |>
  slice(1:first(which(is.na(x)), default = n())) |>
  ungroup()

Or using duplicated like @ThomasIsCoding:

library(dplyr)

data |>
  group_by(id) |>
  filter(!(duplicated(x) & is.na(x))) |>
  ungroup()

Not forgetting @Hendys: filter(cumsum(is.na(x)) < 2)

Output:

   id    x
1   1   no
2   1   no
3   1  yes
4   1  yes
5   2  yes
6   2   no
7   2 <NA>
8   3  yes
9   3   no
10  3   no
11  3 <NA>
12  4  yes
13  4  yes
14  4  yes
15  4  yes
16  4  yes

CodePudding user response:

A base R option using subset ave duplicated

subset(
  data,
  !(ave(is.na(x), id, is.na(x), FUN = duplicated) & is.na(x))
)

gives

   id    x
1   1   no
2   1   no
3   1  yes
4   1  yes
5   2  yes
6   2   no
7   2 <NA>
10  3  yes
11  3   no
12  3   no
13  3 <NA>
15  4  yes
16  4  yes
17  4  yes
18  4  yes
19  4  yes
  • Related