Conditionally removing groups based on NA values-CodePudding

I have a follow up to this question. Here is a dataset

df <- tibble(city =c("la", "la", "la", "la", "la", "nyc", "nyc", "nyc", "nyc", "nyc", "bos", "bos", "bos", "bos", "bos"),
             year = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
             value=c(NA, NA, NA, 20, 25, 18, 29, 24, 17, 30, 12, 19, 17, 22, NA))

# A tibble: 15 × 3
   city   year value
   <chr> <dbl> <dbl>
 1 la        1    NA
 2 la        2    NA
 3 la        3    NA
 4 la        4    20
 5 la        5    25
 6 nyc       1    18
 7 nyc       2    29
 8 nyc       3    24
 9 nyc       4    17
10 nyc       5    30
11 bos       1    12
12 bos       2    19
13 bos       3    17
14 bos       4    22
15 bos       5    NA

I'm trying remove any cities that don't have data for at least years 4 and 5. In this case, I would like to keep all 5 rows for la and nyc, but remove all 5 rows for bos as it doesn't have data for year 5.

The solution to the previous question provided by @onyambu:

df %>%
  group_by(city) %>%
  filter(!anyNA(value))

would successfully remove both la and bos, but I'd like to keep la as it has data for 2 years of interest. Is this possible? Thank you in advance.

CodePudding user response：

Just adjust your code a little bit:

df %>%
  group_by(city) %>%
  filter( !anyNA(value[year %in% 4:5]) )

# # A tibble: 10 × 3
# # Groups:   city [2]
#    city   year value
#    <chr> <dbl> <dbl>
#  1 la        1    NA
#  2 la        2    NA
#  3 la        3    NA
#  4 la        4    20
#  5 la        5    25
#  6 nyc       1    18
#  7 nyc       2    29
#  8 nyc       3    24
#  9 nyc       4    17
# 10 nyc       5    30

CodePudding user response：

Here is one potential solution:

library(dplyr)

df <- tibble(city =c("la", "la", "la", "la", "la", "nyc", "nyc", "nyc", "nyc", "nyc", "bos", "bos", "bos", "bos", "bos"),
             year = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
             value=c(NA, NA, NA, 20, 25, 18, 29, 24, 17, 30, 12, 19, 17, 22, NA))
df %>%
  group_by(city) %>%
  filter(!any(year %in% 4:5 & is.na(value)))
#> # A tibble: 10 × 3
#> # Groups:   city [2]
#>    city   year value
#>    <chr> <dbl> <dbl>
#>  1 la        1    NA
#>  2 la        2    NA
#>  3 la        3    NA
#>  4 la        4    20
#>  5 la        5    25
#>  6 nyc       1    18
#>  7 nyc       2    29
#>  8 nyc       3    24
#>  9 nyc       4    17
#> 10 nyc       5    30

^{Created on 2022-09-07 by the reprex package (v2.0.1)}

CodePudding user response：

This approach should cover the general case too where you want to make sure the final two rows aren't NA (not just the 4th and 5th year)!

library(tidyverse)

df <- tibble(city =c("la", "la", "la", "la", "la", "nyc", "nyc", "nyc", "nyc", "nyc", "bos", "bos", "bos", "bos", "bos"),
             year = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
             value=c(NA, NA, NA, 20, 25, 18, 29, 24, 17, 30, 12, 19, 17, 22, NA))

df |> 
  group_by(city) |> 
  # Remove the groups the don't have values in the last two years
  filter(!is.na(last(value)) & !is.na(nth(value, n()-1)))
#> # A tibble: 10 × 3
#> # Groups:   city [2]
#>    city   year value
#>    <chr> <dbl> <dbl>
#>  1 la        1    NA
#>  2 la        2    NA
#>  3 la        3    NA
#>  4 la        4    20
#>  5 la        5    25
#>  6 nyc       1    18
#>  7 nyc       2    29
#>  8 nyc       3    24
#>  9 nyc       4    17
#> 10 nyc       5    30

^{Created on 2022-09-07 by the reprex package (v2.0.1)}