Home > Software engineering >  Conditionally removing groups based on NA values
Conditionally removing groups based on NA values

Time:09-07

I have a follow up to this question. Here is a dataset

df <- tibble(city =c("la", "la", "la", "la", "la", "nyc", "nyc", "nyc", "nyc", "nyc", "bos", "bos", "bos", "bos", "bos"),
             year = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
             value=c(NA, NA, NA, 20, 25, 18, 29, 24, 17, 30, 12, 19, 17, 22, NA))

# A tibble: 15 × 3
   city   year value
   <chr> <dbl> <dbl>
 1 la        1    NA
 2 la        2    NA
 3 la        3    NA
 4 la        4    20
 5 la        5    25
 6 nyc       1    18
 7 nyc       2    29
 8 nyc       3    24
 9 nyc       4    17
10 nyc       5    30
11 bos       1    12
12 bos       2    19
13 bos       3    17
14 bos       4    22
15 bos       5    NA

I'm trying remove any cities that don't have data for at least years 4 and 5. In this case, I would like to keep all 5 rows for la and nyc, but remove all 5 rows for bos as it doesn't have data for year 5.

The solution to the previous question provided by @onyambu:

df %>%
  group_by(city) %>%
  filter(!anyNA(value))

would successfully remove both la and bos, but I'd like to keep la as it has data for 2 years of interest. Is this possible? Thank you in advance.

CodePudding user response:

Just adjust your code a little bit:

df %>%
  group_by(city) %>%
  filter( !anyNA(value[year %in% 4:5]) )

# # A tibble: 10 × 3
# # Groups:   city [2]
#    city   year value
#    <chr> <dbl> <dbl>
#  1 la        1    NA
#  2 la        2    NA
#  3 la        3    NA
#  4 la        4    20
#  5 la        5    25
#  6 nyc       1    18
#  7 nyc       2    29
#  8 nyc       3    24
#  9 nyc       4    17
# 10 nyc       5    30

CodePudding user response:

Here is one potential solution:

library(dplyr)

df <- tibble(city =c("la", "la", "la", "la", "la", "nyc", "nyc", "nyc", "nyc", "nyc", "bos", "bos", "bos", "bos", "bos"),
             year = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
             value=c(NA, NA, NA, 20, 25, 18, 29, 24, 17, 30, 12, 19, 17, 22, NA))
df %>%
  group_by(city) %>%
  filter(!any(year %in% 4:5 & is.na(value)))
#> # A tibble: 10 × 3
#> # Groups:   city [2]
#>    city   year value
#>    <chr> <dbl> <dbl>
#>  1 la        1    NA
#>  2 la        2    NA
#>  3 la        3    NA
#>  4 la        4    20
#>  5 la        5    25
#>  6 nyc       1    18
#>  7 nyc       2    29
#>  8 nyc       3    24
#>  9 nyc       4    17
#> 10 nyc       5    30

Created on 2022-09-07 by the reprex package (v2.0.1)

CodePudding user response:

This approach should cover the general case too where you want to make sure the final two rows aren't NA (not just the 4th and 5th year)!

library(tidyverse)

df <- tibble(city =c("la", "la", "la", "la", "la", "nyc", "nyc", "nyc", "nyc", "nyc", "bos", "bos", "bos", "bos", "bos"),
             year = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
             value=c(NA, NA, NA, 20, 25, 18, 29, 24, 17, 30, 12, 19, 17, 22, NA))

df |> 
  group_by(city) |> 
  # Remove the groups the don't have values in the last two years
  filter(!is.na(last(value)) & !is.na(nth(value, n()-1)))
#> # A tibble: 10 × 3
#> # Groups:   city [2]
#>    city   year value
#>    <chr> <dbl> <dbl>
#>  1 la        1    NA
#>  2 la        2    NA
#>  3 la        3    NA
#>  4 la        4    20
#>  5 la        5    25
#>  6 nyc       1    18
#>  7 nyc       2    29
#>  8 nyc       3    24
#>  9 nyc       4    17
#> 10 nyc       5    30

Created on 2022-09-07 by the reprex package (v2.0.1)

  • Related