I have a follow up to this question. Here is a dataset
df <- tibble(city =c("la", "la", "la", "la", "la", "nyc", "nyc", "nyc", "nyc", "nyc", "bos", "bos", "bos", "bos", "bos"),
year = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
value=c(NA, NA, NA, 20, 25, 18, 29, 24, 17, 30, 12, 19, 17, 22, NA))
# A tibble: 15 × 3
city year value
<chr> <dbl> <dbl>
1 la 1 NA
2 la 2 NA
3 la 3 NA
4 la 4 20
5 la 5 25
6 nyc 1 18
7 nyc 2 29
8 nyc 3 24
9 nyc 4 17
10 nyc 5 30
11 bos 1 12
12 bos 2 19
13 bos 3 17
14 bos 4 22
15 bos 5 NA
I'm trying remove any cities that don't have data for at least years 4 and 5. In this case, I would like to keep all 5 rows for la and nyc, but remove all 5 rows for bos as it doesn't have data for year 5.
The solution to the previous question provided by @onyambu:
df %>%
group_by(city) %>%
filter(!anyNA(value))
would successfully remove both la and bos, but I'd like to keep la as it has data for 2 years of interest. Is this possible? Thank you in advance.
CodePudding user response:
Just adjust your code a little bit:
df %>%
group_by(city) %>%
filter( !anyNA(value[year %in% 4:5]) )
# # A tibble: 10 × 3
# # Groups: city [2]
# city year value
# <chr> <dbl> <dbl>
# 1 la 1 NA
# 2 la 2 NA
# 3 la 3 NA
# 4 la 4 20
# 5 la 5 25
# 6 nyc 1 18
# 7 nyc 2 29
# 8 nyc 3 24
# 9 nyc 4 17
# 10 nyc 5 30
CodePudding user response:
Here is one potential solution:
library(dplyr)
df <- tibble(city =c("la", "la", "la", "la", "la", "nyc", "nyc", "nyc", "nyc", "nyc", "bos", "bos", "bos", "bos", "bos"),
year = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
value=c(NA, NA, NA, 20, 25, 18, 29, 24, 17, 30, 12, 19, 17, 22, NA))
df %>%
group_by(city) %>%
filter(!any(year %in% 4:5 & is.na(value)))
#> # A tibble: 10 × 3
#> # Groups: city [2]
#> city year value
#> <chr> <dbl> <dbl>
#> 1 la 1 NA
#> 2 la 2 NA
#> 3 la 3 NA
#> 4 la 4 20
#> 5 la 5 25
#> 6 nyc 1 18
#> 7 nyc 2 29
#> 8 nyc 3 24
#> 9 nyc 4 17
#> 10 nyc 5 30
Created on 2022-09-07 by the reprex package (v2.0.1)
CodePudding user response:
This approach should cover the general case too where you want to make sure the final two rows aren't NA (not just the 4th and 5th year)!
library(tidyverse)
df <- tibble(city =c("la", "la", "la", "la", "la", "nyc", "nyc", "nyc", "nyc", "nyc", "bos", "bos", "bos", "bos", "bos"),
year = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
value=c(NA, NA, NA, 20, 25, 18, 29, 24, 17, 30, 12, 19, 17, 22, NA))
df |>
group_by(city) |>
# Remove the groups the don't have values in the last two years
filter(!is.na(last(value)) & !is.na(nth(value, n()-1)))
#> # A tibble: 10 × 3
#> # Groups: city [2]
#> city year value
#> <chr> <dbl> <dbl>
#> 1 la 1 NA
#> 2 la 2 NA
#> 3 la 3 NA
#> 4 la 4 20
#> 5 la 5 25
#> 6 nyc 1 18
#> 7 nyc 2 29
#> 8 nyc 3 24
#> 9 nyc 4 17
#> 10 nyc 5 30
Created on 2022-09-07 by the reprex package (v2.0.1)