How to group by a variable and detect if there are NAs in R-CodePudding

I am trying to figure out how to filter a table by a categorical variable in order to determine if there is a non NA entry in another variable's column. Imagine we have a table as so.

Country| year | population
---------------------------
USA    | 2000 | Na
Canada | 2000 | Na
Mexico | 2000 | 2000000
USA    | 2001 | Na
Canada | 2001 | Na
Mexico | 2001 | NA

I am trying to produce a table whose output would be something like this.

year | has_population
---------------------
2000 | True
2001 | False

Since 2000 has a value in the population column, it would be true while 2001 has no entries in the population columns so it would be false. I have tried grouping my table by year and running a summarize function on the new table but that does not give me the result I want. Any tips or suggestions?

CodePudding user response：

One option is to change "Na" to "NA" then use is.na(population) to evaluate each group, i.e.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- read.table(text = "Country | year | population
USA    | 2000 | Na
Canada | 2000 | Na
Mexico | 2000 | 2000000
USA    | 2001 | Na
Canada | 2001 | Na
Mexico | 2001 | NA",
sep = "|",
header = TRUE,
strip.white = TRUE)

df %>%
  mutate(population = ifelse(population == "Na", NA, population)) %>%
  group_by(year) %>%
  summarise(has_population = ifelse(any(!is.na(population)), "True", "False"))
#> # A tibble: 2 × 2
#>    year has_population
#>   <int> <chr>         
#> 1  2000 True          
#> 2  2001 False

^{Created on 2022-02-21 by the reprex package (v2.0.1)}

CodePudding user response：

base R option (but also with first converting the Na to NA as @jared_mamrot did):

df[df == "Na" ] <- NA

results <- aggregate(df$population, list(df$year), function(x){
  any(!is.na(x))})

names(results) <- c("year", "has_population")

Output

  year has_population
1 2000           TRUE
2 2001          FALSE

Data

df <- structure(list(Country = c("USA", "Canada", "Mexico", "USA", 
                                 "Canada", "Mexico"), 
                     year = c(2000L, 2000L, 2000L, 2001L, 2001L, 2001L), 
                     population = c("Na", "Na", "2000000", "Na", "Na", NA)), 
                class = "data.frame", row.names = c(NA, -6L))