I am trying to figure out how to filter a table by a categorical variable in order to determine if there is a non NA entry in another variable's column. Imagine we have a table as so.
Country| year | population
---------------------------
USA | 2000 | Na
Canada | 2000 | Na
Mexico | 2000 | 2000000
USA | 2001 | Na
Canada | 2001 | Na
Mexico | 2001 | NA
I am trying to produce a table whose output would be something like this.
year | has_population
---------------------
2000 | True
2001 | False
Since 2000 has a value in the population column, it would be true while 2001 has no entries in the population columns so it would be false. I have tried grouping my table by year and running a summarize function on the new table but that does not give me the result I want. Any tips or suggestions?
CodePudding user response:
One option is to change "Na" to "NA" then use is.na(population)
to evaluate each group, i.e.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- read.table(text = "Country | year | population
USA | 2000 | Na
Canada | 2000 | Na
Mexico | 2000 | 2000000
USA | 2001 | Na
Canada | 2001 | Na
Mexico | 2001 | NA",
sep = "|",
header = TRUE,
strip.white = TRUE)
df %>%
mutate(population = ifelse(population == "Na", NA, population)) %>%
group_by(year) %>%
summarise(has_population = ifelse(any(!is.na(population)), "True", "False"))
#> # A tibble: 2 × 2
#> year has_population
#> <int> <chr>
#> 1 2000 True
#> 2 2001 False
Created on 2022-02-21 by the reprex package (v2.0.1)
CodePudding user response:
base R option (but also with first converting the Na
to NA
as @jared_mamrot did):
df[df == "Na" ] <- NA
results <- aggregate(df$population, list(df$year), function(x){
any(!is.na(x))})
names(results) <- c("year", "has_population")
Output
year has_population
1 2000 TRUE
2 2001 FALSE
Data
df <- structure(list(Country = c("USA", "Canada", "Mexico", "USA",
"Canada", "Mexico"),
year = c(2000L, 2000L, 2000L, 2001L, 2001L, 2001L),
population = c("Na", "Na", "2000000", "Na", "Na", NA)),
class = "data.frame", row.names = c(NA, -6L))