I am creating a codebook for tidycensus variables I am pulling for all ACS5 years from 2009-2020. To make sure or to flag any differences in variables across years, I am trying to put a check column in at the end. However, there seems to be some change in the naming center at 2018, which I'd like to ignore in favor of identifying real problems.
#selecing and recoding variables to pull in
dv_acs = c(
same1 = "B25002_001",
same2 = "B25002_002",
diff1 = "C24010_039"
)
#creating a loop to pull in an join a codebook for all years
out <- map(2009:2020, ~ {
nm <- str_c(c("label", "concept"), "_", .x)
load_variables(.x, "acs5") %>%
select(-any_of("geography")) %>%
filter(name %in% dv_acs) %>%
mutate(id = names(dv_acs), .before = 1) %>%
rename_with(~ nm, c("label", "concept"))
}) %>%
reduce(full_join)
#putting in checks
out <- out %>%
rowwise %>%
mutate(label_flag = n_distinct(unlist(across(starts_with('label'),
~ as.character(.x)))) == 1) %>%
ungroup
Okay, from above, the first two variables (same1, same2), would get a TRUE value in the label_flag column if it worked how I want it to, but because there's a ":" introduced into the string in later years, it comes up false. For comparison, diff1 has a truly different value between the 2009 and later labels (it goes from "Estimate!!Total!Female" to "Estimate!!Total:!!Female:!!Management, business, science, and arts occupations:", this should show up as FALSE in the label_flag column.
I don't know if I should introduce something using grepl or put in a string dist somehow, and would appreciate any solutions ya got.
CodePudding user response:
We may use pmap
to loop over the rows of selected columns that starts with 'label', remove all the :
and then get the n_distinct
to check for only single unique value
library(dplyr)
library(purrr)
library(stringr)
out <- out %>%
mutate(label_flag = pmap_int(across(starts_with('label')),
~ n_distinct(str_remove_all(c(...), ":")) == 1))