Home > Software engineering >  checking if strings match across multiple columns using fuzzy matching or grep to remove certain gra
checking if strings match across multiple columns using fuzzy matching or grep to remove certain gra

Time:10-27

I am creating a codebook for tidycensus variables I am pulling for all ACS5 years from 2009-2020. To make sure or to flag any differences in variables across years, I am trying to put a check column in at the end. However, there seems to be some change in the naming center at 2018, which I'd like to ignore in favor of identifying real problems.

#selecing and recoding variables to pull in
dv_acs = c(
  same1          = "B25002_001", 
  same2       = "B25002_002", 
  diff1       = "C24010_039"
)


#creating a loop to pull in an join a codebook for all years
out <-  map(2009:2020, ~ {
  nm <- str_c(c("label", "concept"), "_", .x)
  load_variables(.x, "acs5") %>%
    select(-any_of("geography")) %>%
    filter(name %in% dv_acs) %>%
    mutate(id = names(dv_acs), .before = 1) %>%
    rename_with(~ nm, c("label", "concept"))
}) %>%
  reduce(full_join)


#putting in checks
out <- out %>% 
  rowwise %>% 
  mutate(label_flag = n_distinct(unlist(across(starts_with('label'), 
                                               ~ as.character(.x)))) == 1) %>%
  ungroup

Okay, from above, the first two variables (same1, same2), would get a TRUE value in the label_flag column if it worked how I want it to, but because there's a ":" introduced into the string in later years, it comes up false. For comparison, diff1 has a truly different value between the 2009 and later labels (it goes from "Estimate!!Total!Female" to "Estimate!!Total:!!Female:!!Management, business, science, and arts occupations:", this should show up as FALSE in the label_flag column.

I don't know if I should introduce something using grepl or put in a string dist somehow, and would appreciate any solutions ya got.

CodePudding user response:

We may use pmap to loop over the rows of selected columns that starts with 'label', remove all the : and then get the n_distinct to check for only single unique value

library(dplyr)
library(purrr)
library(stringr)
out <- out %>%
    mutate(label_flag = pmap_int(across(starts_with('label')),
       ~ n_distinct(str_remove_all(c(...), ":")) == 1))
  • Related