Conditionally join R tables based on nested membership-CodePudding

Consider this example nested dataframe of 3 counties, 3 towns, and a range of zip codes associated with them. Two of the towns share the same name (B), but are in different counties:

df <- tibble(
  county = c(1,1,1,2,2,2,2,3),
  town = c("A","A","A","B","B","B","B","B"),
  zip = c(12864,12865,12866,89501,89502,89503,89504,76512)) %>% 
  nest(data=c(zip))

I have another dataframe that contains town names, a zipcode, and a placeholder value, but is missing the county field:

df2 <- tibble(
  town = c("A", "B", "B"),
  zip = c(12866, 89504, 76512),
  value = c("foo", "bar", "ski"))

My real data has hundreds of instances of these duplicated town names, and I need to join these two tables together so that each town gets the correct placeholder value based on the zip code (not the town name, which has duplicates). However, dplyr only seems to join on equality. As such, I'm stuck - what I'm after is something like inner_join(df, df2, by = c(df2$zip %in% df$data$zip)), but that obviously doesn't work.

I'm also aware of data.table being able to handle inequality in joins, but this always seems to relate to greater than/less than conditionals. How can I successfully join these tables to return the following output, in situations where I have more than 3 neatly matched rows between the dataframes?

  county town  data             value
   <dbl> <chr> <list>           <chr>
1      1 A     <tibble [3 x 1]> foo  
2      2 B     <tibble [4 x 1]> bar  
3      3 B     <tibble [1 x 1]> ski

CodePudding user response：

I think you'll have to "roll your own join":

df %>% mutate(value = df2$value[
  sapply(data, function(x) match(unlist(x), df2$zip) %>% .[!is.na(.)])
  ])

This works for the example provided, but I'm not clear whether there could be multiple matches to df2$zip within a group of df$data$zip's.

CodePudding user response：

We could do this with map

library(purrr)
library(dplyr)
df %>% 
    mutate(value = map_chr(data, ~ inner_join(.x, df2, by = 'zip') %>% 
          pull(value)))

-output

# A tibble: 3 × 4
  county town  data             value
   <dbl> <chr> <list>           <chr>
1      1 A     <tibble [3 × 1]> foo  
2      2 B     <tibble [4 × 1]> bar  
3      3 B     <tibble [1 × 1]> ski

Or another option is regex_inner_join

library(fuzzyjoin)
library(stringr)
library(dplyr)
library(purrr)
df %>%
    mutate(zip = map_chr(data, ~ str_c(.x$zip, collapse="|"))) %>% 
    regex_inner_join(df2 %>%
          select(-town), by = "zip") %>% 
    select(-starts_with('zip'))

-output

# A tibble: 3 × 4
  county town  data             value
   <dbl> <chr> <list>           <chr>
1      1 A     <tibble [3 × 1]> foo  
2      2 B     <tibble [4 × 1]> bar  
3      3 B     <tibble [1 × 1]> ski