Consider this example nested dataframe of 3 counties, 3 towns, and a range of zip codes associated with them. Two of the towns share the same name (B
), but are in different counties:
df <- tibble(
county = c(1,1,1,2,2,2,2,3),
town = c("A","A","A","B","B","B","B","B"),
zip = c(12864,12865,12866,89501,89502,89503,89504,76512)) %>%
nest(data=c(zip))
I have another dataframe that contains town names, a zipcode, and a placeholder value, but is missing the county field:
df2 <- tibble(
town = c("A", "B", "B"),
zip = c(12866, 89504, 76512),
value = c("foo", "bar", "ski"))
My real data has hundreds of instances of these duplicated town names, and I need to join these two tables together so that each town gets the correct placeholder value based on the zip code (not the town name, which has duplicates). However, dplyr
only seems to join on equality. As such, I'm stuck - what I'm after is something like inner_join(df, df2, by = c(df2$zip %in% df$data$zip))
, but that obviously doesn't work.
I'm also aware of data.table
being able to handle inequality in joins, but this always seems to relate to greater than/less than conditionals. How can I successfully join these tables to return the following output, in situations where I have more than 3 neatly matched rows between the dataframes?
county town data value
<dbl> <chr> <list> <chr>
1 1 A <tibble [3 x 1]> foo
2 2 B <tibble [4 x 1]> bar
3 3 B <tibble [1 x 1]> ski
CodePudding user response:
I think you'll have to "roll your own join":
df %>% mutate(value = df2$value[
sapply(data, function(x) match(unlist(x), df2$zip) %>% .[!is.na(.)])
])
This works for the example provided, but I'm not clear whether there could be multiple matches to df2$zip within a group of df$data$zip's.
CodePudding user response:
We could do this with map
library(purrr)
library(dplyr)
df %>%
mutate(value = map_chr(data, ~ inner_join(.x, df2, by = 'zip') %>%
pull(value)))
-output
# A tibble: 3 × 4
county town data value
<dbl> <chr> <list> <chr>
1 1 A <tibble [3 × 1]> foo
2 2 B <tibble [4 × 1]> bar
3 3 B <tibble [1 × 1]> ski
Or another option is regex_inner_join
library(fuzzyjoin)
library(stringr)
library(dplyr)
library(purrr)
df %>%
mutate(zip = map_chr(data, ~ str_c(.x$zip, collapse="|"))) %>%
regex_inner_join(df2 %>%
select(-town), by = "zip") %>%
select(-starts_with('zip'))
-output
# A tibble: 3 × 4
county town data value
<dbl> <chr> <list> <chr>
1 1 A <tibble [3 × 1]> foo
2 2 B <tibble [4 × 1]> bar
3 3 B <tibble [1 × 1]> ski