R / dplyr: Joining two tables with wide vs. long format on joinable columns-CodePudding

I'm working with some public address data that I want to join together, but I'm not sure on the best approach and how to achieve this due to the different wider vs. longer formatting on the columns to join on.

My first table contains all addresses in the country; postal code adress number lead to unique combinations. Each address also is related to a specific neighborhood and area per county. This table does not contain any other info.

My second table contains relevant information regarding each neighborhood, area and county such as the number of residence, inhabitant ages, energy consumption, etc. The idea is simply to merge this information with the full addresses list so that I can view these statistics for each address in the country.

What gives me a headache is the different formatting of both tables.

The first table is formatted as follows such that every combination of address postal code is unique (but different addresses can be in the same county, area or neighborhood):

adresses <- data.frame("postal_code" = c("1000A", "1010A", "1000B", "1100B", "1500C", "2700C"),  
                 "adress_nr" = c(1, 2, 3, 15, 1, 35), 
                 "neighborhood" = c("A1", "A2", "B1", "B1", "C5", "C7"),
                 "area" = c("AA1", "AA2", "BB2", "BB1", "CC1", "CC3"),
                 "county" = c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC")
                   )

The second table has a long format in which one column contains all unique values for BOTH the neighborhood and area (per the overarching county):

neighborhood_area_data <- data.frame(
"county" = c("AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "BBB", "CCC", "CCC", "CCC"), 
"neighborhood_and_area" = c("NEIGH_A1", "AREA_AA1", "AREA_AA2", "NEIGH_A2", "AREA_BB2", "AREA_BB1", "NEIGH_B1", "NEIGH_C5", "NEIGH_C7", "AREA_CC3", "AREA_CC7"),
"type" = c("Neighborhood", "Area", "Area", "Neighborhood",  "Area", "Area", "Neighborhood", "Neighborhood", "Neighboordhood", "Area", "Area"),
"Number_of_Residents" = c(10, 50, 40, 30, 100, 70, 80, 60, 70, 70, 20),
"Average_Age" = c(55, 44, 33, 22, 66, 77, 55, 88, 99, 44, 11))

So for each overarching county you would have the data for all its existing areas and neighborhoods. ID's of which are stored into a single column, hence the long format. The "NEIGH_" and "AREA_" parts of the string identify whether it's a neighborhood or area and I remove those from the string to be able to join them).

In my example the data of interest would be the Number_of_Residents and Average_Age columns that I want to join onto the individual addresses table.

What I'm looking for is a solid approach/way to combine these tables (preferably via dplyr).

My initial approach was to take the second table and separate the neighborhood_and_area into separate columns (neighboorhood and area) whilst removing the identifier (e.g., "NEIGH_AA1"-> "AA1"part of the string). However because there is no summarizing/pivoting, the second table retains its original format and won't join properly. I am not sure what is the best/most elegant way to reconcile these two formats.

Hopefully my question and examples are clear! Thank you!

CodePudding user response：

Assuming you want both the neighborhood and area data stored:

library(tidyverse)

area_data <- 
    neighborhood_area_data %>% 
    separate(neighborhood_and_area, into = c(NA, 'code'), sep = '_') %>% 
    filter(grepl('Area', type)) %>% 
    rename(Area_Number_of_Residents = Number_of_Residents, 
           Area_Average_Age = Average_Age) %>%
    select(-type)

neighborhood_data <- 
    neighborhood_area_data %>% 
    separate(neighborhood_and_area, into = c(NA, 'code'), sep = '_') %>% 
    filter(!grepl('Area', type)) %>% 
    rename(Neighborhood_Number_of_Residents = Number_of_Residents, 
           Neighborhood_Average_Age = Average_Age) %>%
    select(-type)

You can then join on each of the split datasets:

adresses %>% 
    left_join(area_data, 
              by = c('county', 'area' = 'code')) %>% 
    left_join(neighborhood_data, 
              by = c('county', 'neighborhood' = 'code'))

Output:

  postal_code adress_nr neighborhood area county Area_Number_of_Residents Area_Average_Age Neighborhood_Number_of_Residents Neighborhood_Average_Age
1       1000A         1           A1  AA1    AAA                       50               44                               10                       55
2       1010A         2           A2  AA2    AAA                       40               33                               30                       22
3       1000B         3           B1  BB2    BBB                      100               66                               80                       55
4       1100B        15           B1  BB1    BBB                       70               77                               80                       55
5       1500C         1           C5  CC1    CCC                       NA               NA                               NA                       NA
6       2700C        35           C7  CC3    CCC                       70               44                               70                       99