I'm working with some public address data that I want to join together, but I'm not sure on the best approach and how to achieve this due to the different wider vs. longer formatting on the columns to join on.
My first table contains all addresses in the country; postal code adress number lead to unique combinations. Each address also is related to a specific neighborhood and area per county. This table does not contain any other info.
My second table contains relevant information regarding each neighborhood, area and county such as the number of residence, inhabitant ages, energy consumption, etc. The idea is simply to merge this information with the full addresses list so that I can view these statistics for each address in the country.
What gives me a headache is the different formatting of both tables.
The first table is formatted as follows such that every combination of address postal code is unique (but different addresses can be in the same county, area or neighborhood):
adresses <- data.frame("postal_code" = c("1000A", "1010A", "1000B", "1100B", "1500C", "2700C"),
"adress_nr" = c(1, 2, 3, 15, 1, 35),
"neighborhood" = c("A1", "A2", "B1", "B1", "C5", "C7"),
"area" = c("AA1", "AA2", "BB2", "BB1", "CC1", "CC3"),
"county" = c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC")
)
The second table has a long format in which one column contains all unique values for BOTH the neighborhood and area (per the overarching county):
neighborhood_area_data <- data.frame(
"county" = c("AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "BBB", "CCC", "CCC", "CCC"),
"neighborhood_and_area" = c("NEIGH_A1", "AREA_AA1", "AREA_AA2", "NEIGH_A2", "AREA_BB2", "AREA_BB1", "NEIGH_B1", "NEIGH_C5", "NEIGH_C7", "AREA_CC3", "AREA_CC7"),
"type" = c("Neighborhood", "Area", "Area", "Neighborhood", "Area", "Area", "Neighborhood", "Neighborhood", "Neighboordhood", "Area", "Area"),
"Number_of_Residents" = c(10, 50, 40, 30, 100, 70, 80, 60, 70, 70, 20),
"Average_Age" = c(55, 44, 33, 22, 66, 77, 55, 88, 99, 44, 11))
So for each overarching county you would have the data for all its existing areas and neighborhoods. ID's of which are stored into a single column, hence the long format. The "NEIGH_" and "AREA_" parts of the string identify whether it's a neighborhood or area and I remove those from the string to be able to join them).
In my example the data of interest would be the Number_of_Residents and Average_Age columns that I want to join onto the individual addresses table.
What I'm looking for is a solid approach/way to combine these tables (preferably via dplyr).
My initial approach was to take the second table and separate the neighborhood_and_area into separate columns (neighboorhood and area) whilst removing the identifier (e.g., "NEIGH_AA1"-> "AA1"part of the string). However because there is no summarizing/pivoting, the second table retains its original format and won't join properly. I am not sure what is the best/most elegant way to reconcile these two formats.
Hopefully my question and examples are clear! Thank you!
CodePudding user response:
Assuming you want both the neighborhood and area data stored:
library(tidyverse)
area_data <-
neighborhood_area_data %>%
separate(neighborhood_and_area, into = c(NA, 'code'), sep = '_') %>%
filter(grepl('Area', type)) %>%
rename(Area_Number_of_Residents = Number_of_Residents,
Area_Average_Age = Average_Age) %>%
select(-type)
neighborhood_data <-
neighborhood_area_data %>%
separate(neighborhood_and_area, into = c(NA, 'code'), sep = '_') %>%
filter(!grepl('Area', type)) %>%
rename(Neighborhood_Number_of_Residents = Number_of_Residents,
Neighborhood_Average_Age = Average_Age) %>%
select(-type)
You can then join on each of the split datasets:
adresses %>%
left_join(area_data,
by = c('county', 'area' = 'code')) %>%
left_join(neighborhood_data,
by = c('county', 'neighborhood' = 'code'))
Output:
postal_code adress_nr neighborhood area county Area_Number_of_Residents Area_Average_Age Neighborhood_Number_of_Residents Neighborhood_Average_Age
1 1000A 1 A1 AA1 AAA 50 44 10 55
2 1010A 2 A2 AA2 AAA 40 33 30 22
3 1000B 3 B1 BB2 BBB 100 66 80 55
4 1100B 15 B1 BB1 BBB 70 77 80 55
5 1500C 1 C5 CC1 CCC NA NA NA NA
6 2700C 35 C7 CC3 CCC 70 44 70 99