I have two datasets, one for migration inflow to county A from other counties and other for migration outflow from county A to other counties. In order to combine the two data sets as:
County State Key Inflow Outflow Year
The common columns between the two datasets are Origin_Place
, Origin_StateName
and Year
in migration inflow
and Dest_place
, Dest_StateName
and Year
in migration outflow
.
Part of the problem is that the common columns have unequal number of rows. Another problem is that different counties can belong to the same state as shown in the dummy data below. So, what I am thinking is to have a unique key (Key) for both destination and origin county by concatenating
FIPS
(unique for each county) and Year
. That way I can combine the counties with their respective states and the rest of the associated columns values in a single row.
How can I combine the two into one dataset in such a way that I don't have to hardcode each and every common county and state name and FIPS and Year? Missing values would be NA
.
My original migration outflow data has 517 observations and migration inflow has 441, thus different number of counties in each dataset.
Desired output:
County State Key Inflow Outflow Year
A FL 12019 111 223 2019
A FL 12019 8888 224 2019
A FL 12019 NA 2333 2019
A FL 12019 NA 4444 2019
A FL 12019 NA 5555 2019
A FL 12019 NA 6666 2019
A FL 12019 NA 7777 2019
A FL 12020 9999 NA 2020
B BB 22019 223 NA 2019
C CC 32019 224 NA 2019
D FL 42019 2333 111 2019
E FL 52019 4444 8888 2019
F FL 62019 5555 9999 2020
G GG 72019 6666 NA 2019
H HH 82019 7777 NA 2019
Dummy data:
# People moving out of county A to other counties
Origin_Place = c("A", "A", "A", "A", "A", "A", "A")
FIPS_Origin_County = c(1, 1, 1, 1, 1, 1, 1)
Origin_StateName = c("FL", "FL", "FL", "FL", "FL", "FL", "FL")
Individuals = c(223, 224, 2333, 4444, 5555, 6666, 7777)
Dest_place = c("B", "C", "D", "E", "F", "G", "H")
FIPS_Dest_County = c(2, 3, 4, 5, 6, 7, 8)
Dest_StateName = c("BB", "CC", "FL", "FL", "FL", "GG", "HH")
Year = c(2019, 2019, 2019, 2019, 2020, 2020, 2020)
Outflow_df = data.frame(Origin_County_Name, FIPS_Origin_County, Origin_StateName, Individuals, Dest_place, FIPS_Dest_County, Dest_StateName, Year)
# People moving in county A from other counties
Origin_Place = c("D", "E", "F")
FIPS_Origin_County = c(5, 6, 7)
Origin_StateName = c("FL", "FL", "FL")
Individuals = c(111, 8888, 9999)
Dest_place = c("A", "A", "A")
FIPS_Dest_County = c(1, 1, 1)
Dest_StateName = c("FL", "FL", "FL")
Year = c(2019, 2019, 2020)
Inflow_df = data.frame(Origin_Place, FIPS_Origin_County, Origin_StateName, Individuals, Dest_County_Name, FIPS_Dest_County, Dest_StateName, Year)
CodePudding user response:
Perhaps this helps
library(dplyr)
library(tidyr)
library(stringr)
library(data.table)
bind_rows(lst(Inflow_df, Outflow_df), .id = 'datname') %>%
pivot_longer(cols = contains("_"), names_to = ".value",
names_pattern = ".*_([^_] $)") %>%
mutate(Key = str_c(County, Year), rn = rowid(Key, datname)) %>%
pivot_wider(names_from = datname, values_from = Individuals) %>%
arrange(rn) %>%
select(-rn)