I have a several dataframes containing different information about the same individuals which I would like to merge.
N = 1000
district <- rnorm(N)
region <- rnorm(N)
household_id <- rnorm(N)
works <- rbinom(n = N, size = 1, prob = 0.05)
total_income <- rnorm(N)
total_expenditure <- rnorm(N)
df1 <- data.frame(district, region, household_id, works)
df2 <- data.frame(district, region, household_id, total_income)
df2 <- data.frame(district, region, household_id, total_expenditure)
Is there a way I can use the group_indices
function, or any other way to generate unique ID number for each observation in order to match individuals in df1
with those in df2
and df3
?
CodePudding user response:
To match the same rows with the same distric
, region
, and household_id
, just make another variable, id
for example, with:
df1$id <- paste0(df1$district, df1$region, df$household_id)
for each df and then merge them by this new id.
CodePudding user response:
I modified slightly your example: I guess your identifiers are the combination of district, region, and household_id. I give these variable categorical values:
district <- sample(letters,N,replace = T)
region <- sample(letters,N,replace = T)
household_id <- sample(letters,N,replace = T)
The easiest way: you define the table of all unique identifiers:
df <- unique(do.call(rbind,
list(df1[,c("district", "region", "household_id")],
df2[,c("district", "region", "household_id")],
df3[,c("district", "region", "household_id")]))
)
You can then create a unique ID:
df_corresp <- df %>%
mutate(ID = as.numeric(as.factor(paste0(district,region,household_id))))
That you can merge back in your original data frames:
df1 <- left_join(df1,df_corresp,by = c("district", "region", "household_id"))
df2 <- left_join(df2,df_corresp,by = c("district", "region", "household_id"))
df3 <- left_join(df3,df_corresp,by = c("district", "region", "household_id"))