Home > Enterprise >  How can I generate unique ID numbers for each observation to match across panel data
How can I generate unique ID numbers for each observation to match across panel data

Time:07-01

I have a several dataframes containing different information about the same individuals which I would like to merge.

N = 1000
district <- rnorm(N)
region <- rnorm(N)
household_id <- rnorm(N) 
works <- rbinom(n = N, size = 1, prob = 0.05)
total_income <- rnorm(N) 
total_expenditure <- rnorm(N) 

df1 <- data.frame(district, region, household_id, works)
df2 <- data.frame(district, region, household_id, total_income)
df2 <- data.frame(district, region, household_id, total_expenditure)

Is there a way I can use the group_indices function, or any other way to generate unique ID number for each observation in order to match individuals in df1 with those in df2 and df3?

CodePudding user response:

To match the same rows with the same distric, region, and household_id, just make another variable, id for example, with:

df1$id <- paste0(df1$district, df1$region, df$household_id)

for each df and then merge them by this new id.

CodePudding user response:

I modified slightly your example: I guess your identifiers are the combination of district, region, and household_id. I give these variable categorical values:

district <- sample(letters,N,replace = T)
region <- sample(letters,N,replace = T)
household_id <- sample(letters,N,replace = T)

The easiest way: you define the table of all unique identifiers:

df <- unique(do.call(rbind,
               list(df1[,c("district", "region", "household_id")],
                    df2[,c("district", "region", "household_id")],
                    df3[,c("district", "region", "household_id")]))
       )

You can then create a unique ID:

df_corresp <- df %>% 
  mutate(ID = as.numeric(as.factor(paste0(district,region,household_id))))

That you can merge back in your original data frames:

df1 <- left_join(df1,df_corresp,by = c("district", "region", "household_id"))
df2 <- left_join(df2,df_corresp,by = c("district", "region", "household_id"))
df3 <- left_join(df3,df_corresp,by = c("district", "region", "household_id"))

  • Related