How to merge two datasets in R?-CodePudding

I am currently working to merge two datasets in R. The first is a cross-national longitudinal dataset of democracy scores and inequality levels for countries over hundreds of years (15,034 observations, dat_as). The second is a cross-national longitudinal dataset of whether a given country in a given year has a legislature (27,192 observations, dat_vdem). I want to attach the legislatures data to the inequality data. The goal is to have a final df with the same number of observations (15,034). If there is a match, merge the data. If there is not a match, just insert an NA for the row. Every approach I have tried in R does not work. For example, using this code I get a df with 2,558,975 observations.

# load data
dat_as <- read.csv("as.csv")
dat_vdem <- read.csv("vdem.csv")

# merge 
test_df <- merge(dat_as, dat_vdem, by = c("code"))

Using this code, however, I get a df with 13,355 observations.

test_df <- merge(dat_as, dat_vdem, by = c("country", "year"))

What am I doing wrong? Any help would be appreciated. Below are reproducible data.

Here is the dat_as:

structure(list(X = 1:6, country = c("United States", "United States", 
"United States", "United States", "United States", "United States"
), year = 1800:1805, scode = c("USA", "USA", "USA", "USA", "USA", 
"USA"), code = c("USA", "USA", "USA", "USA", "USA", "USA"), democracy = c(1L, 
1L, 1L, 1L, 1L, 1L), lagdemocracy = c(NA, 1L, 1L, 1L, 1L, 1L), 
    lbmginiint = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), lbmgdppint = c(NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_), ldemlbmginiint = c(NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_), ldemlbmgdppint = c(NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), yearsq = c(3240000, 
    3243601, 3247204, 3250809, 3254416, 3258025), legislature = c(NA, 
    NA, NA, NA, NA, NA)), row.names = c(NA, 6L), class = "data.frame")

Here is the dat_vdem:

structure(list(X = 1:6, year = 1800:1805, country = c("United States", "United States", "United States", "United States", "United States", "United States"), code = c("USA", 
"USA", "USA", "USA", "USA", "USA"), v2lgbicam = c(0L, 0L, 0L, 
0L, 0L, 0L), v2lgqstexp = c(NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_), v2lgotovst = c(-2.1, -2.1, -2.1, -2.1, -2.1, 
-2.1), v2lginvstp = c(-2.05, -2.05, -2.05, -2.05, -2.05, -2.05
), legislature = c(0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA, 
6L), class = "data.frame")

CodePudding user response：

You're describing a left join. The way I find easier is to use dplyr.

dplyr::left_join(dat_as, dat_vdem).

By default it will try and guess which key variables to match by. With the sample data you provided, it matched by "X", "country", "year", "code", "legislature". But you can specify them if need be.