Explain joining please?-CodePudding

I need some help understanding the concept of joining.

I understand how to mentally model how a join works if you have 2 data files that have a common variable. Like:

Animal	Weight	Age
Dog	12	5
Cat	4	19
Fish	2	4
Mouse	1	2

Animal	Award
Dog	1st
Cat	1st
Fish	3rd
Mouse	5th

These can be joined because the animal column is exactly the same and it just adds on another variable to the same observations of animals.

But I don't understand it when its something like this:

Mortality Rate (Heart Attack)

Year	Place	Death Rate (Heart Attack)
2011	Paris	200
2011	Paris	94
2011	Rome	23
2009	London	15

Mortality Rate (Car Crash)

Year	Place	Death Rate (Car Crash)
2011	London	987
2012	London	34
2012	Paris	09
2007	Melbourne	12

The variable TYPES are the same (years, cities and death rates). But the year values aren't the same, they arent in the same order, there arent the same number of 2011's for example, the locations are different, and there are obviously two different death rates that need to be two different columns, but how does this join work? Which variable would you join by? How would it be configured once joined? Would it just result in lots of NA values if this was across a larger data set?

I understand there are different types of joins that do different things, but I'm just struggling to understand how the years and cities would sit if you were wanting to be able to compare the two different death rates in cities and years.

Thank you!

CodePudding user response：

If you do

merge(heart, car, all=TRUE)
#   Year     Place Death_Rate_heart Death_Rate_Car
# 1 2007 Melbourne               NA             12
# 2 2009    London               15             NA
# 3 2011    London               NA            987
# 4 2011     Paris              200             NA
# 5 2011     Paris               94             NA
# 6 2011      Rome               23             NA
# 7 2012    London               NA             34
# 8 2012     Paris               NA              9

merge automatically looks for matching names and merges on them. It's looking for pairs in those columns, so they won't be mixed. More verbosely you could do

merge(heart, car, all=TRUE, by.x=c("Year", "Place"), by.y=c("Year", "Place"))

which is actually what happens in this case.

Data:

heart <- structure(list(Year = c(2011L, 2011L, 2011L, 2009L), Place = c("Paris", 
"Paris", "Rome", "London"), Death_Rate_heart = c(200L, 94L, 23L, 
15L)), class = "data.frame", row.names = c(NA, -4L))


car <- structure(list(Year = c(2011L, 2012L, 2012L, 2007L), Place = c("London", 
"London", "Paris", "Melbourne"), Death_Rate_Car = c(987L, 34L, 
9L, 12L)), class = "data.frame", row.names = c(NA, -4L))