Home > other >  Explain joining please?
Explain joining please?

Time:04-11

I need some help understanding the concept of joining.

I understand how to mentally model how a join works if you have 2 data files that have a common variable. Like:

Animal Weight Age
Dog 12 5
Cat 4 19
Fish 2 4
Mouse 1 2
Animal Award
Dog 1st
Cat 1st
Fish 3rd
Mouse 5th

These can be joined because the animal column is exactly the same and it just adds on another variable to the same observations of animals.

But I don't understand it when its something like this:

Mortality Rate (Heart Attack)

Year Place Death Rate (Heart Attack)
2011 Paris 200
2011 Paris 94
2011 Rome 23
2009 London 15

Mortality Rate (Car Crash)

Year Place Death Rate (Car Crash)
2011 London 987
2012 London 34
2012 Paris 09
2007 Melbourne 12

The variable TYPES are the same (years, cities and death rates). But the year values aren't the same, they arent in the same order, there arent the same number of 2011's for example, the locations are different, and there are obviously two different death rates that need to be two different columns, but how does this join work? Which variable would you join by? How would it be configured once joined? Would it just result in lots of NA values if this was across a larger data set?

I understand there are different types of joins that do different things, but I'm just struggling to understand how the years and cities would sit if you were wanting to be able to compare the two different death rates in cities and years.

Thank you!

CodePudding user response:

If you do

merge(heart, car, all=TRUE)
#   Year     Place Death_Rate_heart Death_Rate_Car
# 1 2007 Melbourne               NA             12
# 2 2009    London               15             NA
# 3 2011    London               NA            987
# 4 2011     Paris              200             NA
# 5 2011     Paris               94             NA
# 6 2011      Rome               23             NA
# 7 2012    London               NA             34
# 8 2012     Paris               NA              9

merge automatically looks for matching names and merges on them. It's looking for pairs in those columns, so they won't be mixed. More verbosely you could do

merge(heart, car, all=TRUE, by.x=c("Year", "Place"), by.y=c("Year", "Place"))

which is actually what happens in this case.


Data:

heart <- structure(list(Year = c(2011L, 2011L, 2011L, 2009L), Place = c("Paris", 
"Paris", "Rome", "London"), Death_Rate_heart = c(200L, 94L, 23L, 
15L)), class = "data.frame", row.names = c(NA, -4L))


car <- structure(list(Year = c(2011L, 2012L, 2012L, 2007L), Place = c("London", 
"London", "Paris", "Melbourne"), Death_Rate_Car = c(987L, 34L, 
9L, 12L)), class = "data.frame", row.names = c(NA, -4L))
  • Related