I need some help understanding the concept of joining.
I understand how to mentally model how a join works if you have 2 data files that have a common variable. Like:
Animal | Weight | Age |
---|---|---|
Dog | 12 | 5 |
Cat | 4 | 19 |
Fish | 2 | 4 |
Mouse | 1 | 2 |
Animal | Award |
---|---|
Dog | 1st |
Cat | 1st |
Fish | 3rd |
Mouse | 5th |
These can be joined because the animal column is exactly the same and it just adds on another variable to the same observations of animals.
But I don't understand it when its something like this:
Mortality Rate (Heart Attack)
Year | Place | Death Rate (Heart Attack) |
---|---|---|
2011 | Paris | 200 |
2011 | Paris | 94 |
2011 | Rome | 23 |
2009 | London | 15 |
Mortality Rate (Car Crash)
Year | Place | Death Rate (Car Crash) |
---|---|---|
2011 | London | 987 |
2012 | London | 34 |
2012 | Paris | 09 |
2007 | Melbourne | 12 |
The variable TYPES are the same (years, cities and death rates). But the year values aren't the same, they arent in the same order, there arent the same number of 2011's for example, the locations are different, and there are obviously two different death rates that need to be two different columns, but how does this join work? Which variable would you join by? How would it be configured once joined? Would it just result in lots of NA values if this was across a larger data set?
I understand there are different types of joins that do different things, but I'm just struggling to understand how the years and cities would sit if you were wanting to be able to compare the two different death rates in cities and years.
Thank you!
CodePudding user response:
If you do
merge(heart, car, all=TRUE)
# Year Place Death_Rate_heart Death_Rate_Car
# 1 2007 Melbourne NA 12
# 2 2009 London 15 NA
# 3 2011 London NA 987
# 4 2011 Paris 200 NA
# 5 2011 Paris 94 NA
# 6 2011 Rome 23 NA
# 7 2012 London NA 34
# 8 2012 Paris NA 9
merge
automatically looks for matching names
and merges on them. It's looking for pairs in those columns, so they won't be mixed. More verbosely you could do
merge(heart, car, all=TRUE, by.x=c("Year", "Place"), by.y=c("Year", "Place"))
which is actually what happens in this case.
Data:
heart <- structure(list(Year = c(2011L, 2011L, 2011L, 2009L), Place = c("Paris",
"Paris", "Rome", "London"), Death_Rate_heart = c(200L, 94L, 23L,
15L)), class = "data.frame", row.names = c(NA, -4L))
car <- structure(list(Year = c(2011L, 2012L, 2012L, 2007L), Place = c("London",
"London", "Paris", "Melbourne"), Death_Rate_Car = c(987L, 34L,
9L, 12L)), class = "data.frame", row.names = c(NA, -4L))