I'm trying to join two dataframes. One is home sale data and the other is distance from a railroad. I am trying to join the data using a parcel ID as a key. The parcel ID in the sales data will have multiple values, because the parcels have sold several times. I want the data to join so that the distance values will repeat for each instance of the parcel ID, something like this.
Dataframe A:
ID | Sale Date | Sale Price |
---|---|---|
123 | 1/1/2020 | 320000 |
123 | 2/1/2021 | 429000 |
124 | 3/1/2019 | 190000 |
124 | 2/13/2020 | 280000 |
124 | 1/1/2022 | 419000 |
125 | 2/1/2021 | 300000 |
125 | 1/1/2022 | 390000 |
126 | 2/1/2021 | 310000 |
Dataframe B:
ID | Distance |
---|---|
123 | 1290 |
124 | 1809 |
125 | 370 |
126 | 976 |
I want the joined dataframe to look like this:
ID | Sale Date | Sale Price | Distance |
---|---|---|---|
123 | 1/1/2020 | 320000 | 1290 |
123 | 2/1/2021 | 429000 | 1290 |
124 | 3/1/2019 | 190000 | 1809 |
124 | 2/13/2020 | 280000 | 1809 |
124 | 1/1/2022 | 419000 | 1809 |
125 | 2/1/2021 | 300000 | 370 |
125 | 1/1/2022 | 390000 | 370 |
126 | 2/1/2021 | 310000 | 976 |
The distance values are copied for each parcel ID. When I use left_join I'm getting a new dataframe with more rows than either of the original dataframes (which is one problem) with a lot of NA values for distance (which is another problem). I don't understand why my new dataframe is larger than the largest of the original dataframes, and I don't know how to make it so that distance is repeated across each instance of the parcel ID and not just copied once and then NA thereafter.
When I searched for a solution for this problem, I mostly found methods of dropping duplicate rows, which I don't want. I want the rows in dataframe B to be copied for each instance of the ID in dataframe A.
CodePudding user response:
You can use left_join
by your ID
from the dplyr
package. You can use the following code:
df_A <- data.frame(ID = c(123, 123, 124, 124, 124, 125, 125, 126),
Sale_Date = c("1/1/2020", "2/1/2021", "3/1/2019", "2/13/2020", "1/1/2022", "2/1/2021", "1/1/2021", "2/1/2021"),
Sale_Price = c(320000, 429000, 190000, 280000, 419000, 300000, 390000, 310000))
df_B <- data.frame(ID = c(123, 124, 125, 126),
Distance = c(1290, 1809, 370, 976))
library(dplyr)
df_joined <- left_join(df_A, df_B, by = "ID")
df_joined
Output:
ID Sale_Date Sale_Price Distance
1 123 1/1/2020 320000 1290
2 123 2/1/2021 429000 1290
3 124 3/1/2019 190000 1809
4 124 2/13/2020 280000 1809
5 124 1/1/2022 419000 1809
6 125 2/1/2021 300000 370
7 125 1/1/2021 390000 370
8 126 2/1/2021 310000 976