R: How to join dataframes with a key that has duplicate values, and COPY the second frame's val-CodePudding

I'm trying to join two dataframes. One is home sale data and the other is distance from a railroad. I am trying to join the data using a parcel ID as a key. The parcel ID in the sales data will have multiple values, because the parcels have sold several times. I want the data to join so that the distance values will repeat for each instance of the parcel ID, something like this.

Dataframe A:

ID	Sale Date	Sale Price
123	1/1/2020	320000
123	2/1/2021	429000
124	3/1/2019	190000
124	2/13/2020	280000
124	1/1/2022	419000
125	2/1/2021	300000
125	1/1/2022	390000
126	2/1/2021	310000

Dataframe B:

ID	Distance
123	1290
124	1809
125	370
126	976

I want the joined dataframe to look like this:

ID	Sale Date	Sale Price	Distance
123	1/1/2020	320000	1290
123	2/1/2021	429000	1290
124	3/1/2019	190000	1809
124	2/13/2020	280000	1809
124	1/1/2022	419000	1809
125	2/1/2021	300000	370
125	1/1/2022	390000	370
126	2/1/2021	310000	976

The distance values are copied for each parcel ID. When I use left_join I'm getting a new dataframe with more rows than either of the original dataframes (which is one problem) with a lot of NA values for distance (which is another problem). I don't understand why my new dataframe is larger than the largest of the original dataframes, and I don't know how to make it so that distance is repeated across each instance of the parcel ID and not just copied once and then NA thereafter.

When I searched for a solution for this problem, I mostly found methods of dropping duplicate rows, which I don't want. I want the rows in dataframe B to be copied for each instance of the ID in dataframe A.

CodePudding user response：

You can use left_join by your ID from the dplyr package. You can use the following code:

df_A <- data.frame(ID = c(123, 123, 124, 124, 124, 125, 125, 126),
                   Sale_Date = c("1/1/2020", "2/1/2021", "3/1/2019", "2/13/2020", "1/1/2022", "2/1/2021", "1/1/2021", "2/1/2021"),
                   Sale_Price = c(320000, 429000, 190000, 280000, 419000, 300000, 390000, 310000))

df_B <- data.frame(ID = c(123, 124, 125, 126),
                   Distance = c(1290, 1809, 370, 976))

library(dplyr)
df_joined <- left_join(df_A, df_B, by = "ID")
df_joined

Output:

   ID Sale_Date Sale_Price Distance
1 123  1/1/2020     320000     1290
2 123  2/1/2021     429000     1290
3 124  3/1/2019     190000     1809
4 124 2/13/2020     280000     1809
5 124  1/1/2022     419000     1809
6 125  2/1/2021     300000      370
7 125  1/1/2021     390000      370
8 126  2/1/2021     310000      976