Home > front end >  R: How to join dataframes with a key that has duplicate values, and COPY the second frame's val
R: How to join dataframes with a key that has duplicate values, and COPY the second frame's val

Time:04-17

I'm trying to join two dataframes. One is home sale data and the other is distance from a railroad. I am trying to join the data using a parcel ID as a key. The parcel ID in the sales data will have multiple values, because the parcels have sold several times. I want the data to join so that the distance values will repeat for each instance of the parcel ID, something like this.

Dataframe A:

ID Sale Date Sale Price
123 1/1/2020 320000
123 2/1/2021 429000
124 3/1/2019 190000
124 2/13/2020 280000
124 1/1/2022 419000
125 2/1/2021 300000
125 1/1/2022 390000
126 2/1/2021 310000

Dataframe B:

ID Distance
123 1290
124 1809
125 370
126 976

I want the joined dataframe to look like this:

ID Sale Date Sale Price Distance
123 1/1/2020 320000 1290
123 2/1/2021 429000 1290
124 3/1/2019 190000 1809
124 2/13/2020 280000 1809
124 1/1/2022 419000 1809
125 2/1/2021 300000 370
125 1/1/2022 390000 370
126 2/1/2021 310000 976

The distance values are copied for each parcel ID. When I use left_join I'm getting a new dataframe with more rows than either of the original dataframes (which is one problem) with a lot of NA values for distance (which is another problem). I don't understand why my new dataframe is larger than the largest of the original dataframes, and I don't know how to make it so that distance is repeated across each instance of the parcel ID and not just copied once and then NA thereafter.

When I searched for a solution for this problem, I mostly found methods of dropping duplicate rows, which I don't want. I want the rows in dataframe B to be copied for each instance of the ID in dataframe A.

CodePudding user response:

You can use left_join by your ID from the dplyr package. You can use the following code:

df_A <- data.frame(ID = c(123, 123, 124, 124, 124, 125, 125, 126),
                   Sale_Date = c("1/1/2020", "2/1/2021", "3/1/2019", "2/13/2020", "1/1/2022", "2/1/2021", "1/1/2021", "2/1/2021"),
                   Sale_Price = c(320000, 429000, 190000, 280000, 419000, 300000, 390000, 310000))

df_B <- data.frame(ID = c(123, 124, 125, 126),
                   Distance = c(1290, 1809, 370, 976))

library(dplyr)
df_joined <- left_join(df_A, df_B, by = "ID")
df_joined

Output:

   ID Sale_Date Sale_Price Distance
1 123  1/1/2020     320000     1290
2 123  2/1/2021     429000     1290
3 124  3/1/2019     190000     1809
4 124 2/13/2020     280000     1809
5 124  1/1/2022     419000     1809
6 125  2/1/2021     300000      370
7 125  1/1/2021     390000      370
8 126  2/1/2021     310000      976
  • Related