How can I merge different data sets in R knowing that the variable that I use for matching the two d-CodePudding

I have two datasets, and I need to merge them by the ID value. The problems are:

The ID value can be repeated across the same dataset (no other unique value is available).
The two datasets are not equal in the rows number or the column numbers.

Example:

df1

ID	Gender
99	Male
85	Female
7	Male

df2

ID	Body_Temperature	Body_Temperature_date_time
99	36	1/1/2020 12:00 am
99	38	2/1/2020 10:30 am
99	37	1/1/2020 06:41 am
52	38	1/2/2020 11:00 am
11	39	4/5/2020 09:09 pm
7	35	9/8/2020 02:30 am

How can I turn these two datasets into one single dataset in a way that allows me to apply some machine learning models on it later on?

CodePudding user response：

Depending on your expected results, if you are wanting to return all rows from each dataframe, then you can use a full_join from dplyr:

library(dplyr)

full_join(df2, df1, by = "ID")

Or with base R:

merge(x=df2,y=df1,by="ID",all=TRUE)

Output

  ID Body_Temperature Body_Temperature_date_time Gender
1 99               36          1/1/2020 12:00 am   Male
2 99               38          2/1/2020 10:30 am   Male
3 99               37          1/1/2020 06:41 am   Male
4 52               38          1/2/2020 11:00 am   <NA>
5 11               39          4/5/2020 09:09 pm   <NA>
6  7               35          9/8/2020 02:30 am   Male
7 85               NA                       <NA> Female

If you have more than 2 dataframes to combine, which only overlap with the ID column, then you can use reduce on a dataframe list (so put all the dataframes that you want to combine into a list):

library(tidyverse)

df_list <- list(df1, df2)

multi_full <- reduce(df_list, function(x, y, ...)
  full_join(x, y, by = "ID", ...))

Or Reduce with base R:

df_list <- list(df1, df2)

multi_full <- Reduce(function(x, y, ...)
    merge(x, y, by = "ID", all = TRUE, ...), df_list)

Data

df1 <- structure(list(ID = c(99L, 85L, 7L), Gender = c("Male", "Female", 
"Male")), class = "data.frame", row.names = c(NA, -3L))

df2 <- structure(list(ID = c(99L, 99L, 99L, 52L, 11L, 7L), Body_Temperature = c(36L, 
38L, 37L, 38L, 39L, 35L), Body_Temperature_date_time = c("1/1/2020 12:00 am", 
"2/1/2020 10:30 am", "1/1/2020 06:41 am", "1/2/2020 11:00 am", 
"4/5/2020 09:09 pm", "9/8/2020 02:30 am")), class = "data.frame", row.names = c(NA, 
-6L))