I have two datasets, and I need to merge them by the ID value. The problems are:
- The ID value can be repeated across the same dataset (no other unique value is available).
- The two datasets are not equal in the rows number or the column numbers.
Example:
df1
ID | Gender |
---|---|
99 | Male |
85 | Female |
7 | Male |
df2
ID | Body_Temperature | Body_Temperature_date_time |
---|---|---|
99 | 36 | 1/1/2020 12:00 am |
99 | 38 | 2/1/2020 10:30 am |
99 | 37 | 1/1/2020 06:41 am |
52 | 38 | 1/2/2020 11:00 am |
11 | 39 | 4/5/2020 09:09 pm |
7 | 35 | 9/8/2020 02:30 am |
How can I turn these two datasets into one single dataset in a way that allows me to apply some machine learning models on it later on?
CodePudding user response:
Depending on your expected results, if you are wanting to return all rows from each dataframe, then you can use a full_join
from dplyr
:
library(dplyr)
full_join(df2, df1, by = "ID")
Or with base R:
merge(x=df2,y=df1,by="ID",all=TRUE)
Output
ID Body_Temperature Body_Temperature_date_time Gender
1 99 36 1/1/2020 12:00 am Male
2 99 38 2/1/2020 10:30 am Male
3 99 37 1/1/2020 06:41 am Male
4 52 38 1/2/2020 11:00 am <NA>
5 11 39 4/5/2020 09:09 pm <NA>
6 7 35 9/8/2020 02:30 am Male
7 85 NA <NA> Female
If you have more than 2 dataframes to combine, which only overlap with the ID
column, then you can use reduce
on a dataframe list (so put all the dataframes that you want to combine into a list):
library(tidyverse)
df_list <- list(df1, df2)
multi_full <- reduce(df_list, function(x, y, ...)
full_join(x, y, by = "ID", ...))
Or Reduce
with base R:
df_list <- list(df1, df2)
multi_full <- Reduce(function(x, y, ...)
merge(x, y, by = "ID", all = TRUE, ...), df_list)
Data
df1 <- structure(list(ID = c(99L, 85L, 7L), Gender = c("Male", "Female",
"Male")), class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(ID = c(99L, 99L, 99L, 52L, 11L, 7L), Body_Temperature = c(36L,
38L, 37L, 38L, 39L, 35L), Body_Temperature_date_time = c("1/1/2020 12:00 am",
"2/1/2020 10:30 am", "1/1/2020 06:41 am", "1/2/2020 11:00 am",
"4/5/2020 09:09 pm", "9/8/2020 02:30 am")), class = "data.frame", row.names = c(NA,
-6L))