Home > Net >  How to correctly use dplyr for merging dataset in R
How to correctly use dplyr for merging dataset in R

Time:04-12

I'm trying to merge two datasets by the only column with the same name, but the result is a Dataframe with one dataset after the another, without any actual merging on the same row.

This is an example

File1
ID         Age 
GBI0061M    20
GBI0067M    21
GBI0069M    24

File2 
ID          Var1
GHU008F     0,55
GBI0067M    2,01
GFB0045F    1,50


I would like a file with only the common row

Filemerged: 
ID         Age       Var1
GBI0067    21         2,01

This is my R script

library(dplyr)
library(plyr)
File1 <- read.csv2("C:/Users/..............csv", sep = ";")


File2 <- read.csv("C:/Users.............csv", sep=";")


m3 <-merge(File1, File2, by.x = "ï..Codice", all.x= TRUE, all.y = TRUE)

or


m3 <- full_join(File1, File2, by.x = "ï..Codice", all.x= TRUE, all.y = TRUE)

I even tried with python .merge with the how="Outer" option with the same result. In excel the conditional formatting options does NOT recognize the same ID (execpt for the "i...Codice" even though they are the exact same string.. What should I do?

CodePudding user response:

You can control which column to use when joining. In the example below, the column to join on is specified explicitly.

Generic observations

  • Dplyr would use the column with the same name for the sake of explanation I've provided the column name. In this case this is c("ID" = "ID"). You could change that to reflect different columns available in each of the data sets
  • As your data sets have columns with identical names it is advisable to use meaningful suffixes that will point to the data origin. In this scenario, I'm using the names of the data set. By default, dplyr will use x and y as suffixes.
  • This column name is unnecessarily problematic by.x = "ï..Codice" and likely was created through make.names("ï Codice"). Whereas the string on it's own may be a synthetically correct column name it would be better to rename it to something easier to handle like id_codice
library("tidyverse")
dtA <- read_table(file = "
ID          Var1
GHU008F     0,55
GBI0067M    2,01
GFB0045F    1,50")
dtB <- read_table(file = "
ID          Var1
GHU008F     0,55
GBI0067M    2,01
GFB0045F    1,50")

full_join(
  x = dtA,
  y = dtB,
  by = c("ID" = "ID"),
  suffix = c("_dtA", "_dtB")
)
#> # A tibble: 3 × 3
#>   ID       Var1_dtA Var1_dtB
#>   <chr>    <chr>    <chr>   
#> 1 GHU008F  0,55     0,55    
#> 2 GBI0067M 2,01     2,01    
#> 3 GFB0045F 1,50     1,50

Created on 2022-04-12 by the reprex package (v2.0.1)

CodePudding user response:

Konrad gave a more precise answer

But a rather simple way to do this, that I use would be :

m3 <- left_join(File1, File2, .id=ID)

Hope it helps

  • Related