I'm trying to merge two datasets by the only column with the same name, but the result is a Dataframe with one dataset after the another, without any actual merging on the same row.
This is an example
File1
ID Age
GBI0061M 20
GBI0067M 21
GBI0069M 24
File2
ID Var1
GHU008F 0,55
GBI0067M 2,01
GFB0045F 1,50
I would like a file with only the common row
Filemerged:
ID Age Var1
GBI0067 21 2,01
This is my R script
library(dplyr)
library(plyr)
File1 <- read.csv2("C:/Users/..............csv", sep = ";")
File2 <- read.csv("C:/Users.............csv", sep=";")
m3 <-merge(File1, File2, by.x = "ï..Codice", all.x= TRUE, all.y = TRUE)
or
m3 <- full_join(File1, File2, by.x = "ï..Codice", all.x= TRUE, all.y = TRUE)
I even tried with python .merge with the how="Outer" option with the same result. In excel the conditional formatting options does NOT recognize the same ID (execpt for the "i...Codice" even though they are the exact same string.. What should I do?
CodePudding user response:
You can control which column to use when joining. In the example below, the column to join on is specified explicitly.
Generic observations
- Dplyr would use the column with the same name for the sake of explanation I've provided the column name. In this case this is
c("ID" = "ID")
. You could change that to reflect different columns available in each of the data sets - As your data sets have columns with identical names it is advisable to use meaningful suffixes that will point to the data origin. In this scenario, I'm using the names of the data set. By default, dplyr will use
x
andy
as suffixes. - This column name is unnecessarily problematic
by.x = "ï..Codice"
and likely was created throughmake.names("ï Codice")
. Whereas the string on it's own may be a synthetically correct column name it would be better to rename it to something easier to handle likeid_codice
library("tidyverse")
dtA <- read_table(file = "
ID Var1
GHU008F 0,55
GBI0067M 2,01
GFB0045F 1,50")
dtB <- read_table(file = "
ID Var1
GHU008F 0,55
GBI0067M 2,01
GFB0045F 1,50")
full_join(
x = dtA,
y = dtB,
by = c("ID" = "ID"),
suffix = c("_dtA", "_dtB")
)
#> # A tibble: 3 × 3
#> ID Var1_dtA Var1_dtB
#> <chr> <chr> <chr>
#> 1 GHU008F 0,55 0,55
#> 2 GBI0067M 2,01 2,01
#> 3 GFB0045F 1,50 1,50
Created on 2022-04-12 by the reprex package (v2.0.1)
CodePudding user response:
Konrad gave a more precise answer
But a rather simple way to do this, that I use would be :
m3 <- left_join(File1, File2, .id=ID)
Hope it helps