Home > other >  merge/join multiple dataframe based on column ID
merge/join multiple dataframe based on column ID

Time:10-21

dataframe 1 :

    ID          species          sites   count
    411     Androsace halleri    A24C     2
    785     Bartsia alpina       A28B     1
    222     Carex cespitosa      C97Z     3
    125     Cicuta viros         D47S     1

dataframe 2 :

    ID          species          YEAR    observer   sites
    411     Androsace halleri    2018     James     A24C     
    785     Bartsia alpina       2019     Nikos     P89I
    222     Carex cespitosa      2018     Enzo      Y54K     
    125     Cicuta viros        2020     David     D47S     

wanted dataframe :

    ID          species          YEAR    observer   sites  count  sites.y
    411     Androsace halleri    2018     James     A24C     2 
    785     Bartsia alpina       2019     Nikos     A28B     1     P89I           
    222     Carex cespitosa      2018     Enzo      C97Z     3     Y54K           
    125     Cicuta viros         2020     David     D47S     1 

I did this:

wanted_dataset <- dataset_1 %>% 
  full_join(dataset_2, by="ID") 

but I get something like this :

ID         species.x       sites.x  count   species.y           YEAR    observer    sites.y
411     Androsace halleri   A24C      2      Androsace halleri  2018    James        A24C
785     Bartsia alpina      A28B      1      Bartsia alpina     2019    Nikos        P89I
222     Carex cespitosa     C97Z      3      Carex cespitosa    2018    Enzo         Y54K
125     Cicuta viros        D47S      1      Cicuta viros       2020    David        D47S

and I don't want the columns "species" to be repeated if all the rows of the dataframe 1 match with all the rows of the dataframe 2.

I'm aware I can do this:

 wanted_dataset <- test1 %>% 
    full_join(test2, by=c("ID", "species"))

but my actual problem concerns 14 dataframes containing each 30 columns where one third of them share the sames names between the dataframes so I was looking for a way to match and merge automatically columns that have the same names and have the same rows.

(tell me if I'm not clear I would try to better explain my problem)

CodePudding user response:

One way to go, but which can be very time consuming if you have "too many" data.frame is to create a function to check which columns are totally duplicates and merge by them. Then you can put your data.frames in a list and use your function on all data.frames with Reduce on that list.

mergeColOK <- function(x1, x2){
    # get common names
    colComm <- intersect(names(x1), names(x2))
    # combine all common column names 2 by 2
    combCol <- combn(colComm, 2)
    # for each pair of column names, check if all data are the same (if same value for first column, then same value for second columns
    colOK <- apply(combCol, 2, 
                     function(nomCols)all(x1[x1[[nomCols[1]]] %in% x2[[nomCols[1]]], nomCols]==x2[x2[[nomCols[1]]] %in% x1[[nomCols[1]]], nomCols]))
    # get all columns
      colToUse <- unique(as.vector(combCol[, colOK]))
    # finally, merge your data
      merge(x1, x2, by=colToUse, all=TRUE) # I prefer to use base R but `dplyr` join would work the same
}

the function on your 2 data.frames:

mergeColOK(df1, df2)
#         ID   species sites.x count YEAR observer sites.y
#1 Androsace   halleri    A24C     2 2018    James    A24C
#2   Bartsia    alpina    A28B     1 2019    Nikos    P89I
#3     Carex cespitosa    C97Z     3 2018     Enzo    Y54K
#4    Cicuta     viros    D47S     1 2020    David    D47S

Another option would be to merge only by ID and then check for complete duplicated columns and remove them (there are posts on SO explaining how to do that: Delete Redundant columns in R ; Identifying duplicate columns in a dataframe)

CodePudding user response:

Use setdiff:

wanted_dataset <- dataset_1 %>% 
  full_join(dataset_2, by=setdiff(colnames(.), "sites")) 

CodePudding user response:

I assume that the shared column are the same for all the dataframe, you can extract these column names and store them in a variable then do the join and use this variable for the by argument.

For example:

dat1 <- data.frame(A = LETTERS[1:5], B = 1:5, C = 5:9)                
dat2 <- data.frame(A = LETTERS[2:6], B = 2:6, D = 5:9)                

shared <- names(dat1)[names(dat1) %in% names(dat2)]
print(shared)
[1] "A" "B"
#join by A and B
result = inner_join(dat1, dat2, by = shared)

print(result)
A B C D
1 B 2 6 5
2 C 3 7 6
3 D 4 8 7
4 E 5 9 8
  • Related