dataframe 1 :
ID species sites count
411 Androsace halleri A24C 2
785 Bartsia alpina A28B 1
222 Carex cespitosa C97Z 3
125 Cicuta viros D47S 1
dataframe 2 :
ID species YEAR observer sites
411 Androsace halleri 2018 James A24C
785 Bartsia alpina 2019 Nikos P89I
222 Carex cespitosa 2018 Enzo Y54K
125 Cicuta viros 2020 David D47S
wanted dataframe :
ID species YEAR observer sites count sites.y
411 Androsace halleri 2018 James A24C 2
785 Bartsia alpina 2019 Nikos A28B 1 P89I
222 Carex cespitosa 2018 Enzo C97Z 3 Y54K
125 Cicuta viros 2020 David D47S 1
I did this:
wanted_dataset <- dataset_1 %>%
full_join(dataset_2, by="ID")
but I get something like this :
ID species.x sites.x count species.y YEAR observer sites.y
411 Androsace halleri A24C 2 Androsace halleri 2018 James A24C
785 Bartsia alpina A28B 1 Bartsia alpina 2019 Nikos P89I
222 Carex cespitosa C97Z 3 Carex cespitosa 2018 Enzo Y54K
125 Cicuta viros D47S 1 Cicuta viros 2020 David D47S
and I don't want the columns "species" to be repeated if all the rows of the dataframe 1 match with all the rows of the dataframe 2.
I'm aware I can do this:
wanted_dataset <- test1 %>%
full_join(test2, by=c("ID", "species"))
but my actual problem concerns 14 dataframes containing each 30 columns where one third of them share the sames names between the dataframes so I was looking for a way to match and merge automatically columns that have the same names and have the same rows.
(tell me if I'm not clear I would try to better explain my problem)
CodePudding user response:
One way to go, but which can be very time consuming if you have "too many" data.frame is to create a function to check which columns are totally duplicates and merge by them. Then you can put your data.frames in a list and use your function on all data.frames with Reduce
on that list.
mergeColOK <- function(x1, x2){
# get common names
colComm <- intersect(names(x1), names(x2))
# combine all common column names 2 by 2
combCol <- combn(colComm, 2)
# for each pair of column names, check if all data are the same (if same value for first column, then same value for second columns
colOK <- apply(combCol, 2,
function(nomCols)all(x1[x1[[nomCols[1]]] %in% x2[[nomCols[1]]], nomCols]==x2[x2[[nomCols[1]]] %in% x1[[nomCols[1]]], nomCols]))
# get all columns
colToUse <- unique(as.vector(combCol[, colOK]))
# finally, merge your data
merge(x1, x2, by=colToUse, all=TRUE) # I prefer to use base R but `dplyr` join would work the same
}
the function on your 2 data.frames:
mergeColOK(df1, df2)
# ID species sites.x count YEAR observer sites.y
#1 Androsace halleri A24C 2 2018 James A24C
#2 Bartsia alpina A28B 1 2019 Nikos P89I
#3 Carex cespitosa C97Z 3 2018 Enzo Y54K
#4 Cicuta viros D47S 1 2020 David D47S
Another option would be to merge only by ID and then check for complete duplicated columns and remove them (there are posts on SO explaining how to do that: Delete Redundant columns in R ; Identifying duplicate columns in a dataframe)
CodePudding user response:
Use setdiff
:
wanted_dataset <- dataset_1 %>%
full_join(dataset_2, by=setdiff(colnames(.), "sites"))
CodePudding user response:
I assume that the shared column are the same for all the dataframe, you can extract these column names and store them in a variable then do the join
and use this variable for the by
argument.
For example:
dat1 <- data.frame(A = LETTERS[1:5], B = 1:5, C = 5:9)
dat2 <- data.frame(A = LETTERS[2:6], B = 2:6, D = 5:9)
shared <- names(dat1)[names(dat1) %in% names(dat2)]
print(shared)
[1] "A" "B"
#join by A and B
result = inner_join(dat1, dat2, by = shared)
print(result)
A | B | C | D | |
---|---|---|---|---|
1 | B | 2 | 6 | 5 |
2 | C | 3 | 7 | 6 |
3 | D | 4 | 8 | 7 |
4 | E | 5 | 9 | 8 |