I wante to do a relatively simple thing, to join two datasets together. However, I have the problem that they include two different treatments and missing samples, which posses problems with the methods I tried.
Here the sample datasets:
data1 <- data.frame(Values = 1:5, Geno = c("a1", "a2","a3", "a4", "a5"), Treat = c("C", "C","C", "C", "C"))
data2 <- data.frame(Values = 2:4, Geno = c("a1", "a2", "a3"), Treat = c("DA", "DA","DA"))
I could just do rbind which gives this:
Values | Geno | Treat |
---|---|---|
1 | a1 | C |
2 | a2 | C |
3 | a3 | C |
4 | a4 | C |
5 | a5 | C |
2 | a1 | DA |
3 | a2 | DA |
4 | a3 | DA |
But ideally I want this:
Values | Geno | Treat |
---|---|---|
1 | a1 | C |
2 | a2 | C |
3 | a3 | C |
2 | a1 | DA |
3 | a2 | DA |
4 | a3 | DA |
-> so remove the missing "Geno" samples a4 a5 which are not in data2 compared to data1
Even better would be:
Values | Geno | Treat |
---|---|---|
1 | a1 | C |
2 | a2 | C |
3 | a3 | C |
4 | a4 | C |
5 | a5 | C |
2 | a1 | DA |
3 | a2 | DA |
4 | a3 | DA |
NA | a4 | DA |
NA | a5 | DA |
-> So put NA for the Value for the missing "Geno" a4 a5 in this Treatment "DA"
My Solution would be to use setdiff(data1$Geno, data2$Geno)
to find out non overlapping samples and remove them, but I have the feeling this could be easier.
Is there any faster method?
CodePudding user response:
Assuming the missing Geno
s are always in data2
, you can do this:
data1 %>%
bind_rows(
data2 %>%
complete(
data1 %>%
anti_join(
data2,
by="Geno"
) %>%
select(Geno),
fill=list(Treat="DA")
)
)
Values Geno Treat
1 1 a1 C
2 2 a2 C
3 3 a3 C
4 4 a4 C
5 5 a5 C
6 NA a4 DA
7 NA a5 DA
8 2 a1 DA
9 3 a2 DA
10 4 a3 DA
Does that improve your original idea? I'm not sure there's going to be a one line solution as you're attempting to do three things simmultaneously: identify mismatches, impute a treatment for the mis matches and bind the data frames.
CodePudding user response:
You could row-bind data1
and data2
and then use complete it with missing combinations of Treat
and Geno
.
library(dplyr)
library(tidyr)
bind_rows(data1, data2) %>%
complete(Treat, Geno)
# A tibble: 10 × 3
Geno Treat Values
<chr> <chr> <int>
1 a1 C 1
2 a2 C 2
3 a3 C 3
4 a4 C 4
5 a5 C 5
6 a1 DA 2
7 a2 DA 3
8 a3 DA 4
9 a4 DA NA
10 a5 DA NA