Home > database >  Join datasets with different treatments
Join datasets with different treatments

Time:09-06

I wante to do a relatively simple thing, to join two datasets together. However, I have the problem that they include two different treatments and missing samples, which posses problems with the methods I tried.

Here the sample datasets:

data1 <- data.frame(Values = 1:5, Geno = c("a1", "a2","a3", "a4", "a5"), Treat = c("C", "C","C", "C", "C"))

data2 <- data.frame(Values = 2:4, Geno = c("a1", "a2", "a3"), Treat = c("DA", "DA","DA"))

I could just do rbind which gives this:

Values Geno Treat
1 a1 C
2 a2 C
3 a3 C
4 a4 C
5 a5 C
2 a1 DA
3 a2 DA
4 a3 DA

But ideally I want this:

Values Geno Treat
1 a1 C
2 a2 C
3 a3 C
2 a1 DA
3 a2 DA
4 a3 DA

-> so remove the missing "Geno" samples a4 a5 which are not in data2 compared to data1

Even better would be:

Values Geno Treat
1 a1 C
2 a2 C
3 a3 C
4 a4 C
5 a5 C
2 a1 DA
3 a2 DA
4 a3 DA
NA a4 DA
NA a5 DA

-> So put NA for the Value for the missing "Geno" a4 a5 in this Treatment "DA"

My Solution would be to use setdiff(data1$Geno, data2$Geno)to find out non overlapping samples and remove them, but I have the feeling this could be easier. Is there any faster method?

CodePudding user response:

Assuming the missing Genos are always in data2, you can do this:

data1 %>% 
  bind_rows(
    data2 %>% 
      complete(
        data1 %>% 
          anti_join(
            data2, 
            by="Geno"
          ) %>% 
          select(Geno), 
        fill=list(Treat="DA")
      )
  )
   Values Geno Treat
1       1   a1     C
2       2   a2     C
3       3   a3     C
4       4   a4     C
5       5   a5     C
6      NA   a4    DA
7      NA   a5    DA
8       2   a1    DA
9       3   a2    DA
10      4   a3    DA

Does that improve your original idea? I'm not sure there's going to be a one line solution as you're attempting to do three things simmultaneously: identify mismatches, impute a treatment for the mis matches and bind the data frames.

CodePudding user response:

You could row-bind data1 and data2 and then use complete it with missing combinations of Treat and Geno.

library(dplyr)
library(tidyr)

bind_rows(data1, data2) %>%
  complete(Treat, Geno)


# A tibble: 10 × 3
   Geno  Treat Values
   <chr> <chr>  <int>
 1 a1    C          1
 2 a2    C          2
 3 a3    C          3
 4 a4    C          4
 5 a5    C          5
 6 a1    DA         2
 7 a2    DA         3
 8 a3    DA         4
 9 a4    DA        NA
10 a5    DA        NA
  • Related