I want to do a very specific type of merge (working with taxonomy data from microbe sequencing). For example, if I have two dataframes, each from a different sample (sample 1 and 2):
df1 <- data.frame(lineage = c("1;131567", "1;131567;2;1224", "28216;32003;32011"),
taxonomy = c("bacteria", "archaea", "virus"),
count = c(15, 34, 12))
print(df1)
lineage taxonomy count
1 1;131567 bacteria 15
2 1;131567;2;1224 archaea 34
3 28216;32003;32011 virus 12
df2 <- data.frame(lineage = c("204457;41297;165696", "1;131567;2;1224", "28216;32003;32011", "1;131567"),
taxonomy = c("fungi", "archaea", "virus", "bacteria"),
count = c(5, 34, 12, 11))
print(df2)
lineage taxonomy count
1 204457;41297;165696 fungi 5
2 1;131567;2;1224 archaea 34
3 28216;32003;32011 virus 12
4 1;131567 bacteria 11
The resulting output of merging the two dataframes I want to have one column named "lineage" that contains all the different unique lineages from both samples. I want the taxonomy column to be retained (this uniquely corresponds to lineage). And I want counts to be displayed after, so df1 (Sample 1) will have its own column for "count", and if any missing values from introducing lineages found in Sample 2 are shown as zero. "count" from df2 will be a column after the count column from df1.
For example, the final output should look like:
output <- data.frame(lineage = c("1;131567", "1;131567;2;1224", "28216;32003;32011", "204457;41297;165696"),
taxonomy = c("bacteria", "archaea", "virus", "fungi"),
count_df1 = c(15, 34, 12, NA),
count_df2 = c(11, 34, 12, 5))
print(output)
lineage taxonomy count_df1 count_df2
1 1;131567 bacteria 15 11
2 1;131567;2;1224 archaea 34 34
3 28216;32003;32011 virus 12 12
4 204457;41297;165696 fungi NA 5
Each dataframe of the actual samples has >5000 rows, and I plan to merge 5 dataframes into one sequentially using the merge option. Any help would be appreciated!!
I have looked into merge() as well as the dplyr join() functions, but cannot make sense of how to retain the information in the way I want. Willing to try other non-merge options as well (especially considering I am planning to merge 6 dataframes, not just 2 as in the example).
CodePudding user response:
The type of join you are looking for is a full_join
:
library(dplyr)
df1 |>
full_join(df2, by = c("lineage", "taxonomy"), suffix = c("_df1", "_df2"))
#> lineage taxonomy count_df1 count_df2
#> 1 1;131567 bacteria 15 11
#> 2 1;131567;2;1224 archaea 34 34
#> 3 28216;32003;32011 virus 12 12
#> 4 204457;41297;165696 fungi NA 5
Or using merge
:
merge(df1, df2, by = c("lineage", "taxonomy"), all = TRUE)
#> lineage taxonomy count.x count.y
#> 1 1;131567 bacteria 15 11
#> 2 1;131567;2;1224 archaea 34 34
#> 3 204457;41297;165696 fungi NA 5
#> 4 28216;32003;32011 virus 12 12