Combining/ merging information from dataframes -- very specific output format wanted-CodePudding

I want to do a very specific type of merge (working with taxonomy data from microbe sequencing). For example, if I have two dataframes, each from a different sample (sample 1 and 2):

df1 <- data.frame(lineage = c("1;131567", "1;131567;2;1224", "28216;32003;32011"),
                    taxonomy = c("bacteria", "archaea", "virus"),
                    count = c(15, 34, 12))
print(df1)
            lineage taxonomy count
1          1;131567 bacteria    15
2   1;131567;2;1224  archaea    34
3 28216;32003;32011    virus    12

df2 <- data.frame(lineage = c("204457;41297;165696", "1;131567;2;1224", "28216;32003;32011", "1;131567"),
                  taxonomy = c("fungi", "archaea", "virus", "bacteria"),
                  count = c(5, 34, 12, 11))
print(df2)
              lineage taxonomy count
1 204457;41297;165696    fungi     5
2     1;131567;2;1224  archaea    34
3   28216;32003;32011    virus    12
4            1;131567 bacteria    11

The resulting output of merging the two dataframes I want to have one column named "lineage" that contains all the different unique lineages from both samples. I want the taxonomy column to be retained (this uniquely corresponds to lineage). And I want counts to be displayed after, so df1 (Sample 1) will have its own column for "count", and if any missing values from introducing lineages found in Sample 2 are shown as zero. "count" from df2 will be a column after the count column from df1.

For example, the final output should look like:

output <- data.frame(lineage = c("1;131567", "1;131567;2;1224", "28216;32003;32011", "204457;41297;165696"),
                  taxonomy = c("bacteria", "archaea", "virus", "fungi"),
                  count_df1 = c(15, 34, 12, NA),
                  count_df2 = c(11, 34, 12, 5))
print(output)
              lineage taxonomy count_df1 count_df2
1            1;131567 bacteria        15        11
2     1;131567;2;1224  archaea        34        34
3   28216;32003;32011    virus        12        12
4 204457;41297;165696    fungi        NA         5

Each dataframe of the actual samples has >5000 rows, and I plan to merge 5 dataframes into one sequentially using the merge option. Any help would be appreciated!!

I have looked into merge() as well as the dplyr join() functions, but cannot make sense of how to retain the information in the way I want. Willing to try other non-merge options as well (especially considering I am planning to merge 6 dataframes, not just 2 as in the example).

CodePudding user response：

The type of join you are looking for is a full_join:

library(dplyr)

df1 |> 
  full_join(df2, by = c("lineage", "taxonomy"), suffix = c("_df1", "_df2"))  
#>               lineage taxonomy count_df1 count_df2
#> 1            1;131567 bacteria        15        11
#> 2     1;131567;2;1224  archaea        34        34
#> 3   28216;32003;32011    virus        12        12
#> 4 204457;41297;165696    fungi        NA         5

Or using merge:

merge(df1, df2, by = c("lineage", "taxonomy"), all = TRUE)
#>               lineage taxonomy count.x count.y
#> 1            1;131567 bacteria      15      11
#> 2     1;131567;2;1224  archaea      34      34
#> 3 204457;41297;165696    fungi      NA       5
#> 4   28216;32003;32011    virus      12      12