Counting unique combinations of values in 1 column and 2 additional columns without duplicates-CodePudding

I have a large df of genomic data for many different tumor samples. One column "mutation" reports the specific variant of a protein detected in the sample. Two additional columns "allele1" and "allele2" report the HLA-type associated with each sample (each sample will have two values for HLA because it there are two copies of this in the genome). I would like to generate a count of samples with unique combinations of "mutation" and "allele1" OR "allele2" without counting duplicates (i.e. if a sample contained "mutation" mut1, allele1" a2 and "allele2" a2, it should be counted only once).

    df <- data.frame(mutation = c("mut1", "mut1"), allele1 = c("a1", "a2"), allele2 = c("a2", "a2"))

mutation allele1 allele2
mut1     a1      a2
mut1     a2      a2

I know I can use ddply in the following way:

qualities <- c("mutation", "allele1")
countedCombos <- ddply(df, qualities, nrow)

But how can I add a third column ("allele2") to my qualities parameter that is joined in an OR fashion to "allele1"? Running two separate analyses with "mutation" and "allele1" then "allele2" and then summing the counts doesn't work because for samples that have the same value for "allele1" and "allele2", they will be double counted.

Hope this is clear, tried to make it as generalizable as possible.

Thanks in advance!

My expected output for the sample data would be

df_count <- data.frame(mutation = c("mut1", "mut1"), allele = c("a1", "a2"), count = c(1, 2))

mutation allele count
mut1     a1     1
mut1     a2     2

CodePudding user response：

Thank you for editing your question to include the desired output - that makes more sense - here is one potential solution:

library(tidyverse)

df <- data.frame(mutation = c("mut1", "mut1"),
                 allele1 = c("a1", "a2"),
                 allele2 = c("a2", "a2"))

df_count <- data.frame(mutation = c("mut1", "mut1"),
                       allele = c("a1", "a2"),
                       count = c(1, 2))

df_count
#>   mutation allele count
#> 1     mut1     a1     1
#> 2     mut1     a2     2

df %>%
  pivot_longer(-mutation, values_to = "allele") %>%
  distinct() %>%
  group_by(mutation, allele) %>%
  tally(name = "count")
#> # A tibble: 2 × 3
#> # Groups:   mutation [1]
#>   mutation allele count
#>   <chr>    <chr>  <int>
#> 1 mut1     a1         1
#> 2 mut1     a2         2

^{Created on 2022-09-13 by the reprex package (v2.0.1)}

CodePudding user response：

I think Jared did most of the work, but this small alteration generates the output shown in the question:

library(tidyverse)

df %>% 
  pivot_longer(-mutation) %>% 
  distinct() %>% 
  count(mutation, value)

Result:

# A tibble: 2 × 3
  mutation value     n
  <chr>    <chr> <int>
1 mut1     a1        1
2 mut1     a2        2