Home > Blockchain >  Finding sum of all possible column combinations without repetition
Finding sum of all possible column combinations without repetition

Time:05-26

I have a large table that is structured like this:

DT <- data.table(Sample = c(LETTERS[1:3]),
                ColA = c(1:3),
                ColB = c(3:1),
                ColC = 1,
                ColD = 2)

> DT
   Sample ColA ColB ColC ColD
1:      A    1    3    1    2
2:      B    2    2    1    2
3:      C    3    1    1    2

I wish to find all possible non-repeating combinations of 3 columns and then find the sum of the values in those columns for each sample.

The desired output would look like this.

    Sample FirstCol SecondCol ThirdCol SumOfCols
 1:      A     ColA      ColB     ColC         5
 2:      A     ColA      ColB     ColD         6
 3:      A     ColA      ColC     ColD         4
 4:      A     ColB      ColC     ColD         6
 5:      B     ColA      ColB     ColC         5
 6:      B     ColA      ColB     ColD         6
 7:      B     ColA      ColC     ColD         5
 8:      B     ColB      ColC     ColD         5
 9:      C     ColA      ColB     ColC         5
10:      C     ColA      ColB     ColD         6
11:      C     ColA      ColC     ColD         6
12:      C     ColB      ColC     ColD         4

Currently, I am using combn() to find all combinations of the columns, and then I attach the values to it by using merge() three times (once for each of the columns in the combination).


Combinations <- combn(colnames(DT[,!1]), 3)
Combinations <- as.data.table(t(Combinations))

lapply(DT$Sample, function(i){
  Melted <- as.data.table(melt(DT[Sample == i], id.vars = "Sample"))[,!1]
  Output <- merge(Combinations, Melted, by.x="V1", by.y="variable")
  Output <- merge(Output, Melted, by.x="V2", by.y="variable")
  Output <- merge(Output, Melted, by.x="V3", by.y="variable")
  colnames(Output) <- c("ThirdCol", "SecondCol", "FirstCol", "FirstValue", "SecondValue","ThirdValue")
  Output$SumOfCols <- Output$FirstValue   Output$SecondValue   Output$ThirdValue
  Output$Sample <- i
  Output <- Output[,c(8,3,2,1,7)] 
})

While this achieves what I want, it is quite slow since my real dataset is rather large. I was wondering if anyone has a tip on how I could possibly speed this up.

Any help would be appreciated, Thank you!

CodePudding user response:

DT %>%
  group_by(Sample) %>%
  summarise(s = combn(cur_data(), 3,  \(x)c(nms = names(x), Sum = rowSums(x)), 
                      simplify = FALSE),    .groups = 'drop') %>%
  unnest_wider(s) %>%
  type.convert(as.is = TRUE)

# A tibble: 12 x 5
   Sample nms1  nms2  nms3    Sum
   <chr>  <chr> <chr> <chr> <int>
 1 A      ColA  ColB  ColC      5
 2 A      ColA  ColB  ColD      6
 3 A      ColA  ColC  ColD      4
 4 A      ColB  ColC  ColD      6
 5 B      ColA  ColB  ColC      5
 6 B      ColA  ColB  ColD      6
 7 B      ColA  ColC  ColD      5
 8 B      ColB  ColC  ColD      5
 9 C      ColA  ColB  ColC      5
10 C      ColA  ColB  ColD      6
11 C      ColA  ColC  ColD      6
12 C      ColB  ColC  ColD      4
  • Related