I have a large table that is structured like this:
DT <- data.table(Sample = c(LETTERS[1:3]),
ColA = c(1:3),
ColB = c(3:1),
ColC = 1,
ColD = 2)
> DT
Sample ColA ColB ColC ColD
1: A 1 3 1 2
2: B 2 2 1 2
3: C 3 1 1 2
I wish to find all possible non-repeating combinations of 3 columns and then find the sum of the values in those columns for each sample.
The desired output would look like this.
Sample FirstCol SecondCol ThirdCol SumOfCols
1: A ColA ColB ColC 5
2: A ColA ColB ColD 6
3: A ColA ColC ColD 4
4: A ColB ColC ColD 6
5: B ColA ColB ColC 5
6: B ColA ColB ColD 6
7: B ColA ColC ColD 5
8: B ColB ColC ColD 5
9: C ColA ColB ColC 5
10: C ColA ColB ColD 6
11: C ColA ColC ColD 6
12: C ColB ColC ColD 4
Currently, I am using combn() to find all combinations of the columns, and then I attach the values to it by using merge() three times (once for each of the columns in the combination).
Combinations <- combn(colnames(DT[,!1]), 3)
Combinations <- as.data.table(t(Combinations))
lapply(DT$Sample, function(i){
Melted <- as.data.table(melt(DT[Sample == i], id.vars = "Sample"))[,!1]
Output <- merge(Combinations, Melted, by.x="V1", by.y="variable")
Output <- merge(Output, Melted, by.x="V2", by.y="variable")
Output <- merge(Output, Melted, by.x="V3", by.y="variable")
colnames(Output) <- c("ThirdCol", "SecondCol", "FirstCol", "FirstValue", "SecondValue","ThirdValue")
Output$SumOfCols <- Output$FirstValue Output$SecondValue Output$ThirdValue
Output$Sample <- i
Output <- Output[,c(8,3,2,1,7)]
})
While this achieves what I want, it is quite slow since my real dataset is rather large. I was wondering if anyone has a tip on how I could possibly speed this up.
Any help would be appreciated, Thank you!
CodePudding user response:
DT %>%
group_by(Sample) %>%
summarise(s = combn(cur_data(), 3, \(x)c(nms = names(x), Sum = rowSums(x)),
simplify = FALSE), .groups = 'drop') %>%
unnest_wider(s) %>%
type.convert(as.is = TRUE)
# A tibble: 12 x 5
Sample nms1 nms2 nms3 Sum
<chr> <chr> <chr> <chr> <int>
1 A ColA ColB ColC 5
2 A ColA ColB ColD 6
3 A ColA ColC ColD 4
4 A ColB ColC ColD 6
5 B ColA ColB ColC 5
6 B ColA ColB ColD 6
7 B ColA ColC ColD 5
8 B ColB ColC ColD 5
9 C ColA ColB ColC 5
10 C ColA ColB ColD 6
11 C ColA ColC ColD 6
12 C ColB ColC ColD 4