I'm looking for a fast yet readable solution to this problem in R. Preferably the solution should use the data.table
package or no additional packages, although I'd like to hear of other options.
I have a data.table with two columns like this one:
dat
    go_id gene_id
 1:     A       a
 2:     A       b
 3:     B       c
 4:     B       d
 5:     B       e
 6:     C       f
 7:     C       g
 8:     C       h
 9:     C       i
10:     C       j
You can reproduce it with:
library(data.table)
dat <- data.table(
go_id=rep(LETTERS[1:3], times=c(2,3,5))
)
dat[, gene_id := letters[1:nrow(dat)]]
I need to convert it to a list where each "key" is a go_id and each "value" is a vector of the genes assigned to that go_id. That is, the output should be this list:
$A
[1] "a" "b"
$B
[1] "c" "d" "e"
$C
[1] "f" "g" "h" "i" "j"
If it matters, genes can be associated with multiple go_id's. The real data has about 280000 rows with 17000 distinct go_id's and 16000 distinct genes.
This is the solution I have so far - is there anything better in terms of speed and/or readability?
dat <- dat[, list(gene_id=list(gene_id)), by=go_id]
go_ids <- dat$go_id
go_list <- list()
for(i in seq_len(nrow(dat))) {
    go <- go_ids[i]
    genes <- dat[i,]$gene_id
    go_list[go] <- genes
}
CodePudding user response:
Without data.table or any other package, this will do the task:
tapply(dat$gene_id, dat$go_id, FUN = c)
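One detail worth checking if the result is passed on to other code: when FUN does not return a single value, tapply() returns a list that also carries a dim attribute (a one-dimensional array of mode list) rather than a plain list. A minimal sketch on the toy data from the question, using plain vectors so that no package is needed (the names res and plain are made up for illustration):

```r
# Toy data from the question, as plain vectors (no packages needed)
go_id   <- rep(LETTERS[1:3], times = c(2, 3, 5))
gene_id <- letters[1:10]

res <- tapply(gene_id, go_id, FUN = c)

is.list(res)   # the result is of mode list ...
dim(res)       # ... but it also carries a dim attribute (a 1-d array)

# split() returns a plain named list with no dim attribute
plain <- split(gene_id, go_id)
is.null(dim(plain))
```

For most uses (printing, extracting with $ or [[ ]]) the difference probably does not matter, but split() may be the safer choice if downstream code expects an ordinary list.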
CodePudding user response:
# dat is the original two-column table from the question
with(dat, split(gene_id, go_id))
Using by:
tmp = dat[, .(.(gene_id)), by = go_id]
setNames(tmp$V1, tmp$go_id)
# $A
# [1] "a" "b"
# $B
# [1] "c" "d" "e"
# $C
# [1] "f" "g" "h" "i" "j"
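To check that the split() and by approaches agree, one can rebuild the example table and compare the two results. This is only a sketch on the toy data from the question; the names by_split, by_group and the column name genes are made up for illustration.

```r
library(data.table)

# Rebuild the example table from the question
dat <- data.table(go_id = rep(LETTERS[1:3], times = c(2, 3, 5)))
dat[, gene_id := letters[seq_len(.N)]]

# Base R: split the gene vector by go_id
by_split <- with(dat, split(gene_id, go_id))

# data.table: collect genes per group into a list column, then name the list
tmp <- dat[, .(genes = .(gene_id)), by = go_id]
by_group <- setNames(tmp$genes, tmp$go_id)

# Compare the two results element by element
all.equal(by_split, by_group)
```

Note that split() orders the list by the sorted go_id values, while by keeps the order in which each go_id first appears, so on unsorted data the two lists may hold the same elements in a different order.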