Efficiently convert two-column data.table to list

I'm looking for a fast yet readable solution to this problem in R. Preferably the solution should use the data.table package or no additional packages, although I'd like to hear of other options.

I have a data.table with two columns like this one:

dat
    go_id gene_id
 1:     A       a
 2:     A       b
 3:     B       c
 4:     B       d
 5:     B       e
 6:     C       f
 7:     C       g
 8:     C       h
 9:     C       i
10:     C       j

You can reproduce it with:

library(data.table)

dat <- data.table(
    go_id=rep(LETTERS[1:3], times=c(2,3,5))
)
dat[, gene_id := letters[1:nrow(dat)]]

I need to convert it to a list where each "key" is a go_id and each "value" is the vector of genes assigned to that go_id. That is, the output should be this list:

$A
[1] "a" "b"

$B
[1] "c" "d" "e"

$C
[1] "f" "g" "h" "i" "j"

If it matters, a gene can be associated with multiple go_ids. The real data has about 280000 rows with 17000 distinct go_ids and 16000 distinct genes.

This is the solution I have so far - is there anything better in terms of speed and/or readability?

dat <- dat[, list(gene_id=list(gene_id)), by=go_id]
go_ids <- dat$go_id
go_list <- list()
for(i in 1:nrow(dat)) {
    go <- go_ids[i]
    genes <- dat[i,]$gene_id
    go_list[go] <- genes
}

CodePudding user response:

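Without data.table or any other package, tapply will do the task. Note that the result is a one-dimensional array of mode list, named by go_id, rather than a plain list: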

tapply(dat$gene_id, dat$go_id, FUN = c)

CodePudding user response:

Using split from base R (DT here is the original two-column dat from the question):

with(DT, split(gene_id, go_id))

Using data.table's by to collect the genes into a list column, then naming it:

tmp = DT[, .(.(gene_id)), by = go_id]
setNames(tmp$V1, tmp$go_id)

# $A
# [1] "a" "b"

# $B
# [1] "c" "d" "e"

# $C
# [1] "f" "g" "h" "i" "j"
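
Since the question asks about speed, here is a minimal sketch for timing the split and by approaches on your own machine. The simulated table `sim` and its ID formats are assumptions made up for illustration; they only roughly match the sizes mentioned in the question, and timings will vary, so none are quoted here.

```r
library(data.table)

set.seed(1)
# Hypothetical data: about 280000 rows, up to 17000 go_ids, up to 16000 genes
n <- 280000
sim <- data.table(
    go_id   = sample(sprintf("GO%05d", 1:17000), n, replace = TRUE),
    gene_id = sample(sprintf("gene%05d", 1:16000), n, replace = TRUE)
)

# Base R: split the gene vector by go_id
t_split <- system.time(
    res_split <- split(sim$gene_id, sim$go_id)
)

# data.table: one list-column entry per go_id, then name the list
t_by <- system.time({
    tmp <- sim[, list(genes = list(gene_id)), by = go_id]
    res_by <- setNames(tmp$genes, tmp$go_id)
})

# split() returns groups in sorted go_id order, while by= keeps the order
# of first appearance, so reorder before comparing the two results
stopifnot(identical(res_split, res_by[names(res_split)]))

# Elapsed seconds for each approach
print(c(split = t_split[["elapsed"]], by = t_by[["elapsed"]]))
```

One difference to keep in mind: the two results contain the same vectors but may list the go_ids in a different order, which only matters if later code relies on position rather than on names.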