Making a nested list with a dataframe in R

So there are 5 columns in my dataframe:

 ID         pos     strand    nucleotide     count
 id1         12                   A            13
 id1         13                   C            25
 id2         24                   G            10
 id2         25                   T            25
 id2         26                   A            10
 id3         10                   C            5

I am trying to make a list or a dictionary like this:

mylist <- [[id1,[[id1,12, ,A,13],[id1,13, ,C,25]]],
           [id2,[[id2,24, ,G,10],[id2,25, ,T,25],[id2,26, ,A,10]]],
           [id3,[[id3,10, ,C,5]]]]

So basically, it is a list of lists, each of which has two elements: the id name and a list of rows, where each row is itself a list.

I have tried this code below:

myl <- list()
for (i in seq_len(nrow(res))) {
   myl[[i]] <- unclass(res[i,])
}

But it only gives me lists of rows, not grouped by the id. I have also tried using df_to_nest:

nestedlist = df_to_nest(data.table(dat), as.vector("seqnames"), count_col = NULL, value_cols = c('pos', 'strand', 'nucleotide', 'count'))

But the result only has names and no elements.

Is there anything else I can try?

Output of str(data):

'data.frame':   7 obs. of  5 variables:
$ seqnames  : Factor w/ 3138 levels "id1","id2",..: 322 322 330 330 330 994 994
$ pos       : int  2805 2806 5066 5067 5068 3348 3349
$ strand    : Factor w/ 3 levels " ","-","*": 2 2 1 1 1 2 1
$ nucleotide: Factor w/ 8 levels "A","C","G","T",..: 1 4 3 3 3 1 1
$ count     : int  1 1 1 1 1 97 101

I have a really large dataframe; here I only chose 7 rows with 3 different id names.

Answer:

Something like this ought to work:

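## Build the example directly: read.table() cannot parse the blank strand column
data <- data.frame(
  ID         = c("id1", "id1", "id2", "id2", "id2", "id3"),
  pos        = c(12L, 13L, 24L, 25L, 26L, 10L),
  strand     = " ",
  nucleotide = c("A", "C", "G", "T", "A", "C"),
  count      = c(13L, 25L, 10L, 25L, 10L, 5L)
)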

l <- split(unname(split(data, seq_len(nrow(data)))), data$ID)
ll <- Map(list, names(l), l)
ll
## $id1
## $id1[[1]]
## [1] "id1"
## 
## $id1[[2]]
## $id1[[2]][[1]]
##    ID pos strand nucleotide count
## 1 id1  12                 A    13
## 
## $id1[[2]][[2]]
##    ID pos strand nucleotide count
## 2 id1  13                 C    25
## 
## 
## 
## $id2
## $id2[[1]]
## [1] "id2"
## 
## $id2[[2]]
## $id2[[2]][[1]]
##    ID pos strand nucleotide count
## 3 id2  24                 G    10
## 
## $id2[[2]][[2]]
##    ID pos strand nucleotide count
## 4 id2  25                 T    25
## 
## $id2[[2]][[3]]
##    ID pos strand nucleotide count
## 5 id2  26                 A    10
## 
## 
## 
## $id3
## $id3[[1]]
## [1] "id3"
## 
## $id3[[2]]
## $id3[[2]][[1]]
##    ID pos strand nucleotide count
## 6 id3  10                 C     5

Here, ll is a named list of 2-element lists of the form:

list(<ID>, <list of 1-row data frames>)

But there is a redundancy, because ll stores each ID twice: once in its names attribute and again as the first element of the corresponding 2-element sublist. In R, it would be more natural to use l, rather than ll, as your "dictionary", because then you could write l$id1 to retrieve the list of id1 rows, rather than ll$id1[[2L]].
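
If you want each row to be a plain list rather than a 1-row data frame, as the question describes, a rough sketch along these lines may work. Some assumptions to flag: rows_as_lists is a hypothetical helper name, not a function from any package, and the unused level "id4" is made up here only to mimic the 3138-level factor shown in the str() output. Passing drop = TRUE to split() is meant to avoid creating an empty group for every unused level, which could otherwise happen with a subset of a large dataframe.

```r
# Toy data; ID is a factor with one unused level ("id4", made up for
# illustration) to imitate a subset of a much larger dataframe.
data <- data.frame(
  ID         = factor(c("id1", "id1", "id2", "id2", "id2", "id3"),
                      levels = c("id1", "id2", "id3", "id4")),
  pos        = c(12L, 13L, 24L, 25L, 26L, 10L),
  strand     = " ",
  nucleotide = c("A", "C", "G", "T", "A", "C"),
  count      = c(13L, 25L, 10L, 25L, 10L, 5L)
)

# Hypothetical helper: turn a data frame into an unnamed list
# containing one plain (named) list per row.
rows_as_lists <- function(d) {
  lapply(seq_len(nrow(d)), function(i) as.list(d[i, , drop = FALSE]))
}

# Group the rows by ID, dropping unused factor levels, then convert
# each group's rows to plain lists.
by_id <- lapply(split(data, data$ID, drop = TRUE), rows_as_lists)

names(by_id)                  # one entry per ID that actually occurs
by_id$id2[[3]]$nucleotide     # a single field of the third id2 row
```

With this layout, by_id$id2 is the list of rows for id2, and each row can be indexed by column name. Without drop = TRUE, split() would also return a zero-row group for "id4".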