So there are 5 columns in my dataframe:
ID pos strand nucleotide count
id1 12 A 13
id1 13 C 25
id2 24 G 10
id2 25 T 25
id2 26 A 10
id3 10 C 5
I am trying to make a list or a dictionary like this:
mylist <- [[id1,[[id1,12, ,A,13],[id1,13, ,C,25]]],
[id2,[[id2,24, ,G,10],[id2,25, ,T,25],[id2,26, ,A,10]]],
[id3,[[id3,10, ,C,5]]]]
So basically, it is a list of lists: each inner list has two elements, the id name and a list of that id's rows, and each row should itself be a list.
I have tried this code below:
myl = list()
for (i in seq(nrow(res))) {
myl[[i]] <- unclass(res[i,])
}
But it only gives me lists of rows, not grouped by the id. I have also tried using df_to_nest:
nestedlist = df_to_nest(data.table(dat), as.vector("seqnames"), count_col = NULL, value_cols = c('pos', 'strand', 'nucleotide', 'count'))
But the result only has names and no elements.
Is there anything else I can try?
Output of str(data):
'data.frame': 7 obs. of 5 variables:
$ seqnames : Factor w/ 3138 levels "id1","id2",..: 322 322 330 330 330 994 994
$ pos : int 2805 2806 5066 5067 5068 3348 3349
$ strand : Factor w/ 3 levels " ","-","*": 2 2 1 1 1 2 1
$ nucleotide: Factor w/ 8 levels "A","C","G","T",..: 1 4 3 3 3 1 1
$ count : int 1 1 1 1 1 97 101
I have a really large dataframe; here I have only shown 7 rows with 3 different id names.
Answer:
Something like this ought to work:
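## Sample rows from the question. The strand column is blank there,
## so it is assumed here to be a single space.
data <- data.frame(
  ID = c("id1", "id1", "id2", "id2", "id2", "id3"),
  pos = c(12L, 13L, 24L, 25L, 26L, 10L),
  strand = " ",
  nucleotide = c("A", "C", "G", "T", "A", "C"),
  count = c(13L, 25L, 10L, 25L, 10L, 5L)
)
## One 1-row data frame per row, grouped by ID; then pair each ID with its rows.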
l <- split(unname(split(data, seq_len(nrow(data)))), data$ID)
ll <- Map(list, names(l), l)
ll
## $id1
## $id1[[1]]
## [1] "id1"
##
## $id1[[2]]
## $id1[[2]][[1]]
## ID pos strand nucleotide count
## 1 id1 12 A 13
##
## $id1[[2]][[2]]
## ID pos strand nucleotide count
## 2 id1 13 C 25
##
##
##
## $id2
## $id2[[1]]
## [1] "id2"
##
## $id2[[2]]
## $id2[[2]][[1]]
## ID pos strand nucleotide count
## 3 id2 24 G 10
##
## $id2[[2]][[2]]
## ID pos strand nucleotide count
## 4 id2 25 T 25
##
## $id2[[2]][[3]]
## ID pos strand nucleotide count
## 5 id2 26 A 10
##
##
##
## $id3
## $id3[[1]]
## [1] "id3"
##
## $id3[[2]]
## $id3[[2]][[1]]
## ID pos strand nucleotide count
## 6 id3 10 C 5
Here, ll is a named list of 2-element lists of the form:
list(<ID>, <list of 1-row data frames>)
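To make that shape concrete, here is a minimal sketch that rebuilds ll from the sample rows and pulls individual pieces out of it. The data.frame() call just retypes the example data, with the blank strand assumed to be a single space; the variable names id_name, rows, n_rows and third_pos are made up for illustration.

```r
# Sample rows retyped from the question; the blank strand is assumed to be " ".
data <- data.frame(
  ID = c("id1", "id1", "id2", "id2", "id2", "id3"),
  pos = c(12L, 13L, 24L, 25L, 26L, 10L),
  strand = " ",
  nucleotide = c("A", "C", "G", "T", "A", "C"),
  count = c(13L, 25L, 10L, 25L, 10L, 5L)
)

# Same construction as in the answer.
l <- split(unname(split(data, seq_len(nrow(data)))), data$ID)
ll <- Map(list, names(l), l)

id_name   <- ll$id2[[1]]   # the ID, stored as the first element
rows      <- ll$id2[[2]]   # the list of 1-row data frames for id2
n_rows    <- length(rows)  # how many rows id2 has
third_pos <- rows[[3]]$pos # pos value of the third id2 row
```

Each lookup goes through two levels of [[ ]]: first into the 2-element sublist, then into the list of rows.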
But there is a redundancy, because ll stores each ID twice: once in its names attribute and again as the first element of one of the 2-element sublists. In R, it would be more natural to use l, rather than ll, as your "dictionary", because then you could do l$id1 to retrieve the list of id1 rows, rather than ll$id1[[2L]].
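As a rough illustration of that dictionary-style use, the sketch below looks up rows by ID directly in l. It also shows one possible variant, l_plain (a name invented here), in which every row is a plain named list rather than a 1-row data frame, which may be closer to the structure described in the question. The sample data is retyped by hand, with the blank strand assumed to be a single space.

```r
# Sample rows retyped from the question; the blank strand is assumed to be " ".
data <- data.frame(
  ID = c("id1", "id1", "id2", "id2", "id2", "id3"),
  pos = c(12L, 13L, 24L, 25L, 26L, 10L),
  strand = " ",
  nucleotide = c("A", "C", "G", "T", "A", "C"),
  count = c(13L, 25L, 10L, 25L, 10L, 5L)
)

# One entry per ID; each entry is a list of 1-row data frames.
l <- split(unname(split(data, seq_len(nrow(data)))), data$ID)

id1_rows    <- l$id1      # look up by name, like a dictionary key
rows_per_id <- lengths(l) # number of rows stored under each ID

# Variant: convert each 1-row data frame into a plain named list.
l_plain   <- lapply(l, function(rows) lapply(rows, as.list))
first_id3 <- l_plain$id3[[1]]
```

If the 1-row pieces are not needed at all, split(data, data$ID) gives one data frame per ID, which is usually the simpler thing to work with.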