This is somewhat similar to my previous question, "Split data frame string column and count items. (dplyr and R)", but what I would like to know is how to split the column items and get the result as a vector instead of a list.
library("tidyverse")
dat <- data.frame(
  ID = c("A", "B"),
  gene_ids = c(
    "101739/20382/13006/212377/114714/66622/140917",
    "75717/103573/14852/18141/12567/26429/20842/17975/12545"
  )
)
tmp <- dat %>% mutate(ids = str_split(gene_ids, "/"))
tmp$ids
#> [[1]]
#> [1] "101739" "20382" "13006" "212377" "114714" "66622" "140917"
#>
#> [[2]]
#> [1] "75717" "103573" "14852" "18141" "12567" "26429" "20842" "17975"
#> [9] "12545"
tmp
#> ID gene_ids
#> 1 A 101739/20382/13006/212377/114714/66622/140917
#> 2 B 75717/103573/14852/18141/12567/26429/20842/17975/12545
#> ids
#> 1 101739, 20382, 13006, 212377, 114714, 66622, 140917
#> 2 75717, 103573, 14852, 18141, 12567, 26429, 20842, 17975, 12545
dat %>% mutate(please_be_vector = str_split(gene_ids, "/") %>% unlist())
#> Error: Problem with `mutate()` input `please_be_vector`.
#> x Input `please_be_vector` can't be recycled to size 2.
#> ℹ Input `please_be_vector` is `str_split(gene_ids, "/") %>% unlist()`.
#> ℹ Input `please_be_vector` must be size 2 or 1, not 16.
I would like tmp$ids to be a vector instead of a list, as shown below. Is this possible using dplyr?
tmp$ids[1]
"101739" "20382" "13006" "212377" "114714" "66622" "140917"
tmp$ids[2]
"75717" "103573" "14852" "18141" "12567" "26429" "20842" "17975" "12545"
Is it possible?
CodePudding user response:
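We can reshape the data to one column per ID and then call unclass() on the result to get a named list holding the vectors:
library(dplyr)
library(tidyr)  # separate_rows() and pivot_wider() come from tidyr, not dplyr
dat %>%
  separate_rows(gene_ids, sep = "/") %>%
  pivot_wider(names_from = ID, values_from = gene_ids) %>%
  unclass()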
$A
$A[[1]]
[1] "101739" "20382" "13006" "212377" "114714" "66622" "140917"
$B
$B[[1]]
[1] "75717" "103573" "14852" "18141" "12567" "26429" "20842" "17975" "12545"
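If a plain named list of character vectors is all that is needed, the same result can probably be reached without reshaping. The following is a minimal base R sketch, not part of the answer above; the name ids_by_row is made up for illustration, and stringsAsFactors = FALSE is set explicitly on the assumption that the code may run on an R version older than 4.0.

```r
# Rebuild the question's data; keep gene_ids as character, not factor.
dat <- data.frame(
  ID = c("A", "B"),
  gene_ids = c(
    "101739/20382/13006/212377/114714/66622/140917",
    "75717/103573/14852/18141/12567/26429/20842/17975/12545"
  ),
  stringsAsFactors = FALSE
)

# strsplit() returns one character vector per row; setNames() labels them by ID.
# ids_by_row is a hypothetical name used only in this sketch.
ids_by_row <- setNames(strsplit(dat$gene_ids, "/", fixed = TRUE), dat$ID)

ids_by_row$A       # character vector with the 7 IDs of row A
ids_by_row[["B"]]  # character vector with the 9 IDs of row B
```

Each element is then an ordinary character vector, so ids_by_row$A can be passed directly to functions that expect a vector.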
CodePudding user response:
tmp$ids is a list of two character vectors, one for each row of the data. When you subset a list using [, you get a list. Instead, use [[:
> tmp$ids[[1]]
[1] "101739" "20382" "13006" "212377" "114714" "66622" "140917"
A good resource to understand this better is the chapter on subsetting in Advanced R.
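To make the difference between [ and [[ concrete, here is a small sketch that rebuilds the split list with base R's strsplit() so it runs on its own. The names ids, one_bracket and two_bracket are only illustrative; ids stands in for tmp$ids from the question.

```r
# Same strings as in the question, split on "/" into a list of two vectors.
gene_ids <- c(
  "101739/20382/13006/212377/114714/66622/140917",
  "75717/103573/14852/18141/12567/26429/20842/17975/12545"
)
ids <- strsplit(gene_ids, "/", fixed = TRUE)

one_bracket <- ids[1]    # still a list, of length 1
two_bracket <- ids[[1]]  # the character vector stored in the first element

class(one_bracket)   # "list"
class(two_bracket)   # "character"
length(two_bracket)  # 7
```

So the column itself stays a list-column, which is what a data frame needs when rows hold vectors of different lengths; [[ is simply the way to pull one row's vector out of it.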
CodePudding user response:
Update: Maybe this one:
dat %>%
  separate_rows(gene_ids) %>%
  arrange(ID, gene_ids) %>%
  group_by(ID) %>%
  mutate(id = row_number()) %>%
  pivot_wider(
    names_from = ID,
    values_from = gene_ids
  ) %>%
  pull(A) # alternative pull(B)
[1] "101739" "114714" "13006" "140917" "20382" "212377" "66622" NA
[9] NA
First answer:
library(tidyverse)
dat %>%
  mutate(ids = str_split(gene_ids, "/")) %>%
  unnest(ids) %>%
  pull(ids)
output:
[1] "101739" "20382" "13006" "212377" "114714" "66622" "140917" "75717"
[9] "103573" "14852" "18141" "12567" "26429" "20842" "17975" "12545"
or:
tmp <- dat %>% mutate(ids = str_split(gene_ids, "/"))
unlist(tmp$ids)
output:
[1] "101739" "20382" "13006" "212377" "114714" "66622" "140917" "75717"
[9] "103573" "14852" "18141" "12567" "26429" "20842" "17975" "12545"