str_split for column values and then turn it into vector in R

Time:12-21

This is somewhat similar to my previous question, "Split data frame string column and count items (dplyr and R)", but what I would like to know is how to split the column values and get the result as a vector instead of a list.

library("tidyverse")
dat <- data.frame(ID = c("A", "B"),
                  gene_ids = c(
                    "101739/20382/13006/212377/114714/66622/140917",
                    "75717/103573/14852/18141/12567/26429/20842/17975/12545"
                  )
)

tmp <- dat %>% mutate(ids = str_split(gene_ids, "/")) 
tmp$ids
#> [[1]]
#> [1] "101739" "20382"  "13006"  "212377" "114714" "66622"  "140917"
#> 
#> [[2]]
#> [1] "75717"  "103573" "14852"  "18141"  "12567"  "26429"  "20842"  "17975" 
#> [9] "12545"
tmp
#>   ID                                               gene_ids
#> 1  A          101739/20382/13006/212377/114714/66622/140917
#> 2  B 75717/103573/14852/18141/12567/26429/20842/17975/12545
#>                                                              ids
#> 1            101739, 20382, 13006, 212377, 114714, 66622, 140917
#> 2 75717, 103573, 14852, 18141, 12567, 26429, 20842, 17975, 12545

dat %>% mutate(please_be_vector = str_split(gene_ids, "/") %>% unlist())
#> Error: Problem with `mutate()` input `please_be_vector`.
#> x Input `please_be_vector` can't be recycled to size 2.
#> ℹ Input `please_be_vector` is `str_split(gene_ids, "/") %>% unlist()`.
#> ℹ Input `please_be_vector` must be size 2 or 1, not 16.

I would like tmp$ids to give a vector instead of a list, as shown below. Is this possible using dplyr?

tmp$ids[1]
"101739" "20382"  "13006"  "212377" "114714" "66622"  "140917"
tmp$ids[2]
"75717"  "103573" "14852"  "18141"  "12567"  "26429"  "20842"  "17975" "12545"

Is it possible?

CodePudding user response:

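We can reshape the data to wide format and then call unclass() on the result. This gives a named list with one element per ID, each wrapping that ID's character vector (see $A[[1]] in the output):

library(dplyr)
library(tidyr)  # separate_rows() and pivot_wider() come from tidyr

dat %>%
  separate_rows(everything(), sep = "/") %>%
  pivot_wider(names_from = ID, values_from = gene_ids) %>%
  unclass()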

$A
$A[[1]]
[1] "101739" "20382"  "13006"  "212377" "114714" "66622"  "140917"


$B
$B[[1]]
[1] "75717"  "103573" "14852"  "18141"  "12567"  "26429"  "20842"  "17975"  "12545" 
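
If the wide reshape is more than you need, a base R sketch along these lines should give a named list whose elements are plain character vectors. Note that ids_by_id is only an illustrative name, and stringsAsFactors = FALSE is there as a precaution for R versions before 4.0:

```r
dat <- data.frame(
  ID = c("A", "B"),
  gene_ids = c(
    "101739/20382/13006/212377/114714/66622/140917",
    "75717/103573/14852/18141/12567/26429/20842/17975/12545"
  ),
  stringsAsFactors = FALSE
)

# strsplit() returns one character vector per row;
# setNames() labels each vector with the matching ID
ids_by_id <- setNames(strsplit(dat$gene_ids, "/", fixed = TRUE), dat$ID)

ids_by_id$A        # character vector with the 7 ids of row A
ids_by_id[["B"]]   # character vector with the 9 ids of row B
```

With this shape, ids_by_id$A and ids_by_id[["B"]] return the vectors directly, without the extra [[1]] level.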

CodePudding user response:

tmp$ids is a list of two character vectors, one for each row of the data. Subsetting a list with [ always returns a list, even when you ask for a single element. Use [[ to extract the character vector itself:

tmp$ids[[1]]
[1] "101739" "20382"  "13006"  "212377" "114714" "66622"  "140917"

A good resource to understand this better is the chapter on subsetting in Advanced R.
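
To see the difference side by side, here is a small sketch. The list ids is a shortened stand-in for tmp$ids, built by hand so that it runs without any packages:

```r
# A shortened stand-in for tmp$ids: a list of character vectors
ids <- list(
  c("101739", "20382", "13006"),
  c("75717", "103573")
)

single <- ids[1]    # [ keeps the list wrapper
double <- ids[[1]]  # [[ extracts the element itself

is.list(single)       # TRUE
is.character(double)  # TRUE
length(single)        # 1 (a list holding one vector)
length(double)        # 3 (the vector's own length)
```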

CodePudding user response:

Update: Maybe this one:

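library(tidyverse)

dat %>% 
  separate_rows(gene_ids) %>%       # one row per gene id
  arrange(ID, gene_ids) %>%         # note: sorts the ids as strings within each ID
  group_by(ID) %>% 
  mutate(id = row_number()) %>%     # position within each ID, used to align rows
  pivot_wider(
    names_from = ID,
    values_from = gene_ids
  ) %>% 
  pull(A) # alternative pull(B); A is padded with NA because B has more ids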
[1] "101739" "114714" "13006"  "140917" "20382"  "212377" "66622"  NA      
[9] NA   

First answer:

library(tidyverse)

dat %>% mutate(ids = str_split(gene_ids, "/")) %>% 
  unnest(ids) %>% 
  pull(ids)

output:

 [1] "101739" "20382"  "13006"  "212377" "114714" "66622"  "140917" "75717" 
 [9] "103573" "14852"  "18141"  "12567"  "26429"  "20842"  "17975"  "12545" 

or:

tmp <- dat %>% mutate(ids = str_split(gene_ids, "/"))
unlist(tmp$ids)

output:

 [1] "101739" "20382"  "13006"  "212377" "114714" "66622"  "140917" "75717" 
 [9] "103573" "14852"  "18141"  "12567"  "26429"  "20842"  "17975"  "12545" 
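
Both variants above flatten everything into a single vector, so you can no longer tell which ID a value came from. This is also why the mutate() attempt in the question fails: a regular column needs one value per row, and unlist() returns 16 values for 2 rows. If you want an ordinary character column and still need the ID, a long format is one option. The following is a base R sketch, where long and gene_id are illustrative names:

```r
dat <- data.frame(
  ID = c("A", "B"),
  gene_ids = c(
    "101739/20382/13006/212377/114714/66622/140917",
    "75717/103573/14852/18141/12567/26429/20842/17975/12545"
  ),
  stringsAsFactors = FALSE
)

# One character vector per row
ids <- strsplit(dat$gene_ids, "/", fixed = TRUE)

# Repeat each ID once per gene id, so both columns have 16 entries
long <- data.frame(
  ID = rep(dat$ID, times = lengths(ids)),
  gene_id = unlist(ids),
  stringsAsFactors = FALSE
)

# Plain character vector of the ids that belong to A
long$gene_id[long$ID == "A"]
```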