Split R string into individual characters

Time:04-13

I think this should be simple, but I can't find another example that works for my purposes. I have many DNA sequences in one column of a data frame in R, and I would like to split them into multiple columns with one base per column. For example:

V$1
ggggcc
cccctt
tttttt
aaaaaa

I want it to look like

V$1 V$2 V$3 V$4 V$5 V$6
 g   g   g   g   c   c
 c   c   c   c   t   t
 t   t   t   t   t   t
 a   a   a   a   a   a

I have tried

paste(L1HS2, collapse = "")
unlist(strsplit(L1HS2, split = ""))

and

data.frame(str_split_fixed(L1HS2, "", max(nchar(L1HS2))))

But I lose the data frame structure and end up with 1 very long row with many columns. This has to be easy, right? TIA!

CodePudding user response:

You could use

data.frame(Reduce(rbind, strsplit(df$V1, "")))

This returns

     X1 X2 X3 X4 X5 X6
init  g  g  g  g  c  c
X     c  c  c  c  t  t
X.1   t  t  t  t  t  t
X.2   a  a  a  a  a  a

or

data.frame(do.call(rbind, strsplit(df$V1, "")))

which returns

  X1 X2 X3 X4 X5 X6
1  g  g  g  g  c  c
2  c  c  c  c  t  t
3  t  t  t  t  t  t
4  a  a  a  a  a  a
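
To see why the attempt in the question ends up as one long row: strsplit() returns a list with one character vector per sequence, and unlist() flattens that list into a single vector. Binding the list elements as rows keeps one row per sequence. Here is a minimal sketch, assuming the column is called V1 and all sequences have the same length (the V1..V6 column names are my own choice, not something the functions produce):

```r
df <- data.frame(V1 = c("ggggcc", "cccctt", "tttttt", "aaaaaa"),
                 stringsAsFactors = FALSE)

# strsplit() gives a list: one character vector per sequence
parts <- strsplit(df$V1, "")

# unlist() flattens the list, so the per-sequence structure is lost
flat <- unlist(parts)
length(flat)  # 24 single characters in one vector

# binding the list elements as rows keeps one row per sequence
mat <- do.call(rbind, parts)
colnames(mat) <- paste0("V", seq_len(ncol(mat)))
res <- as.data.frame(mat)
```

If the sequences differ in length, rbind() recycles the shorter vectors with a warning, so this approach is only safe when every sequence has the same number of characters.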

CodePudding user response:

You can use separate() from tidyr.

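library(tidyr)  # provides separate() and re-exports %>%

# first the data:
'V1
ggggcc
cccctt
tttttt
aaaaaa' %>% data.table::fread(data.table = FALSE) -> df

# split after each character position: 1, 2, ..., nchar
sl <- seq_len(nchar(df$V1[1]))
separate(df, V1, paste0('X', sl), sep = sl)
#>   X1 X2 X3 X4 X5 X6
#> 1  g  g  g  g  c  c
#> 2  c  c  c  c  t  t
#> 3  t  t  t  t  t  t
#> 4  a  a  a  a  a  a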

Separating on the empty string ("") doesn't work very nicely with separate, so I separate on each numeric position instead.
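
The same position-based idea can be sketched in base R with substr(), which is vectorised over the input strings. This is only an illustration under the assumption that the column is called V1. The X1..X6 names mirror the ones used above, and `cols` and `res` are names I made up for the example:

```r
df <- data.frame(V1 = c("ggggcc", "cccctt", "tttttt", "aaaaaa"),
                 stringsAsFactors = FALSE)

# one position per character of the longest sequence
pos <- seq_len(max(nchar(df$V1)))

# for each position i, take character i from every sequence at once
cols <- lapply(pos, function(i) substr(df$V1, i, i))
names(cols) <- paste0("X", pos)

# a named list of equal-length vectors converts directly to a data frame
res <- as.data.frame(cols, stringsAsFactors = FALSE)
```

Because substr() returns an empty string when the position lies past the end of a string, shorter sequences would simply get "" in the trailing columns instead of causing an error.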

CodePudding user response:

Another possible solution:

library(tidyverse)

df <- data.frame(V1 = c("ggggcc", "cccctt", "tttttt", "aaaaaa"))

df %>% 
 mutate(map_df(V1, ~ str_split(.x, "") %>% map(~ set_names(., str_c("V", 1:6)))))

#>   V1 V2 V3 V4 V5 V6
#> 1  g  g  g  g  c  c
#> 2  c  c  c  c  t  t
#> 3  t  t  t  t  t  t
#> 4  a  a  a  a  a  a