How to split factor names in a column within a data frame in R?

In the data frame (df), col2 contains encoded information separated by periods, for example d.MAG.5.2. I want to split each name into three columns (genus, species, replicate), which in this case would be d, MAG, and 5.2. Note that I don't want to split on the period inside the number at the end.

col1 <- rnorm(3)
col2 <- c("d.MAG.1.1","d.MAG.3.4","d.TEX.5.6")

df <- data.frame(col1, col2)

I want the new dataset to look like this:

col1 <- rnorm(3)
genus <- c("d","d","d")
species <- c("MAG","MAG","TEX")
replicate <- c("1.1","3.4","5.6")
df <- data.frame(col1, genus, species, replicate)

CodePudding user response：

Using tidyr::extract you could do:

library(tidyr)

df |>
  tidyr::extract(col2,
    into = c("genus", "species", "replicate"),
    regex = "^([^.])\\.([^.]*)\\.(.*)$"
  )
#>   col1 genus species replicate
#> 1    1     d     MAG       1.1
#> 2    2     d     MAG       3.4
#> 3    3     d     TEX       5.6
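To see what each capture group picks up, the same pattern can be tried in base R. This is only a sketch on the sample strings from the question; the names x, pattern, m, and parts are placeholders chosen for illustration.

```r
# Sample values from the question
x <- c("d.MAG.1.1", "d.MAG.3.4", "d.TEX.5.6")

# Same pattern as in the extract() call: one non-period character,
# then a run of non-period characters, then everything that is left
pattern <- "^([^.])\\.([^.]*)\\.(.*)$"

# regexec() finds the full match plus one entry per capture group
m <- regmatches(x, regexec(pattern, x))

# Drop the full match (first element) and bind the groups into a matrix
parts <- do.call(rbind, lapply(m, function(v) v[-1]))
colnames(parts) <- c("genus", "species", "replicate")
parts
```

Note that the first group, `[^.]`, matches exactly one character, which is enough for the single-letter genus codes in the question. Longer genus codes would presumably need `[^.]+` instead.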

CodePudding user response：

library(dplyr)
library(tidyr)

df %>%
  mutate(col2 = sub("(^.*)\\.(.*)\\.(\\d+\\.\\d+)", "\\1,\\2,\\3", col2)) %>%
  separate(col2, into = c("genus", "species", "replicate"), sep = ",")

       col1 genus species replicate
1 0.7264348     d     MAG       1.1
2 1.0411627     d     MAG       3.4
3 1.2302765     d     TEX       5.6
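The intermediate step may be easier to follow if sub() is run on its own. In this sketch the pattern is written with `+` quantifiers on the two digit groups, which is an assumption about what was intended; x and recoded are illustrative names.

```r
# Sample values from the question
x <- c("d.MAG.1.1", "d.MAG.3.4", "d.TEX.5.6")

# Rewrite each string so the three fields are comma-separated;
# the last group keeps the decimal number (digits, period, digits) intact
recoded <- sub("(^.*)\\.(.*)\\.(\\d+\\.\\d+)", "\\1,\\2,\\3", x)
recoded
```

After this step only the period inside the replicate number remains, so the comma is an unambiguous separator for separate().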

CodePudding user response：

Another method:

library(tidyverse)

df %>%
  separate(col2, into = c("genus", "species", "r1", "r2")) %>%
  unite("replicate", r1:r2, sep = ".")
#>         col1 genus species replicate
#> 1 -1.9530625     d     MAG       1.1
#> 2 -1.2879991     d     MAG       3.4
#> 3  0.8091388     d     TEX       5.6

Created on 2022-11-02 with reprex v2.0.2

CodePudding user response：

A couple of options that both use regex:

library(tidyverse)

#option 1
extract(df,
        col2,
        into = c("genus", "species", "replicate"),
        regex = "(\\w)\\.(\\w+)\\.(.*$)")
#>         col1 genus species replicate
#> 1 1.48039902     d     MAG       1.1
#> 2 0.03728211     d     MAG       3.4
#> 3 0.28125162     d     TEX       5.6

#option 2
separate(df, 
         col2,
         into = c("genus", "species" , "replicate"),
         sep = "\\.(?!\\d+$)")
#>         col1 genus species replicate
#> 1 1.48039902     d     MAG       1.1
#> 2 0.03728211     d     MAG       3.4
#> 3 0.28125162     d     TEX       5.6
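For comparison, the lookahead split from option 2 can also be sketched in base R with strsplit(), which needs perl = TRUE for the `(?!...)` lookahead. This is an illustration on the question's sample strings, not part of the answer above: the col1 values are placeholder numbers, and the pattern uses `\\d+` on the assumption that replicates with more than one digit after the period should also stay together.

```r
# Sample data modelled on the question; col1 holds placeholder numbers
col2 <- c("d.MAG.1.1", "d.MAG.3.4", "d.TEX.5.6")
df <- data.frame(col1 = c(0.1, 0.2, 0.3), col2)

# Split on every period except one followed only by digits up to the
# end of the string; as.character() guards against col2 being a factor
pieces <- strsplit(as.character(df$col2), "\\.(?!\\d+$)", perl = TRUE)

# Each element has three pieces, so they stack into a 3-column matrix
parts <- do.call(rbind, pieces)
colnames(parts) <- c("genus", "species", "replicate")

# Attach the new columns to col1
out <- cbind(df["col1"], as.data.frame(parts))
out
```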