In the data frame (df), col2 contains encoded information separated by periods, like d.MAG.5.2. I want to split the name into three categories (genus, species, replicate), which in this case would correspond to d, MAG, and 5.2. Note that I don't want to split on the period inside the number at the end.
col1 <- rnorm(1:3)
col2 <- c("d.MAG.1.1","d.MAG.3.4","d.TEX.5.6")
df <- data.frame(col1, col2)
I want the new dataset to look like this:
col1 <- rnorm(1:3)
genus <- c("d","d","d")
species <- c("MAG","MAG","TEX")
replicate <- c("1.1","3.4","5.6")
df <- data.frame(col1, genus, species, replicate)
CodePudding user response:
Using tidyr::extract you could do:
library(tidyr)
df |>
  tidyr::extract(col2,
    into = c("genus", "species", "replicate"),
    # three capture groups: up to the first period, up to the second, then the rest
    regex = "^([^.]+)\\.([^.]*)\\.(.*)$"
  )
#> col1 genus species replicate
#> 1 1 d MAG 1.1
#> 2 2 d MAG 3.4
#> 3 3 d TEX 5.6
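To see what each capture group grabs without loading any packages, here is a minimal base R sketch. It assumes the same three-part layout as the example data; the names x, m, and res are only for illustration and are not part of the answer above.

```r
# Example strings copied from the question.
x <- c("d.MAG.1.1", "d.MAG.3.4", "d.TEX.5.6")

# regexec() returns the full match followed by each capture group.
m <- regmatches(x, regexec("^([^.]+)\\.([^.]+)\\.(.*)$", x))

# Drop the full match (first element) and bind the groups into a matrix.
res <- do.call(rbind, lapply(m, function(g) g[-1]))
colnames(res) <- c("genus", "species", "replicate")
res
```

Because the last group is (.*), everything after the second period ends up in replicate, including the period inside the number.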
CodePudding user response:
library(dplyr)
library(tidyr)
df %>%
  # rewrite "d.MAG.1.1" as "d,MAG,1.1", then split on the commas
  mutate(col2 = sub("(^.*)\\.(.*)\\.(\\d+\\.\\d+)", "\\1,\\2,\\3", col2)) %>%
  separate(col2, into = c("genus", "species", "replicate"), sep = ",")
col1 genus species replicate
1 0.7264348 d MAG 1.1
2 1.0411627 d MAG 3.4
3 1.2302765 d TEX 5.6
CodePudding user response:
Another method:
library(tidyverse)
df %>%
separate(col2, into = c("genus", "species", "r1", "r2")) %>%
unite("replicate", r1:r2, sep = ".")
#> col1 genus species replicate
#> 1 -1.9530625 d MAG 1.1
#> 2 -1.2879991 d MAG 3.4
#> 3 0.8091388 d TEX 5.6
Created on 2022-11-02 with reprex v2.0.2
CodePudding user response:
A couple options that both use some regex:
library(tidyverse)
#option 1
extract(df,
col2,
into = c("genus", "species" , "replicate"),
regex = "(\\w+)\\.(\\w+)\\.(.*$)")
#> col1 genus species replicate
#> 1 1.48039902 d MAG 1.1
#> 2 0.03728211 d MAG 3.4
#> 3 0.28125162 d TEX 5.6
#option 2
separate(df,
col2,
into = c("genus", "species" , "replicate"),
sep = "\\.(?!\\d+$)")
#> col1 genus species replicate
#> 1 1.48039902 d MAG 1.1
#> 2 0.03728211 d MAG 3.4
#> 3 0.28125162 d TEX 5.6
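The negative lookahead in option 2 can be checked in base R as well. This is only a sketch, assuming the part after the last period is always digits; x, parts, and res are illustrative names. Note that strsplit() needs perl = TRUE for lookaheads.

```r
# Example strings copied from the question.
x <- c("d.MAG.1.1", "d.MAG.3.4", "d.TEX.5.6")

# Split on a period unless only digits remain between it and the end of the string.
parts <- strsplit(x, "\\.(?!\\d+$)", perl = TRUE)

# Each element now has three pieces; bind them into a character matrix.
res <- do.call(rbind, parts)
colnames(res) <- c("genus", "species", "replicate")
res
```

The final period is followed by digits and then the end of the string, so it is skipped as a split point and stays inside replicate.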