I have a data.frame that looks like this:
``` r
library(tidyverse)
df1 <- tibble(genes = c("AT1G02205", "AT1G02160", "AT5G02160", "ATCG02160"))
df1
#> # A tibble: 4 × 1
#>   genes
#>   <chr>
#> 1 AT1G02205
#> 2 AT1G02160
#> 3 AT5G02160
#> 4 ATCG02160
```
Created on 2022-10-19 with reprex v2.0.2
and I want to extract whatever sits between the letters `T` and `G` and put it in a new column, so that my `new.df` looks like this:
``` r
#>   genes     chr
#>   <chr>     <chr>
#> 1 AT1G02205 Chr1
#> 2 AT1G02160 Chr1
#> 3 AT5G02160 Chr5
#> 4 ATCG02160 ChrC
```
So far, I have found a nasty way to do this, but I am sure I could have done better.
``` r
library(tidyverse)
df1 <- tibble(genes = c("AT1G02205", "AT1G02160", "AT5G02160", "ATCG02160"))
new.df <- df1 |>
  mutate(chr = str_extract(genes, "T(.*?)G")) |>
  mutate(chr = str_replace_all(chr, c("T" = "", "G" = ""))) |>
  mutate(chr = paste0("Chr", chr))
new.df
#> # A tibble: 4 × 2
#>   genes     chr
#>   <chr>     <chr>
#> 1 AT1G02205 Chr1
#> 2 AT1G02160 Chr1
#> 3 AT5G02160 Chr5
#> 4 ATCG02160 ChrC
```
Answer:
You can use `str_match`:
``` r
library(stringr)
library(dplyr)
df1 %>%
  mutate(chr = str_c("Chr", str_match(genes, "T(.*)G")[, 2]))
#       genes  chr
# 1 AT1G02205 Chr1
# 2 AT1G02160 Chr1
# 3 AT5G02160 Chr5
# 4 ATCG02160 ChrC
```
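To see why the `[, 2]` index is needed, it may help to look at what `str_match` returns on its own: a character matrix whose first column holds the full match and whose later columns hold the capture groups. The following is only a minimal sketch; the vector `genes` and the made-up ID `"XYZ"` are assumptions added for illustration and are not part of the question's data.

```r
library(stringr)

# Illustrative input; "XYZ" is a made-up ID that does not match the pattern
genes <- c("AT1G02205", "ATCG02160", "XYZ")

# str_match() returns a character matrix with one row per input string
m <- str_match(genes, "T(.*)G")

# Column 1 is the full match, including the flanking "T" and "G"
m[, 1]
#> [1] "T1G" "TCG" NA

# Column 2 is the first capture group, i.e. only the part in parentheses
m[, 2]
#> [1] "1" "C" NA

# str_c() propagates NA, so a non-matching ID stays NA
chr <- str_c("Chr", m[, 2])
chr
#> [1] "Chr1" "ChrC" NA
```

Because the capture group already excludes `T` and `G`, no follow-up `str_replace_all` step is needed.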
Or in base R with `gsub`:
``` r
df1 |>
  transform(chr = paste0("Chr", gsub(".*T(.*)G.*", "\\1", genes)))
```
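One difference from the `str_match` version may be worth keeping in mind: `gsub` leaves a string untouched when the pattern does not match, so an unexpected ID would pass through with `Chr` glued to the front. The sketch below illustrates this and one possible guard. The vector `genes`, the made-up ID `"XYZ"`, and the helper name `extract_chr` are all hypothetical and are not part of the original answer.

```r
# Illustrative input; "XYZ" is a made-up ID that does not match the pattern
genes <- c("AT1G02205", "ATCG02160", "XYZ")

# gsub() returns non-matching strings unchanged, so "XYZ" passes through as-is
unguarded <- paste0("Chr", gsub(".*T(.*)G.*", "\\1", genes))
unguarded
#> [1] "Chr1"   "ChrC"   "ChrXYZ"

# extract_chr() is a hypothetical helper: it returns NA where the pattern does not match
extract_chr <- function(x, pattern = ".*T(.*)G.*") {
  ifelse(grepl(pattern, x), paste0("Chr", sub(pattern, "\\1", x)), NA_character_)
}

guarded <- extract_chr(genes)
guarded
#> [1] "Chr1" "ChrC" NA
```

If every ID in the column is known to follow the `AT<chr>G<number>` layout, the guard is unnecessary and the one-liner above it is enough.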