Extract a string between two letters and create a new column in R dplyr

I have a data.frame that looks like this

library(tidyverse)
df1 <- tibble(genes=c("AT1G02205","AT1G02160","AT5G02160", "ATCG02160"))
df1
#> # A tibble: 4 × 1
#>   genes    
#>   <chr>    
#> 1 AT1G02205
#> 2 AT1G02160
#> 3 AT5G02160
#> 4 ATCG02160

^{Created on 2022-10-19 with reprex v2.0.2}

and I want to extract whatever lies between the letters T and G and store it, prefixed with "Chr", in a new column, so that my new.df looks like

#>   genes     chr  
#>   <chr>     <chr>
#> 1 AT1G02205 Chr1 
#> 2 AT1G02160 Chr1 
#> 3 AT5G02160 Chr5 
#> 4 ATCG02160 ChrC 

So far, I have found a clumsy way to do this (extract the match including T and G, strip both letters, then paste on the prefix), but I am sure there is a cleaner one.

``` r
library(tidyverse)
df1 <- tibble(genes=c("AT1G02205","AT1G02160","AT5G02160", "ATCG02160"))

new.df <-  df1 |> 
  mutate(chr=str_extract(genes, "T(.*?)G"))  |> 
  mutate(chr=str_replace_all(chr, c("T"="", "G"=""))) |> 
  mutate(chr=paste0("Chr",chr))
new.df
#> # A tibble: 4 × 2
#>   genes     chr  
#>   <chr>     <chr>
#> 1 AT1G02205 Chr1 
#> 2 AT1G02160 Chr1 
#> 3 AT5G02160 Chr5 
#> 4 ATCG02160 ChrC 
```

CodePudding user response:

You can use str_match:

library(stringr)
library(dplyr)
df1 %>% 
  mutate(chr = str_c("Chr", str_match(genes, "T(.*)G")[, 2]))

#   genes     chr  
# 1 AT1G02205 Chr1 
# 2 AT1G02160 Chr1 
# 3 AT5G02160 Chr5 
# 4 ATCG02160 ChrC
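
The `[, 2]` is needed because `str_match()` returns a character matrix: column 1 holds the whole match and column 2 the first capture group. Here is a minimal sketch of that, using the gene IDs from the question. The lazy quantifier `.*?` and the lookaround variant are my additions, not part of the answer above. They are meant as a guard for IDs that might contain a second G, which none of the sample IDs do.

```r
library(stringr)

genes <- c("AT1G02205", "AT1G02160", "AT5G02160", "ATCG02160")

# Column 1 is the full match ("T1G", "T1G", "T5G", "TCG"),
# column 2 is the first capture group ("1", "1", "5", "C").
m <- str_match(genes, "T(.*?)G")
chr_match <- str_c("Chr", m[, 2])

# Alternative (an assumption, not from the answer): lookarounds let
# str_extract() return only the text between T and G, so no indexing is needed.
chr_look <- str_c("Chr", str_extract(genes, "(?<=T).*?(?=G)"))

chr_match
chr_look
```

Both vectors should contain the same values as the `chr` column above. With the greedy `.*` the capture would run to the last G in the string, which makes no difference for these four IDs.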

Or in base R with gsub:

df1 |>
  transform(chr = paste0("Chr", gsub(".*T(.*)G.*", '\\1', genes)))
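
One caveat with the `gsub()` approach: when the pattern does not match, `gsub()` and `sub()` return the input unchanged, so a malformed ID would end up as "Chr" followed by the whole string. The sketch below is one way to guard against that. It assumes every valid ID starts with "AT", followed by a single chromosome character and then "G". The anchored pattern and the `NA` handling are my assumptions, not part of the answer above.

```r
genes <- c("AT1G02205", "AT1G02160", "AT5G02160", "ATCG02160")

# Anchored pattern (assumed ID layout): "AT", one captured chromosome
# character, then "G" and the rest of the ID.
pattern <- "^AT(.)G.*$"

# sub() is enough here because each ID is replaced once as a whole.
chr <- paste0("Chr", sub(pattern, "\\1", genes))

# sub() leaves non-matching input untouched, so mark those rows as NA
# instead of producing something like "Chr<whole ID>".
chr[!grepl(pattern, genes)] <- NA_character_

new.df <- data.frame(genes = genes, chr = chr)
new.df
```

For the four IDs in the question every row matches, so no `NA` values are produced and the `chr` column is the same as in the outputs above.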