Chromosome_name | Start Position |
---|---|
CHR_HSCHR7_2_CTG6 | 142857940 |
CHR_HSCHR19LRC_PGF2_CTG3_1 | 54316049 |
I have just started using R. I have a data frame of chromosome names, and I want to replace each long name with just the chromosome number. For example, CHR_HSCHR19LRC_PGF2_CTG3_1 would become "19": I need the number that comes directly after the characters "HSCHR". How would I do this?
I tried entering the replacement value manually:
gsub(".*HSCHR19", "19", dataframe)
But this takes far too long for a list of more than 100 values. I would like to find a way to do this automatically.
CodePudding user response:
You can use
sub('^.*CHR(\\d+).*$', '\\1', Chromosome_name)
#> [1] "7" "19"
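To apply this to the data frame itself rather than to a standalone vector, the result can be assigned back to the column. This is a minimal sketch, assuming the data frame is called `df` and has the two columns from the question; the pattern is anchored on `HSCHR` instead of `CHR`, which is my assumption about what the names always contain.

```r
# Hypothetical data frame mirroring the table in the question
df <- data.frame(
  Chromosome_name = c("CHR_HSCHR7_2_CTG6", "CHR_HSCHR19LRC_PGF2_CTG3_1"),
  Start_Position  = c(142857940, 54316049)
)

# Capture the run of digits right after "HSCHR" and keep only that group.
# sub() is vectorised, so the whole column is handled in one call.
df$Chromosome_name <- sub("^.*HSCHR(\\d+).*$", "\\1", df$Chromosome_name)

df$Chromosome_name
```

Note that `sub()` returns a name unchanged when the pattern does not match, so any rows without `HSCHR` followed by digits would keep their long name.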
CodePudding user response:
Another potential option is a look-behind regex, e.g.
library(tidyverse)
df <- read.table(text = "Chromosome_name Start_Position
CHR_HSCHR7_2_CTG6 142857940
CHR_HSCHR19LRC_PGF2_CTG3_1 54316049", header = TRUE)
df2 <- df %>%
  mutate(Chromosome_name = str_extract(Chromosome_name, "(?<=HSCHR)\\d+"))
df2
#> Chromosome_name Start_Position
#> 1 7 142857940
#> 2 19 54316049
Created on 2022-03-22 by the reprex package (v2.0.1)
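If you would rather not load the tidyverse, the same look-behind can probably be done in base R with `regexpr()` and `regmatches()`, using `perl = TRUE` so that look-behind syntax is supported. This is only a sketch; the vector `x` and the result name `chrom` are placeholders for illustration.

```r
# Example names taken from the question
x <- c("CHR_HSCHR7_2_CTG6", "CHR_HSCHR19LRC_PGF2_CTG3_1")

# Locate the first run of digits preceded by "HSCHR" (look-behind needs perl = TRUE)
m <- regexpr("(?<=HSCHR)\\d+", x, perl = TRUE)

# regmatches() drops non-matching elements, so fill a vector of NAs
# and write the matches into the positions where regexpr() found something
chrom <- rep(NA_character_, length(x))
chrom[m != -1] <- regmatches(x, m)

chrom
```

Names that do not match end up as `NA`, which is the same behaviour as `str_extract()`.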