Extract characters and numbers from a string using R

Here is part of my data frame.

> df
    Group Direction cytoband  q value residual q value      wide peak boundaries
V29   All       DEL    11q25 7.78E-43         2.22E-39 chr11:130906630-135086622
V30   All       DEL  Xp22.11 3.01E-38         1.91E-35    chrX:23277186-26139553
V31   All       DEL 10q23.31 3.61E-31         3.61E-31   chr10:87745632-87859602
V32   All       DEL  22q12.3 4.03E-25         3.96E-25   chr22:33050952-34766503
V33   All       DEL  11p15.4 6.59E-25         6.59E-25     chr11:3230287-3799554

I want to extract the character or number after "chr" in the "wide peak boundaries" column, along with the start and end positions. I tried the code below, but the second row gets NA values.

library(tidyr)
df <- extract(df, 'wide peak boundaries', into = c('chr', 'start', 'end'), 
              '(\\d+):(\\d+)-(\\d+)', remove = F, convert = T)
df
    Group Direction cytoband  q value residual q value      wide peak boundaries chr     start       end
V29   All       DEL    11q25 7.78E-43         2.22E-39 chr11:130906630-135086622  11 130906630 135086622
V30   All       DEL  Xp22.11 3.01E-38         1.91E-35    chrX:23277186-26139553  NA        NA        NA
V31   All       DEL 10q23.31 3.61E-31         3.61E-31   chr10:87745632-87859602  10  87745632  87859602
V32   All       DEL  22q12.3 4.03E-25         3.96E-25   chr22:33050952-34766503  22  33050952  34766503
V33   All       DEL  11p15.4 6.59E-25         6.59E-25     chr11:3230287-3799554  11   3230287   3799554

data

structure(list(Group = c("All", "All", "All", "All", "All"), 
    Direction = c("DEL", "DEL", "DEL", "DEL", "DEL"), cytoband = c("11q25", 
    "Xp22.11", "10q23.31", "22q12.3", "11p15.4"), `q value` = c("7.78E-43", 
    "3.01E-38", "3.61E-31", "4.03E-25", "6.59E-25"), `residual q value` = c("2.22E-39", 
    "1.91E-35", "3.61E-31", "3.96E-25", "6.59E-25"), `wide peak boundaries` = c("chr11:130906630-135086622", 
    "chrX:23277186-26139553", "chr10:87745632-87859602", "chr22:33050952-34766503", 
    "chr11:3230287-3799554"), chr = c(11L, NA, 10L, 22L, 11L), 
    start = c(130906630L, NA, 87745632L, 33050952L, 3230287L), 
    end = c(135086622L, NA, 87859602L, 34766503L, 3799554L)), class = "data.frame", row.names = c("V29", 
"V30", "V31", "V32", "V33"))

CodePudding user response：

The idea is to split on : and -, while keeping the literal "chr" prefix out of the first capture group. So you could use:

(updated based on comment from @Chris Ruehlemann)

df %>%
  extract("wide peak boundaries",
          into = c("chr", "start", "end"),
          regex = "((?<=chr).*):(.*)-(.*)",
          remove = FALSE)

which gives:

    Group Direction cytoband  q value residual q value      wide peak boundaries chr     start       end
V29   All       DEL    11q25 7.78E-43         2.22E-39 chr11:130906630-135086622  11 130906630 135086622
V30   All       DEL  Xp22.11 3.01E-38         1.91E-35    chrX:23277186-26139553   X  23277186  26139553
V31   All       DEL 10q23.31 3.61E-31         3.61E-31   chr10:87745632-87859602  10  87745632  87859602
V32   All       DEL  22q12.3 4.03E-25         3.96E-25   chr22:33050952-34766503  22  33050952  34766503
V33   All       DEL  11p15.4 6.59E-25         6.59E-25     chr11:3230287-3799554  11   3230287   3799554
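As a cross-check that does not depend on tidyr, here is a minimal base R sketch of the same idea. It is an alternative, not the code from the answer above: instead of a lookbehind, it puts the literal "chr" outside the first capture group. The names x, pattern, parts and res are made up for this illustration, and the sample strings are copied from the question.

```r
# Sample values copied from the "wide peak boundaries" column in the question
x <- c("chr11:130906630-135086622",
       "chrX:23277186-26139553",
       "chr10:87745632-87859602")

# Match the literal "chr" outside the first capture group, so only the
# identifier after it is captured; no lookbehind is needed
pattern <- "^chr(\\w+):(\\d+)-(\\d+)$"

# regexec() returns the full match plus one entry per capture group
m <- regmatches(x, regexec(pattern, x))
parts <- do.call(rbind, m)  # one row per string: full match, chr, start, end

# 'res' is a hypothetical name used only for this illustration
res <- data.frame(
  chr   = parts[, 2],
  start = as.integer(parts[, 3]),
  end   = as.integer(parts[, 4]),
  stringsAsFactors = FALSE
)
print(res)
```

The chr column stays character because "X" cannot be converted to a number, while start and end are converted to integers.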

CodePudding user response：

library(data.table)
# df is the data frame from the question; the split is on ':' or '-'
setDT(df)[, c("chr", "start", "end") := tstrsplit(`wide peak boundaries`, "[:-]", perl = TRUE)]

   Group Direction cytoband  q value residual q value      wide peak boundaries   chr     start       end
1:   All       DEL    11q25 7.78E-43         2.22E-39 chr11:130906630-135086622 chr11 130906630 135086622
2:   All       DEL  Xp22.11 3.01E-38         1.91E-35    chrX:23277186-26139553  chrX  23277186  26139553
3:   All       DEL 10q23.31 3.61E-31         3.61E-31   chr10:87745632-87859602 chr10  87745632  87859602
4:   All       DEL  22q12.3 4.03E-25         3.96E-25   chr22:33050952-34766503 chr22  33050952  34766503
5:   All       DEL  11p15.4 6.59E-25         6.59E-25     chr11:3230287-3799554 chr11   3230287   3799554
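Note that the chr column in the output above still carries the "chr" prefix. One way to drop it, sketched here in base R so that it runs without data.table, is to strip the prefix with sub() before splitting. The variable names are made up for this illustration. The same sub() call could presumably be wrapped around the column inside tstrsplit(), but that variant is not tested here.

```r
# Sample values copied from the "wide peak boundaries" column in the question
x <- c("chr11:130906630-135086622", "chrX:23277186-26139553")

# Remove the leading "chr", then split on ':' or '-'
pieces <- strsplit(sub("^chr", "", x), "[:-]")

# Each list element is now c(chr, start, end); pull out each position
chr   <- vapply(pieces, `[`, character(1), 1)
start <- as.integer(vapply(pieces, `[`, character(1), 2))
end   <- as.integer(vapply(pieces, `[`, character(1), 3))

print(data.frame(chr, start, end))
```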

CodePudding user response：

You only need to change \\d in the first capture group to \\w (\\d matches only digits, whereas \\w matches letters, digits, and the underscore):

EDIT: (?<=chr) is a positive lookbehind. It makes sure that \\w+ only starts matching after the string chr has occurred:

df %>% 
  extract(col = 'wide peak boundaries', 
          into = c('chr', 'start', 'end'),
          regex = '((?<=chr)\\w+):(\\d+)-(\\d+)', 
          remove = FALSE, convert = TRUE)
    Group Direction cytoband  q value residual q value      wide peak boundaries chr     start       end
V29   All       DEL    11q25 7.78E-43         2.22E-39 chr11:130906630-135086622  11 130906630 135086622
V30   All       DEL  Xp22.11 3.01E-38         1.91E-35    chrX:23277186-26139553   X  23277186  26139553
V31   All       DEL 10q23.31 3.61E-31         3.61E-31   chr10:87745632-87859602  10  87745632  87859602
V32   All       DEL  22q12.3 4.03E-25         3.96E-25   chr22:33050952-34766503  22  33050952  34766503
V33   All       DEL  11p15.4 6.59E-25         6.59E-25     chr11:3230287-3799554  11   3230287   3799554
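The difference between \\d and \\w, and the effect of the lookbehind, can be checked on their own with a few lines of base R. This is only a small sketch; the variable names are invented for the example. In base R the lookbehind needs perl = TRUE.

```r
# Chromosome identifiers as they appear after "chr" in the question's data
ids <- c("11", "X", "22")

# \\d matches digits only, so "X" fails; \\w also accepts letters
digits_only <- grepl("^\\d+$", ids)   # TRUE FALSE TRUE
word_chars  <- grepl("^\\w+$", ids)   # TRUE TRUE TRUE

x <- c("chr11:130906630-135086622", "chrX:23277186-26139553")

# The lookbehind requires "chr" immediately before the match but does not
# include it in the result; \\w+ stops at ":" because ":" is not a word character
after_chr <- regmatches(x, regexpr("(?<=chr)\\w+", x, perl = TRUE))
print(after_chr)                      # "11" "X"
```

This is why the original pattern returned NA for the chrX row: nothing in "chrX" satisfies \\d+ directly before the colon.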