Extract characters and numbers from a string using R

Time:11-09

Here is part of my data frame.

> df
    Group Direction cytoband  q value residual q value      wide peak boundaries
V29   All       DEL    11q25 7.78E-43         2.22E-39 chr11:130906630-135086622
V30   All       DEL  Xp22.11 3.01E-38         1.91E-35    chrX:23277186-26139553
V31   All       DEL 10q23.31 3.61E-31         3.61E-31   chr10:87745632-87859602
V32   All       DEL  22q12.3 4.03E-25         3.96E-25   chr22:33050952-34766503
V33   All       DEL  11p15.4 6.59E-25         6.59E-25     chr11:3230287-3799554

I want to extract the character or number after "chr" in the "wide peak boundaries" column. I tried the code below, but the second row gets NA values.

library(tidyr)
df <- extract(df, 'wide peak boundaries', into = c('chr', 'start', 'end'), 
              '(\\d+):(\\d+)-(\\d+)', remove = F, convert = T)
df
    Group Direction cytoband  q value residual q value      wide peak boundaries chr     start       end
V29   All       DEL    11q25 7.78E-43         2.22E-39 chr11:130906630-135086622  11 130906630 135086622
V30   All       DEL  Xp22.11 3.01E-38         1.91E-35    chrX:23277186-26139553  NA        NA        NA
V31   All       DEL 10q23.31 3.61E-31         3.61E-31   chr10:87745632-87859602  10  87745632  87859602
V32   All       DEL  22q12.3 4.03E-25         3.96E-25   chr22:33050952-34766503  22  33050952  34766503
V33   All       DEL  11p15.4 6.59E-25         6.59E-25     chr11:3230287-3799554  11   3230287   3799554

data

structure(list(Group = c("All", "All", "All", "All", "All"), 
    Direction = c("DEL", "DEL", "DEL", "DEL", "DEL"), cytoband = c("11q25", 
    "Xp22.11", "10q23.31", "22q12.3", "11p15.4"), `q value` = c("7.78E-43", 
    "3.01E-38", "3.61E-31", "4.03E-25", "6.59E-25"), `residual q value` = c("2.22E-39", 
    "1.91E-35", "3.61E-31", "3.96E-25", "6.59E-25"), `wide peak boundaries` = c("chr11:130906630-135086622", 
    "chrX:23277186-26139553", "chr10:87745632-87859602", "chr22:33050952-34766503", 
    "chr11:3230287-3799554"), chr = c(11L, NA, 10L, 22L, 11L), 
    start = c(130906630L, NA, 87745632L, 33050952L, 3230287L), 
    end = c(135086622L, NA, 87859602L, 34766503L, 3799554L)), class = "data.frame", row.names = c("V29", 
"V30", "V31", "V32", "V33"))

CodePudding user response:

The idea is to split on : and -, while keeping the literal "chr" prefix out of the chr column. So you could use:

(updated based on comment from @Chris Ruehlemann)

df %>%
  extract("wide peak boundaries",
          into = c("chr", "start", "end"),
          regex = "((?<=chr).*):(.*)-(.*)",
          remove = FALSE)

which gives:

    Group Direction cytoband  q value residual q value      wide peak boundaries chr     start       end
V29   All       DEL    11q25 7.78E-43         2.22E-39 chr11:130906630-135086622  11 130906630 135086622
V30   All       DEL  Xp22.11 3.01E-38         1.91E-35    chrX:23277186-26139553   X  23277186  26139553
V31   All       DEL 10q23.31 3.61E-31         3.61E-31   chr10:87745632-87859602  10  87745632  87859602
V32   All       DEL  22q12.3 4.03E-25         3.96E-25   chr22:33050952-34766503  22  33050952  34766503
V33   All       DEL  11p15.4 6.59E-25         6.59E-25     chr11:3230287-3799554  11   3230287   3799554
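
As a possible simplification (a sketch, not part of the answer above): extract() only returns the capture groups, so the literal chr can sit outside the first group and the lookbehind is not strictly needed. The two-row data frame here is a cut-down stand-in for the question's data, and \\w+ and \\d+ are my own choice of stricter patterns than .*.

```r
library(tidyr)

# Cut-down stand-in for the question's data (assumption: same column name)
df_small <- data.frame(
  `wide peak boundaries` = c("chr11:130906630-135086622",
                             "chrX:23277186-26139553"),
  check.names = FALSE
)

# "chr" is matched but not captured, so only the part after it lands in `chr`
res <- extract(df_small, "wide peak boundaries",
               into = c("chr", "start", "end"),
               regex = "chr(\\w+):(\\d+)-(\\d+)",
               remove = FALSE, convert = TRUE)
res
```

With convert = TRUE the chr column should stay character (because of "X"), while start and end become integers.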

CodePudding user response:

library(data.table)
setDT(df)[, c("chr", "start", "end") := tstrsplit(`wide peak boundaries`, "[:-]", perl = TRUE)]

   Group Direction cytoband  q value residual q value      wide peak boundaries   chr     start       end
1:   All       DEL    11q25 7.78E-43         2.22E-39 chr11:130906630-135086622 chr11 130906630 135086622
2:   All       DEL  Xp22.11 3.01E-38         1.91E-35    chrX:23277186-26139553  chrX  23277186  26139553
3:   All       DEL 10q23.31 3.61E-31         3.61E-31   chr10:87745632-87859602 chr10  87745632  87859602
4:   All       DEL  22q12.3 4.03E-25         3.96E-25   chr22:33050952-34766503 chr22  33050952  34766503
5:   All       DEL  11p15.4 6.59E-25         6.59E-25     chr11:3230287-3799554 chr11   3230287   3799554
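
Note that the chr column in this output still carries the chr prefix. One way to drop it, assuming the prefix is always the literal chr at the start of the string, is to strip it with sub() before splitting. This is only a sketch on a two-row stand-in table (dt is a hypothetical name); type.convert = TRUE is added so that start and end come back as numbers rather than strings.

```r
library(data.table)

# Two-row stand-in for the question's data (assumption: same column name)
dt <- data.table(
  `wide peak boundaries` = c("chr11:130906630-135086622",
                             "chrX:23277186-26139553")
)

# Remove the leading "chr", then split the remainder on ":" or "-"
dt[, c("chr", "start", "end") :=
     tstrsplit(sub("^chr", "", `wide peak boundaries`),
               "[:-]", perl = TRUE, type.convert = TRUE)]
dt
```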

CodePudding user response:

You only need to change \\d in the first capture group to \\w (\\d matches only digits, whereas \\w matches letters, digits, and the underscore):

EDIT: (?<=chr) is a positive lookbehind; it makes sure that \\w+ only starts matching after the string chr has occurred:

df %>% 
  extract(col = 'wide peak boundaries', 
          into = c('chr', 'start', 'end'),
          regex = '((?<=chr)\\w+):(\\d+)-(\\d+)', 
          remove = FALSE, convert = TRUE)
    Group Direction cytoband  q value residual q value      wide peak boundaries chr     start       end
V29   All       DEL    11q25 7.78E-43         2.22E-39 chr11:130906630-135086622  11 130906630 135086622
V30   All       DEL  Xp22.11 3.01E-38         1.91E-35    chrX:23277186-26139553   X  23277186  26139553
V31   All       DEL 10q23.31 3.61E-31         3.61E-31   chr10:87745632-87859602  10  87745632  87859602
V32   All       DEL  22q12.3 4.03E-25         3.96E-25   chr22:33050952-34766503  22  33050952  34766503
V33   All       DEL  11p15.4 6.59E-25         6.59E-25     chr11:3230287-3799554  11   3230287   3799554
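
To see the difference between \\d and \\w in isolation, here is a small base R check on two of the strings from the question (a sketch only; the variable names are made up for illustration). It shows why the chrX row failed with \\d and how the lookbehind picks out just the part after chr.

```r
x <- c("chr11:130906630-135086622", "chrX:23277186-26139553")

# \\d+ cannot match the "X", so the second string does not match
digits_only <- grepl("chr\\d+:", x, perl = TRUE)

# \\w+ matches letters as well as digits, so both strings match
word_chars <- grepl("chr\\w+:", x, perl = TRUE)

# Positive lookbehind: match word characters only where "chr" comes right before
chrom <- regmatches(x, regexpr("(?<=chr)\\w+", x, perl = TRUE))

digits_only  # TRUE FALSE
word_chars   # TRUE TRUE
chrom        # "11" "X"
```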