Here is part of my data frame.
> df
Group Direction cytoband q value residual q value wide peak boundaries
V29 All DEL 11q25 7.78E-43 2.22E-39 chr11:130906630-135086622
V30 All DEL Xp22.11 3.01E-38 1.91E-35 chrX:23277186-26139553
V31 All DEL 10q23.31 3.61E-31 3.61E-31 chr10:87745632-87859602
V32 All DEL 22q12.3 4.03E-25 3.96E-25 chr22:33050952-34766503
V33 All DEL 11p15.4 6.59E-25 6.59E-25 chr11:3230287-3799554
I want to extract the character or number after "chr" in the "wide peak boundaries" column. I tried the code below, but the second row gets NA values.
library(tidyr)
df <- extract(df, 'wide peak boundaries', into = c('chr', 'start', 'end'),
              '(\\d+):(\\d+)-(\\d+)', remove = F, convert = T)
df
Group Direction cytoband q value residual q value wide peak boundaries chr start end
V29 All DEL 11q25 7.78E-43 2.22E-39 chr11:130906630-135086622 11 130906630 135086622
V30 All DEL Xp22.11 3.01E-38 1.91E-35 chrX:23277186-26139553 NA NA NA
V31 All DEL 10q23.31 3.61E-31 3.61E-31 chr10:87745632-87859602 10 87745632 87859602
V32 All DEL 22q12.3 4.03E-25 3.96E-25 chr22:33050952-34766503 22 33050952 34766503
V33 All DEL 11p15.4 6.59E-25 6.59E-25 chr11:3230287-3799554 11 3230287 3799554
data
structure(list(Group = c("All", "All", "All", "All", "All"),
Direction = c("DEL", "DEL", "DEL", "DEL", "DEL"), cytoband = c("11q25",
"Xp22.11", "10q23.31", "22q12.3", "11p15.4"), `q value` = c("7.78E-43",
"3.01E-38", "3.61E-31", "4.03E-25", "6.59E-25"), `residual q value` = c("2.22E-39",
"1.91E-35", "3.61E-31", "3.96E-25", "6.59E-25"), `wide peak boundaries` = c("chr11:130906630-135086622",
"chrX:23277186-26139553", "chr10:87745632-87859602", "chr22:33050952-34766503",
"chr11:3230287-3799554"), chr = c(11L, NA, 10L, 22L, 11L),
start = c(130906630L, NA, 87745632L, 33050952L, 3230287L),
end = c(135086622L, NA, 87859602L, 34766503L, 3799554L)), class = "data.frame", row.names = c("V29",
"V30", "V31", "V32", "V33"))
CodePudding user response:
The idea is to split on : and -, but for the chr column you don't want to extract the "chr" string itself. So you could use:
(updated based on comment from @Chris Ruehlemann)
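library(tidyr)  # provides extract() and re-exports %>%

# The lookbehind (?<=chr) keeps the "chr" prefix out of the first group
df %>%
  extract("wide peak boundaries",
          into = c("chr", "start", "end"),
          regex = "((?<=chr).*):(.*)-(.*)",
          remove = FALSE)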
which gives:
Group Direction cytoband q value residual q value wide peak boundaries chr start end
V29 All DEL 11q25 7.78E-43 2.22E-39 chr11:130906630-135086622 11 130906630 135086622
V30 All DEL Xp22.11 3.01E-38 1.91E-35 chrX:23277186-26139553 X 23277186 26139553
V31 All DEL 10q23.31 3.61E-31 3.61E-31 chr10:87745632-87859602 10 87745632 87859602
V32 All DEL 22q12.3 4.03E-25 3.96E-25 chr22:33050952-34766503 22 33050952 34766503
V33 All DEL 11p15.4 6.59E-25 6.59E-25 chr11:3230287-3799554 11 3230287 3799554
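For comparison, the same split can be sketched in base R without tidyr. This is only an illustration: the vector `peaks` and the result `res` are made-up names, and the pattern assumes every value looks like chrN:start-end.

```r
# Two of the values from the question, used as a small test vector.
peaks <- c("chr11:130906630-135086622", "chrX:23277186-26139553")

# regexec() returns the full match followed by one element per capture group.
m <- regmatches(peaks, regexec("^chr(\\w+):(\\d+)-(\\d+)$", peaks))
parts <- do.call(rbind, m)  # one row per input string, four columns

res <- data.frame(
  chr   = parts[, 2],              # stays character because of "X"
  start = as.integer(parts[, 3]),
  end   = as.integer(parts[, 4]),
  stringsAsFactors = FALSE
)
res
```

Here "chr" is matched outside the capture group, so no lookbehind is needed.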
CodePudding user response:
library(data.table)
# Split on ":" or "-"; note that the "chr" prefix stays in the first piece
setDT(df)[, c("chr", "start", "end") := tstrsplit(`wide peak boundaries`, "[:-]", perl = TRUE)]
Group Direction cytoband q value residual q value wide peak boundaries chr start end
1: All DEL 11q25 7.78E-43 2.22E-39 chr11:130906630-135086622 chr11 130906630 135086622
2: All DEL Xp22.11 3.01E-38 1.91E-35 chrX:23277186-26139553 chrX 23277186 26139553
3: All DEL 10q23.31 3.61E-31 3.61E-31 chr10:87745632-87859602 chr10 87745632 87859602
4: All DEL 22q12.3 4.03E-25 3.96E-25 chr22:33050952-34766503 chr22 33050952 34766503
5: All DEL 11p15.4 6.59E-25 6.59E-25 chr11:3230287-3799554 chr11 3230287 3799554
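The output above still carries the "chr" prefix in the chr column. If the bare chromosome name is wanted, one possible follow-up is to strip the prefix with sub() and convert the coordinates. This is a sketch on a small made-up table `dt`, not the original data:

```r
library(data.table)

# Hypothetical two-row table with the same column name as in the question.
dt <- data.table(`wide peak boundaries` = c("chr11:130906630-135086622",
                                            "chrX:23277186-26139553"))

# Split on ":" or "-" into three character columns.
dt[, c("chr", "start", "end") := tstrsplit(`wide peak boundaries`, "[:-]")]

# Drop the leading "chr" and turn the coordinates into integers.
dt[, chr := sub("^chr", "", chr)]
dt[, c("start", "end") := lapply(.SD, as.integer), .SDcols = c("start", "end")]
dt
```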
CodePudding user response:
You only need to change \\d in the first capture group to \\w (\\d matches only digits, whereas \\w matches letters, digits, and the underscore).
EDIT: (?<=chr) is a positive lookbehind; it makes sure that \\w only starts matching after the string chr has occurred:
df %>%
  extract(col = 'wide peak boundaries',
          into = c('chr', 'start', 'end'),
          regex = '((?<=chr)\\w+):(\\d+)-(\\d+)',
          remove = FALSE, convert = TRUE)
Group Direction cytoband q value residual q value wide peak boundaries chr start end
V29 All DEL 11q25 7.78E-43 2.22E-39 chr11:130906630-135086622 11 130906630 135086622
V30 All DEL Xp22.11 3.01E-38 1.91E-35 chrX:23277186-26139553 X 23277186 26139553
V31 All DEL 10q23.31 3.61E-31 3.61E-31 chr10:87745632-87859602 10 87745632 87859602
V32 All DEL 22q12.3 4.03E-25 3.96E-25 chr22:33050952-34766503 22 33050952 34766503
V33 All DEL 11p15.4 6.59E-25 6.59E-25 chr11:3230287-3799554 11 3230287 3799554
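The difference between \\d and \\w, and the effect of the lookbehind, can be checked in isolation with base R. This is a small sketch; `x` and `chr` are illustrative names only:

```r
x <- c("chr11:130906630-135086622", "chrX:23277186-26139553")

# \\d+ needs digits right after "chr", so the "chrX" row does not match.
digits_only <- grepl("chr\\d+:", x)
# \\w+ also accepts letters, so both rows match.
word_chars <- grepl("chr\\w+:", x)

# Lookbehind needs perl = TRUE in base R; it keeps "chr" out of the match.
chr <- regmatches(x, regexpr("(?<=chr)\\w+", x, perl = TRUE))
chr
```

This is why the original pattern returned NA only for the chrX row.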