I'm using R to split a messy string of gene names. As a first step, I'm simply trying to break the string into a list on the spaces between characters using strsplit and a regex, but I keep running into this weird behavior:
string <- ' " "KPNA2" "UBE2C" "CENPF" ## [4] "HMGB2"'
ccGenes <- strsplit(string, split = '\\s+')[[1]]
This returns a length 1 nested list containing an object of type "character [8]" (not sure what type of object this indicates) that places a backslash in front of double quotes (" -> \"). It looks like this when printed:
"" "\"" "\"KPNA2\"" "\"UBE2C\"" "\"CENPF\"" "##" "[4]" "\"HMGB2\""
what I want is a list that looks like this:
" "KPNA2" "UBE2C" "KPNA2" "UBE2C" etc...
Afterwards I will clean up the quotes and non-gene items. I realize this is probably not the most efficient way to clean up this string; I'm still relatively new to programming and am mostly curious why the strsplit line I'm using returns such weird output.
Thanks!
CodePudding user response:
You can use a base R approach with
regmatches(string, gregexpr('(?<=")\\w+(?=")', string, perl=TRUE))[[1]]
# => [1] "KPNA2" "UBE2C" "CENPF" "HMGB2"
Mind the perl=TRUE argument: it is necessary because it enables the PCRE regex syntax that the lookarounds rely on.
Details:
(?<=") - a positive lookbehind that requires a " char to occur immediately to the left of the current position
\w+ - one or more letters, digits or underscores
(?=") - a positive lookahead that requires a " char to occur immediately to the right of the current position.
If you want to avoid matching underscores and lowercase letters, replace \\w with [A-Z0-9].
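As a minimal sketch of that stricter variant, assuming the gene names consist only of uppercase letters and digits (the variable name genes is just illustrative):

```r
# Same sample string as in the question
string <- ' " "KPNA2" "UBE2C" "CENPF" ## [4] "HMGB2"'

# Only runs of uppercase letters/digits enclosed in double quotes are matched;
# perl = TRUE is needed for the lookbehind/lookahead.
genes <- regmatches(string, gregexpr('(?<=")[A-Z0-9]+(?=")', string, perl = TRUE))[[1]]
print(genes)
# [1] "KPNA2" "UBE2C" "CENPF" "HMGB2"
```

The lone " at the start of the string and the [4] index are skipped because neither is an uppercase/digit run with a quote on both sides.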
CodePudding user response:
We may use str_extract_all to extract the alphanumeric characters after the ": match one or more alphanumeric characters ([[:alnum:]]+) that follow the " (checked with a regex lookbehind, (?<=")).
library(stringr)
str_extract_all(string, '(?<=")[[:alnum:]]+')[[1]]
[1] "KPNA2" "UBE2C" "CENPF" "HMGB2"
Also, if we want to use strsplit from base R, split not only on the space (\\s), but also on the double quotes and the other characters that are not needed (#, and the bracketed index):
setdiff(strsplit(string, split = '["# ]+|\\[\\d+\\]')[[1]], "")
[1] "KPNA2" "UBE2C" "CENPF" "HMGB2"
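On the original question of why the strsplit output looks odd, here is a small sketch, assuming the split pattern was meant to be '\\s+'. The backslashes are only how print() escapes a double quote inside a string; they are not stored in the data. The "character [8]" label is most likely just how the environment viewer describes a character vector of length 8. The names parts and genes are illustrative.

```r
string <- ' " "KPNA2" "UBE2C" "CENPF" ## [4] "HMGB2"'

# strsplit() returns a list with one element per input string;
# [[1]] pulls out the character vector for the single input.
parts <- strsplit(string, split = '\\s+')[[1]]

print(parts)       # embedded quotes are displayed escaped, as \"
writeLines(parts)  # raw contents, one per line, with no backslashes
nchar(parts[3])    # 7: the five letters of KPNA2 plus two quote characters

# Clean-up along the lines planned in the question:
# drop the quotes, then keep only uppercase/digit tokens.
unquoted <- gsub('"', '', parts, fixed = TRUE)
genes <- unquoted[grepl('^[A-Z0-9]+$', unquoted)]
print(genes)
# [1] "KPNA2" "UBE2C" "CENPF" "HMGB2"
```

Unlike the setdiff approach, this filter keeps duplicate gene names if the string contains any.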