For my project I am looking into removing parts of text based on the pattern of special characters. I have a long .txt file that has the below structure:
mycharobj=c("---------Some text is here.---------More text is here - [3548]----- Even more text is here.-----------More text is here - [408]--------- Even more text is here again.")
String continues following the above pattern.
My target is to remove parts that start with "-" and end with "- [number]", such as:
"-----------------------More text is here - [3548]"
"-----------More text is here - [408]"
I am planning to use the code below to remove these parts (it will be looped in the future):
library(stringr)
library(qdapRegex)
temp=unlist(regmatches(mycharobj, gregexpr("[[:digit:]]+", mycharobj)))
mycharobj=rm_between(mycharobj, "-", paste(temp[1],"]", sep=""))
but for this to work, I need a regular expression that will remove the first occurrence of "-----------" in the text up to the first word or word character. If a string starts with text (word or word characters), it needs to ignore this and identify the first occurrence of "-----------" for my potential loop to work.
I was wondering if this can be done with regular expressions? Any help is appreciated. I have a very computationally demanding solution for this: split the string on the special character "-" and then identify the parts of the text that I need through a set of conditionals. But because it takes much more processing time, this solution does not scale to processing a large number of such .txt files.
CodePudding user response:
You can use
gsub("-{9,}(?:(?!-{9}).)*?- \\[\\d+]", "", mycharobj, perl=TRUE)
See the regex demo.
Details:
"-{9,}" - nine or more "-" chars
"(?:(?!-{9}).)*?" - any one char, other than a line break char, zero or more but as few as possible occurrences, that does not start a nine hyphen char sequence
"- \[" - a "- [" string
"\d+" - one or more digits
"]" - a "]" char.
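As a quick check, here is a minimal sketch that applies the pattern to the sample string from the question. The helper name remove_tagged_parts is hypothetical and only there for illustration, and the sketch assumes the separators in the real files are runs of at least nine hyphens, as in the sample.

```r
# Sample string copied from the question
mycharobj <- "---------Some text is here.---------More text is here - [3548]----- Even more text is here.-----------More text is here - [408]--------- Even more text is here again."

# Hypothetical helper (the name is an assumption, not part of any package):
# removes every part that starts with 9+ hyphens and ends with "- [digits]",
# without crossing another run of nine hyphens in between.
remove_tagged_parts <- function(x) {
  gsub("-{9,}(?:(?!-{9}).)*?- \\[\\d+]", "", x, perl = TRUE)
}

cleaned <- remove_tagged_parts(mycharobj)
print(cleaned)
# [1] "---------Some text is here.----- Even more text is here.--------- Even more text is here again."
```

The leading "---------Some text is here." part is kept because the negative lookahead stops the match from running across the next nine-hyphen run. Since gsub is vectorised, the same call should also work on a character vector holding the contents of several files, which avoids an explicit loop.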