Remove first occurrence of special characters until the first word or word character in R using rege

Time:07-28

For my project I am looking into removing parts of text based on the pattern of special characters. I have a long .txt file that has the below structure:

mycharobj=c("---------Some text is here.---------More text is here - [3548]----- Even more text is here.-----------More text is here - [408]--------- Even more text is here again.")

The string continues in the same pattern.

My goal is to remove the parts that start with a run of - and end with - [number], such as:

"-----------------------More text is here - [3548]"
"-----------More text is here - [408]"

I am planning to use the code below to remove these parts (it will be put in a loop later):

library(stringr)
library(qdapRegex)

temp=unlist(regmatches(mycharobj, gregexpr("[[:digit:]]+", mycharobj)))
mycharobj=rm_between(mycharobj, "-", paste(temp[1],"]", sep=""))

For this to work, though, I need a regular expression that removes the first occurrence of "-----------" in the text, up to the first word character. If the string starts with text (word characters), the regex needs to skip that text and find the first occurrence of "-----------", so that my potential loop works.

Can this be done with regular expressions? Any help is appreciated. My current solution is computationally expensive: I split the string on the special character "-" and then pick out the parts of the text I need through a set of conditionals. Because this takes much more processing time, it does not scale to a large number of such .txt files.

CodePudding user response:

You can use

gsub("-{9,}(?:(?!-{9}).)*?- \\[\\d+]", "", mycharobj, perl=TRUE)

Details:

  • -{9,} - nine or more - chars
  • (?:(?!-{9}).)*? - any char other than a line break char, zero or more times but as few as possible, where each matched char must not be the start of a run of nine hyphens
  • - \[ - a - [ string
  • \d+ - one or more digits
  • ] - a ] char.
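
To see what the pattern does, here is a minimal sketch in base R. It assumes the hyphen runs in the sample are 9, 9, 5, 11 and 9 characters long, and builds the string with strrep() so those counts are explicit. The names dash, x, pattern and cleaned are only for illustration.

```r
# Helper (illustrative): a run of n hyphens.
dash <- function(n) strrep("-", n)

# A string shaped like the sample in the question (run lengths are assumed).
x <- paste0(dash(9),  "Some text is here.",
            dash(9),  "More text is here - [3548]",
            dash(5),  " Even more text is here.",
            dash(11), "More text is here - [408]",
            dash(9),  " Even more text is here again.")

# Nine or more hyphens, then the shortest stretch that does not run into
# another nine-hyphen sequence, then "- [digits]". perl = TRUE is needed
# because the pattern uses a lookahead.
pattern <- "-{9,}(?:(?!-{9}).)*?- \\[\\d+]"

cleaned <- gsub(pattern, "", x, perl = TRUE)
print(cleaned)
```

With this input, the two "More text is here - [number]" chunks are removed together with the hyphen runs in front of them. The leading "---------Some text is here." is kept, because the lookahead stops the match from running across the next nine-hyphen sequence to reach a "- [number]" further on. If the real files use shorter separators, the 9 in both places would need adjusting.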