Using AutoHotkey, I would like to copy a large text file to the clipboard, extract the text between two repeated words, delete everything else, and paste the parsed text. The file has 80,000 lines of text, and the start and stop words repeat hundreds of times.
Any help would be greatly appreciated!
Input Text Example
Delete this text Delete this text
StartWord
Apples Oranges
Pears Grapes
StopWord
Delete this text Delete this text
StartWord
Peas Carrots
Peas Carrots
StopWord
Delete this text Delete this text
Desired Output Text
Apples Oranges
Pears Grapes
Peas Carrots
Peas Carrots
I think I found a regex statement to extract text between two words, but don't know how to make it work for multiple instances of the start and stop words. Honestly, I can't even get this to work.
!c::
Send, ^c
Fullstring = %clipboard%
RegExMatch(Fullstring, "StartWord *\K.*?(?= *StopWord)", TrimmedResult)
Clipboard := %TrimmedResult%
Send, ^v
return
CodePudding user response:
You can start the match at StartWord, and then match all lines that do not start with either StartWord or StopWord
^StartWord\s*\K(?:\R(?!StartWord|StopWord).*)+
^                          # start of string (start of a line when the multiline flag is set)
StartWord\s*\K             # match StartWord and optional whitespace chars, then forget what is matched so far using \K
(?:                        # non-capture group to repeat as a whole
  \R                       # match a newline
  (?!StartWord|StopWord).* # negative lookahead: assert that the line does not start with StartWord or StopWord, then match the rest of the line
)+                         # close the non-capture group and repeat 1 or more times to match at least a single line
See a regex demo.
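To collect every block rather than only the first, RegExMatch can be called in a loop, passing the position where the previous match ended as the starting position. The following is a minimal sketch in AutoHotkey v1 syntax, not a tested solution. For simplicity it reuses the pattern from the question with the s) (dot-all) option added, rather than the line-anchored pattern above, and the variable names FullString, Result, Match and Pos are only illustrative.

```autohotkey
!c::
    Clipboard := ""                  ; empty the clipboard so ClipWait can detect the copy
    Send, ^c
    ClipWait, 2                      ; wait up to 2 seconds for the copied text
    if ErrorLevel                    ; nothing arrived on the clipboard
        return
    FullString := Clipboard
    Result := ""
    Pos := 1
    ; s) lets . match line breaks; \K drops StartWord from the match;
    ; the lookahead stops the lazy match just before the next StopWord.
    while (Pos := RegExMatch(FullString, "s)StartWord\s*\K.*?(?=\s*StopWord)", Match, Pos))
    {
        Result .= Match . "`r`n"     ; append this block plus a line break
        Pos += StrLen(Match)         ; continue searching after this block
    }
    Clipboard := Result              ; note: an expression assignment takes no percent signs
    Send, ^v
return
```

Each pass resumes the search where the previous block ended, so the same StartWord is never matched twice. The `r`n separator is an assumption about the line endings wanted in the pasted text.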
CodePudding user response:
This is only slightly different than @Thefourthbird's solution.
You can match the following regular expression with the global, multiline and dot-all flags set [1]:
^StartWord\R+\K.*?\R(?=\R*^StopWord\R)
The regular expression can be broken down as follows:
^StartWord     # match 'StartWord' at the beginning of a line
\R+            # match >= 1 line terminators to avoid matching empty lines below
\K             # reset start of match to current location and discard all previously-matched characters
.*?            # match >= 0 characters lazily
\R             # match a line terminator
(?=            # begin a positive lookahead
  \R*          # match >= 0 line terminators to avoid matching empty lines above
  ^StopWord\R  # match 'StopWord' at the beginning of a line followed by a line terminator
)              # end positive lookahead
1. Click on /gms at the link to obtain explanations of the effects of each of the three flags.