AutoHotkey: How to extract text between two words with multiple occurrences in a large text document

Time:03-30

Using AutoHotkey, I would like to copy a large text file to the clipboard, extract the text between two repeated words, delete everything else, and paste the parsed text. The file has about 80,000 lines, and the start and stop words repeat hundreds of times.

Any help would be greatly appreciated!

Input Text Example

Delete this text Delete this text

StartWord

Apples Oranges
Pears Grapes

StopWord

Delete this text Delete this text

StartWord

Peas Carrots
Peas Carrots

StopWord

Delete this text Delete this text

Desired Output Text

Apples Oranges
Pears Grapes

Peas Carrots
Peas Carrots

I think I found a regex statement to extract text between two words, but don't know how to make it work for multiple instances of the start and stop words. Honestly, I can't even get this to work.

!c::
Send, ^c
Fullstring = %clipboard%
RegExMatch(Fullstring, "StartWord *\K.*?(?= *StopWord)", TrimmedResult)
Clipboard := %TrimmedResult%
Send, ^v
return

CodePudding user response:

You can start the match at StartWord, and then match all following lines that do not start with either StartWord or StopWord:

^StartWord\s*\K(?:\R(?!StartWord|StopWord).*)+
  • ^ Start of a line (with the multiline flag enabled)
  • StartWord\s*\K Match StartWord and optional whitespace chars, then forget what is matched so far using \K
  • (?: Non capture group to repeat as a whole
    • \R Match a newline
    • (?!StartWord|StopWord).* Negative lookahead, assert that the line does not start with StartWord or StopWord, then match the rest of the line
  • )+ Close the non capture group and repeat 1 or more times to match at least a single line

See a regex demo.
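
To collect every occurrence in AutoHotkey, one option is to call RegExMatch in a loop and move the start position past each match. The sketch below assumes AutoHotkey v1.1 syntax, as in the question, and uses the pattern with the group repeated one or more times. ExtractBlocks is a hypothetical helper name, and the m) option and (*ANYCRLF) are additions made here so that ^ matches at the start of every line regardless of the line-ending style. Note also that := takes an expression, so the variable is written without percent signs. It has not been tested against an 80,000-line file.

```autohotkey
; AutoHotkey v1.1 sketch. ExtractBlocks is a hypothetical helper, not a built-in.
!c::
    Clipboard := ""            ; empty the clipboard so ClipWait can detect the copy
    Send, ^c
    ClipWait, 2                ; wait up to 2 seconds for the copied text
    if ErrorLevel              ; nothing arrived: leave everything untouched
        return
    Clipboard := ExtractBlocks(Clipboard)
    Send, ^v
return

ExtractBlocks(text) {
    ; m) lets ^ match at each line start; (*ANYCRLF) accepts CR, LF, or CRLF
    pattern := "m)(*ANYCRLF)^StartWord\s*\K(?:\R(?!StartWord|StopWord).*)+"
    result := ""
    pos := 1
    ; RegExMatch returns 0 when no further match exists, which ends the loop
    while (pos := RegExMatch(text, pattern, block, pos)) {
        ; strip surrounding line breaks and separate blocks with one blank line
        result .= Trim(block, "`r`n") . "`r`n`r`n"
        pos += StrLen(block)   ; resume the search after this block
    }
    return RTrim(result, "`r`n")
}
```

This pattern stops a block at the next line beginning with StartWord or StopWord, so a StartWord with no matching StopWord would still yield a block that runs to the next StartWord or to the end of the text.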

CodePudding user response:

This is only slightly different than @Thefourthbird's solution.

You can match the following regular expression with the global, multiline and dot-all flags set (see footnote 1):

^StartWord\R+\K.*?\R(?=\R*^StopWord\R)

Demo

The regular expression can be broken down as follows:

^StartWord     # match 'StartWord' at the beginning of a line
\R+            # match >= 1 line terminators to avoid matching empty lines
               # below
\K             # reset start of match to current location and discard
               # all previously-matched characters
.*?            # match >= 0 characters lazily
\R             # match a line terminator
(?=            # begin a positive lookahead
  \R*          # match >= 0 line terminators to avoid matching empty lines
               # above
  ^StopWord\R  # Match 'StopWord' at the beginning of a line followed
               # by a line terminator
)              # end positive lookahead

1. Click on /gms at the link to obtain explanations of the effects of each of the three flags.
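
In AutoHotkey the multiline and dot-all flags are written as the m and s options in front of the pattern, while the global flag has no direct equivalent for RegExMatch, so a loop takes its place. The following is a minimal sketch under AutoHotkey v1.1 syntax. CollectBlocks is a hypothetical name, (*ANYCRLF) is an addition made here to cope with mixed line endings, and the pattern uses \R+ after StartWord to match one or more line terminators.

```autohotkey
; AutoHotkey v1.1 sketch. CollectBlocks is a hypothetical helper, not a built-in.
CollectBlocks(text) {
    ; The pattern needs a line terminator after the final StopWord,
    ; so add one in case the text ends right after it.
    text .= "`n"
    ; m) = multiline (^ matches at line starts), s) = dot-all (. matches newlines)
    pattern := "ms)(*ANYCRLF)^StartWord\R+\K.*?\R(?=\R*^StopWord\R)"
    result := ""
    pos := 1
    ; The loop stands in for the global flag: search again after each match
    while (pos := RegExMatch(text, pattern, block, pos)) {
        result .= RTrim(block, "`r`n") . "`r`n`r`n"
        pos += StrLen(block)
    }
    return RTrim(result, "`r`n")
}
```

Calling `Clipboard := CollectBlocks(Clipboard)` would then leave only the extracted blocks on the clipboard, separated by blank lines.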
