Find the shortest match between two occurrences of a pattern

I'm using the pattern \\n(((?!\.g).)*?\.vcf\.gz)\\r to match the desired substring in a string. In the following example string, the match is in the middle, surrounded by two \r\n sequences.

"\r\n1115492_23181_0_0.g.vcf.gz.tbi\r\n1115492_23181_0_0.vcf.gz\r\n1115492_23181_0_0.vcf.gz.tbi\r\n..."

Using the pattern above yields the desired string 1115492_23181_0_0.vcf.gz as well as 0.
My question is what would be the proper regular expression to get only the desired string.
Thanks.

CodePudding user response：

The strings you want are whole lines, so match entire lines that do not contain .g anywhere before the .vcf.gz extension:

import re
text = "\r\n1115492_23181_0_0.g.vcf.gz.tbi\r\n1115492_23181_0_0.vcf.gz\r\n1115492_23181_0_0.vcf.gz.tbi\r\n..."
m = re.search(r"^((?:(?!\.g).)*\.vcf\.gz)\r?$", text, re.M)
if m:
    print(m.group(1)) # => 1115492_23181_0_0.vcf.gz


Details:

^ - start of a line
((?:(?!\.g).)*\.vcf\.gz) - Group 1:
- (?:(?!\.g).)* - any char other than a line break char, zero or more occurrences but as many as possible, where no matched char starts a .g char sequence
- \.vcf\.gz - a .vcf.gz string
\r? - an optional CR (carriage return)
$ - end of a line.
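
As for why the original pattern also returned 0: the inner (...) around the lookahead and the dot is a second capturing group, and a repeated group keeps only its last iteration, which here is the final 0 before .vcf.gz. A minimal sketch comparing the two patterns, assuming the doubled backslashes in the question are just string escaping for \n and \r (the variable names are only for illustration):

```python
import re

text = (
    "\r\n1115492_23181_0_0.g.vcf.gz.tbi"
    "\r\n1115492_23181_0_0.vcf.gz"
    "\r\n1115492_23181_0_0.vcf.gz.tbi\r\n..."
)

# Pattern from the question: the inner (...) is a second capturing group.
original = re.compile(r"\n(((?!\.g).)*?\.vcf\.gz)\r")

# Pattern from the answer: the inner group is non-capturing, (?:...).
fixed = re.compile(r"^((?:(?!\.g).)*\.vcf\.gz)\r?$", re.M)

# With two groups, findall returns a tuple per match: (group 1, group 2).
original_groups = original.findall(text)
# With one group, findall returns just that group's text per match.
fixed_groups = fixed.findall(text)

print(original_groups)
print(fixed_groups)
```

Either pattern finds the same line; only the number of capturing groups differs, so reading group 1 alone would also work with the original pattern.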

CodePudding user response：

I have a large log file, and I want to extract a multi-line string between two strings: start and end.

The following is a sample from the input file:

start spam start rubbish start wait for it... profit! here end start garbage start second match win. end

The desired solution should print:

start wait for it... profit! here end start second match win. end

I tried a simple regex, but it returned everything from start spam. How should this be done?
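
One possible approach, sketched here rather than taken from the thread, is the same lookahead trick used in the answer above: after start, allow only characters that do not begin another start, then stop at the first end. The sample is written on one line as it appears here; re.S is set on the assumption that the real input spans several lines.

```python
import re

sample = (
    "start spam start rubbish start wait for it... profit! here end "
    "start garbage start second match win. end"
)

# After "start", consume chars lazily, refusing any position where
# another "start" begins, until the first "end" is reached.
innermost = re.compile(r"start(?:(?!start).)*?end", re.S)

# No capturing groups, so findall returns the full text of each match.
matches = innermost.findall(sample)
for block in matches:
    print(block)
```

A failed attempt at one start makes the regex engine move on to the next candidate, so only the start closest to each end produces a match.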

Edit: Additional info on real-life computational complexity:

actual file size: 2GB
occurrences of 'start': ~12 M, evenly distributed
occurrences of 'end': ~800, near the end of the file
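
With that many start markers and so few end markers, a regex has to attempt and abandon a match at almost every start. A sketch of an alternative that avoids this, assuming plain substring markers: look for each end first, then search backwards for the nearest start before it. The function name innermost_blocks is made up for this example.

```python
def innermost_blocks(data, start="start", end="end"):
    """Yield each shortest start...end block found in data."""
    pos = 0
    while True:
        # Find the next end marker at or after the current position.
        end_at = data.find(end, pos)
        if end_at == -1:
            return
        # Nearest start marker lying entirely before this end marker
        # and not earlier than the end of the previous block.
        start_at = data.rfind(start, pos, end_at)
        if start_at != -1:
            yield data[start_at:end_at + len(end)]
        pos = end_at + len(end)


sample = (
    "start spam start rubbish start wait for it... profit! here end "
    "start garbage start second match win. end"
)
blocks = list(innermost_blocks(sample))
for block in blocks:
    print(block)
```

The forward find passes over the data once, and each rfind only walks back as far as the closest start. For a 2GB file, the same find and rfind calls are also available on bytes and on mmap objects (with bytes markers), which would avoid reading the whole file into a str; that part is untested here.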