Find the shortest match between two occurrences of a pattern

Time:12-09

I'm using the pattern \\n(((?!\.g).)*?\.vcf\.gz)\\r to match the desired substring in a string. In the following example string, the match is in the middle of the string, enclosed by two \r\n sequences.

"\r\n1115492_23181_0_0.g.vcf.gz.tbi\r\n1115492_23181_0_0.vcf.gz\r\n1115492_23181_0_0.vcf.gz.tbi\r\n..."

Using the pattern above yields the desired string 1115492_23181_0_0.vcf.gz as well as 0.
My question is: what would be the proper regular expression to get only the desired string?
Thanks.

CodePudding user response:

Your expected matches are whole lines, so match entire lines that do not contain .g anywhere before the .vcf.gz extension:

import re
text = "\r\n1115492_23181_0_0.g.vcf.gz.tbi\r\n1115492_23181_0_0.vcf.gz\r\n1115492_23181_0_0.vcf.gz.tbi\r\n..."
m = re.search(r"^((?:(?!\.g).)*\.vcf\.gz)\r?$", text, re.M)
if m:
    print(m.group(1)) # => 1115492_23181_0_0.vcf.gz


Details:

  • ^ - start of a line
  • ((?:(?!\.g).)*\.vcf\.gz) - Group 1:
    • (?:(?!\.g).)* - any char other than line break chars, zero or more but as many as possible occurrences, that does not start a .g char sequence
    • \.vcf\.gz - a .vcf.gz string
  • \r? - an optional CR (carriage return)
  • $ - end of a line.
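
As for the extra 0 in the question, a likely explanation is that the inner parentheses in the original pattern form a second capturing group, which keeps the last character it matched. The sketch below assumes the pattern was run with re.findall (the question does not say which function was used) and compares it with a variant where the inner group is non-capturing:

```python
import re

text = "\r\n1115492_23181_0_0.g.vcf.gz.tbi\r\n1115492_23181_0_0.vcf.gz\r\n1115492_23181_0_0.vcf.gz.tbi\r\n..."

# Original pattern: the inner (...) around the tempered dot is a second
# capturing group, so findall returns a (group 1, group 2) tuple per match.
original = re.findall(r"\n(((?!\.g).)*?\.vcf\.gz)\r", text)

# Same pattern with the inner group made non-capturing via (?:...):
# only group 1 is left, so findall returns plain strings.
fixed = re.findall(r"\n((?:(?!\.g).)*?\.vcf\.gz)\r", text)

print(original)
print(fixed)
```

With re.search the extra group is harmless as long as only m.group(1) is read; it only becomes visible when all groups are returned at once.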

CodePudding user response:

I have a large log file, and I want to extract a multi-line string between two strings: start and end.

The following is a sample from the input file:

start spam start rubbish start wait for it... profit! here end start garbage start second match win. end

The desired solution should print:

start wait for it... profit! here end start second match win. end

I tried a simple regex, but it returned everything from start spam. How should this be done?

Edit: Additional info on real-life computational complexity:

  • actual file size: 2GB
  • occurrences of 'start': ~12 M, evenly distributed
  • occurrences of 'end': ~800, near the end of the file
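
One possible approach, offered as a sketch rather than a tuned solution, is the same tempered-dot idea used in the answer above: after start, consume characters only while they do not begin another start, and stop at the first end. The sample text is joined into a single string here because its original line breaks are not known; re.S lets the dot cross newlines in real multi-line input.

```python
import re

# Sample from the question, kept on one line (original line breaks unknown).
sample = (
    "start spam start rubbish start wait for it... profit! here end "
    "start garbage start second match win. end"
)

# 'start', then lazily any characters that do not begin another 'start',
# up to the nearest 'end'. re.S makes '.' match newlines as well.
pattern = re.compile(r"start(?:(?!start).)*?end", re.S)

matches = pattern.findall(sample)
for m in matches:
    print(m)
```

Each failed attempt stops at the next start, so the scan should stay roughly linear in the input size. For a 2GB file, reading everything into a str may not be practical; applying the equivalent bytes pattern to a memory-mapped file is one option, but that is an assumption not tested here.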
