I'm using the pattern \\n(((?!\.g).)*?\.vcf\.gz)\\r to match the desired substring in a string. In the following example string, the match is in the middle of the string, surrounded by two \r\n sequences.
"\r\n1115492_23181_0_0.g.vcf.gz.tbi\r\n1115492_23181_0_0.vcf.gz\r\n1115492_23181_0_0.vcf.gz.tbi\r\n..."
Using the pattern above yields the desired string 1115492_23181_0_0.vcf.gz as well as 0.
My question is: what would be the proper regular expression to get only the desired string?
Thanks.
CodePudding user response:
Your matches are whole lines, so match entire lines that do not contain .g anywhere before the .vcf.gz extension:
import re
text = "\r\n1115492_23181_0_0.g.vcf.gz.tbi\r\n1115492_23181_0_0.vcf.gz\r\n1115492_23181_0_0.vcf.gz.tbi\r\n..."
m = re.search(r"^((?:(?!\.g).)*\.vcf\.gz)\r?$", text, re.M)
if m:
    print(m.group(1)) # => 1115492_23181_0_0.vcf.gz
Details:
^ - start of a line
((?:(?!\.g).)*\.vcf\.gz) - Group 1:
    (?:(?!\.g).)* - any char other than line break chars, zero or more but as many as possible occurrences, that does not start a .g char sequence
    \.vcf\.gz - a .vcf.gz string
\r? - an optional CR (carriage return)
$ - end of a line.
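To see where the stray 0 in the question comes from, here is a small sketch. It assumes the original pattern was run with re.findall (the question does not say which function was used) and that the doubled backslashes are string escapes for \n and \r. With two capturing groups, findall returns one tuple per match, and the inner group keeps only the last character it matched. Making the inner group non-capturing leaves a single group, so only the file name comes back.

```python
import re

text = ("\r\n1115492_23181_0_0.g.vcf.gz.tbi"
        "\r\n1115492_23181_0_0.vcf.gz"
        "\r\n1115492_23181_0_0.vcf.gz.tbi\r\n...")

# Pattern from the question, written as a raw string (assumed equivalent form).
original = re.compile(r"\n(((?!\.g).)*?\.vcf\.gz)\r")
# Same idea with a non-capturing inner group and line anchors.
fixed = re.compile(r"^((?:(?!\.g).)*\.vcf\.gz)\r?$", re.M)

original_hits = original.findall(text)
fixed_hits = fixed.findall(text)

print(original_hits)  # [('1115492_23181_0_0.vcf.gz', '0')]
print(fixed_hits)     # ['1115492_23181_0_0.vcf.gz']
```

Using findall with the fixed pattern also returns every matching line if the listing contains more than one, whereas re.search stops at the first.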
CodePudding user response:
I have a large log file, and I want to extract a multi-line string between two strings: start and end.
The following is a sample from the input file:

start spam
start rubbish
start wait for it...
profit!
here end
start garbage
start second match
win. end

The desired solution should print:

start wait for it...
profit!
here end
start second match
win. end

I tried a simple regex, but it returned everything from start spam. How should this be done?
Edit: Additional info on real-life computational complexity:
actual file size: 2GB
occurrences of 'start': ~12 M, evenly distributed
occurrences of 'end': ~800, near the end of the file.
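One possible approach, offered as a sketch rather than a solution tested on the 2GB file: a tempered pattern that refuses to cross another start, so each match begins at the last start before an end. The function name shortest_blocks is hypothetical, and the line breaks in the sample text are an assumption about how the input is laid out.

```python
import re

# "start", then as few characters as possible (newlines included, via re.S)
# that do not begin another "start", up to the next "end".
BLOCK = re.compile(r"start(?:(?!start).)*?end", re.S)


def shortest_blocks(text):
    """Return every start...end span that contains no inner 'start'."""
    return BLOCK.findall(text)


# Sample input; the exact line breaks are assumed for illustration.
sample = (
    "start spam\n"
    "start rubbish\n"
    "start wait for it...\n"
    "profit!\n"
    "here end\n"
    "start garbage\n"
    "start second match\n"
    "win. end\n"
)

for block in shortest_blocks(sample):
    print(block)
```

This version works on a single in-memory string. For a file as large as the one described, memory-mapping it and compiling the pattern from a bytes literal may be worth trying, though that variant is not shown here and its performance is untested.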