Hi, I am running this Python code to reduce multi-line patterns to singletons; however, I am doing this on extremely large files of 200,000 lines.
Here is my current code:
import sys
import re
with open('largefile.txt', 'r+') as file:
    string = file.read()
    string = re.sub(r"((?:^.*\n)+)(?=\1)", "", string, flags=re.MULTILINE)
    file.seek(0)
    file.write(string)
    file.truncate()
The problem is that the re.sub() call is taking ages (10m+) on my large files. Is it possible to speed this up in any way?
Example input file:
hello
mister
hello
mister
goomba
bananas
goomba
bananas
chocolate
hello
mister
Example output:
hello
mister
goomba
bananas
chocolate
hello
mister
These patterns can be bigger than 2 lines as well.
CodePudding user response:
Nesting a quantifier within a quantifier is expensive and, in this case, unnecessary.
You can use the following regex, without nesting, instead:
string = re.sub(r"(^.*\n)(?=\1)", "", string, flags=re.M | re.S)
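A quick sanity check on the example input from the question shows the flat regex produces the expected output (the literal string below is just the question's example data):

```python
import re

# Example input from the question: consecutive repeated blocks of lines.
text = (
    "hello\nmister\nhello\nmister\n"
    "goomba\nbananas\ngoomba\nbananas\n"
    "chocolate\nhello\nmister\n"
)

# With re.S, "." also matches newlines, so (^.*\n) can capture a block of
# several whole lines; the lookahead removes the block when it is
# immediately followed by an identical copy of itself.
deduped = re.sub(r"(^.*\n)(?=\1)", "", text, flags=re.M | re.S)
print(deduped)
# hello
# mister
# goomba
# bananas
# chocolate
# hello
# mister
```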
In the following test it more than cuts the time in half compared to your approach:
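A minimal timing harness along these lines (the synthetic input, its size, and the line contents below are illustrative, not taken from the original post):

```python
import re
import time

# Illustrative synthetic input: 500 two-line blocks, each duplicated once,
# separated by a unique line so blocks never repeat across boundaries.
lines = []
for i in range(500):
    block = [f"alpha-{i}", f"beta-{i}"]
    lines += block * 2            # each block appears twice in a row
    lines.append(f"unique-{i}")
text = "".join(line + "\n" for line in lines)

nested = re.compile(r"((?:^.*\n)+)(?=\1)", flags=re.M)    # question's approach
flat = re.compile(r"(^.*\n)(?=\1)", flags=re.M | re.S)    # answer's approach

results = {}
for name, pattern in [("nested", nested), ("flat", flat)]:
    start = time.perf_counter()
    results[name] = pattern.sub("", text)
    print(f"{name}: {time.perf_counter() - start:.4f}s")

# Both regexes should yield the same deduplicated text.
assert results["nested"] == results["flat"]
```

Both patterns backtrack over whole-line boundaries, longest match first, so they remove the same blocks; only the amount of backtracking work differs.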