Speed up re.sub() on large strings representing large files in python?


Hi, I am running this Python code to reduce repeated multi-line patterns to single occurrences. However, I am doing this on extremely large files of around 200,000 lines.

Here is my current code:

import sys
import re

with open('largefile.txt', 'r+') as file:
    string = file.read()
    # Collapse consecutive repeats of a block of one or more lines to a single copy
    string = re.sub(r"((?:^.*\n)+)(?=\1)", "", string, flags=re.MULTILINE)
    # Write the reduced text back over the original file
    file.seek(0)
    file.write(string)
    file.truncate()

The problem is that the re.sub() call takes ages (10+ minutes) on my large files. Is it possible to speed this up in any way?

Example input file:

hello
mister
hello
mister
goomba
bananas
goomba
bananas
chocolate
hello
mister

Example output:

hello
mister
goomba
bananas
chocolate
hello
mister

These patterns can be bigger than 2 lines as well.
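For reference, here is a minimal, self-contained sketch of the intended transformation, applying the same pattern to the example input as an in-memory string (the string literal below is just the example above, not real data):

import re

# The example input from above, as a single string
text = (
    "hello\nmister\n"
    "hello\nmister\n"
    "goomba\nbananas\n"
    "goomba\nbananas\n"
    "chocolate\n"
    "hello\nmister\n"
)

# Collapse consecutive repeats of a block of one or more lines to a single copy
result = re.sub(r"((?:^.*\n)+)(?=\1)", "", text, flags=re.MULTILINE)
print(result)
# hello
# mister
# goomba
# bananas
# chocolate
# hello
# mister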

CodePudding user response:

Nesting a quantifier within a quantifier is expensive and in this case unnecessary.

You can use the following regex without nesting instead:

string = re.sub(r"(^.*\n)(?=\1)", "", string, flags=re.M | re.S)

In the following test it more than cuts the time in half compared to your approach:

https://replit.com/@blhsing/HugeTrivialExperiment
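If you want to measure this yourself, here is a rough, self-contained benchmark sketch (not the linked replit); the synthetic input, its size, and the repeat counts are arbitrary assumptions, so adjust them to resemble your real data:

import re
import timeit

# Build a synthetic input: consecutive duplicate two-line blocks
# separated by unique, non-repeated lines
parts = []
for i in range(500):
    parts.append(f"hello {i}\nmister {i}\n" * 2)  # duplicated block
    parts.append(f"unique {i}\n")                 # non-repeated line
text = "".join(parts)

nested = re.compile(r"((?:^.*\n)+)(?=\1)", flags=re.MULTILINE)        # original pattern
flat = re.compile(r"(^.*\n)(?=\1)", flags=re.MULTILINE | re.DOTALL)   # answer's pattern

# Both patterns should produce the same de-duplicated output on this input
assert nested.sub("", text) == flat.sub("", text)

print("nested:", timeit.timeit(lambda: nested.sub("", text), number=5))
print("flat:  ", timeit.timeit(lambda: flat.sub("", text), number=5))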
