How can I find and replace text in a larger file (150MB-250MB) with regular expressions in C#?

Time:11-03

I am working with files that range between 150MB and 250MB, and I need to append a form feed (\f) character to each match found in a match collection. Currently, my regular expression for each match is this:

Regex myreg = new Regex("ABC: DEF11-1111(.*?)MORE DATA(.*?)EVEN MORE DATA(.*?)\f", RegexOptions.Singleline);

and I'd like to modify each match in the file (and then overwrite the file) to become something that could be later found with a shorter regular expression:

Regex myreg = new Regex("ABC: DEF11-1111(.*?)\f\f", RegexOptions.Singleline);

Put another way, I want to simply append a form feed character (\f) to each match that is found in my file and save it.

I see a ton of examples on Stack Overflow for replacing text, but not so many for larger files. Typical suggestions include:

  • Using StreamReader to read the entire file into a string, then doing a find and replace in that string.
  • Using MatchCollection in combination with File.ReadAllText()
  • Reading the file line by line and looking for matches there.

The problem with the first two is that they eat up a ton of memory, and I worry about the program being able to handle all of that. The problem with the third option is that my regular expression spans many lines, and thus will not be found in a single line. I see other posts out there as well, but they cover replacing specific strings of text rather than working with regular expressions.

What would be a good approach for me to append a form feed character to each match found in a file, and then save that file?

Edit:

Per some suggestions, I tried playing around with StreamReader.ReadLine(). Specifically, I would read a line, see if it matched my expression, and then based on that result I would write to a file. If it matched the expression, I would write to the file. If it didn't match the expression, I would just append it to a string until it did match the expression. Like this:

Regex myreg = new Regex("ABC: DEF11-1111(.*?)MORE DATA(.*?)EVEN MORE DATA(.*?)\f", RegexOptions.Singleline);

//For storing/comparing our match.
string line, buildingmatch, match, whatremains;
buildingmatch = "";
match = "";
whatremains = "";

//For keeping track of trailing bits after our match.
int matchlength = 0;

using (StreamWriter sw = new StreamWriter(destFile))
using (StreamReader sr = new StreamReader(srcFile))
{
    //While we are still reading lines in the file...
    while ((line = sr.ReadLine()) != null)
    {
        //Keep adding lines to buildingmatch until we can match the regular expression.
        buildingmatch = buildingmatch + line + "\r\n";
        if (myreg.IsMatch(buildingmatch))
        {
            match = myreg.Match(buildingmatch).Value;
            matchlength = match.Length;
            
            //Make sure we are not at the end of the file.
            if (matchlength < buildingmatch.Length)
            {
                whatremains = buildingmatch.Substring(matchlength, buildingmatch.Length - matchlength);
            }
            
            sw.Write(match + "\f\f");
            buildingmatch = whatremains;
            whatremains = "";
        }
    }
}

The problem is that this took about 55 minutes to process a roughly 150MB file. There HAS to be a better way to do this...

CodePudding user response:

I was able to find a solution that works in a reasonable time; it can process my entire 150MB file in under 5 minutes.

First, as mentioned in the comments, it's a waste to compare the string to the Regex after every iteration. Rather, I started with this:

string match = File.ReadAllText(srcFile);
MatchCollection mymatches = myregex.Matches(match);

Strings can hold up to 2GB of data, so while not ideal, I figured roughly 150MB worth wouldn't hurt to be stored in a string. Then, as opposed to checking for a match after every line read in from the file, I can check the whole file for matches all at once!

Next, I used this:

StringBuilder matchsb = new StringBuilder(134217728);
foreach (Match m in mymatches)
{
     matchsb.Append(m.Value).Append("\f\f");
}

Since I already know (roughly) the size of my file, I can go ahead and initialize my StringBuilder with that capacity. Not to mention, it's a lot more efficient to use StringBuilder if you are doing multiple operations on a string (which I was). From there, it's just a matter of appending the form feed to each of my matches.
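A minimal sketch of the loop above on an in-memory string, in case it helps to see it in isolation. The class and method names (MatchCollector, CollectMatches) are made up for illustration and are not part of the original code. Note that this approach copies only the matched text; anything between matches is left out of the result.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class MatchCollector
{
    // Hypothetical helper (name is an assumption): concatenates every match
    // of `pattern` in `input`, each followed by `suffix`.
    // Text that lies outside the matches is NOT copied to the result.
    public static string CollectMatches(string input, Regex pattern, string suffix)
    {
        // Use the input length as a starting capacity so the builder rarely
        // has to grow; it still grows automatically if the output is longer.
        var sb = new StringBuilder(input.Length);

        foreach (Match m in pattern.Matches(input))
        {
            // Two Append calls avoid building a temporary string per match.
            sb.Append(m.Value).Append(suffix);
        }

        return sb.ToString();
    }
}
```

For example, calling MatchCollector.CollectMatches(text, myregex, "\f\f") on the file contents should produce the same string that the loop above builds in matchsb.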

Finally, the part that cost the most in performance:

using (StreamWriter sw = new StreamWriter(destfile, false, Encoding.UTF8, 5242880))
{
     sw.Write(matchsb.ToString());
}

The way that you initialize StreamWriter is critical. Normally, you just declare it as:

StreamWriter sw = new StreamWriter(destfile);

This is fine for most use cases, but the problem becomes apparent when you are dealing with larger files. When declared like this, you are writing to the file with a default buffer of 4KB. For a smaller file, this is fine. But for 150MB files? This will end up taking a long time. So I corrected the issue by changing the buffer to approximately 5MB.
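As a rough sketch of the same idea wrapped in a reusable method: the names BufferedTextWriter and WriteAll are assumptions made for this example, and the best buffer size will depend on the machine and disk, so it is left as a parameter rather than hard-coded.

```csharp
using System;
using System.IO;
using System.Text;

public static class BufferedTextWriter
{
    // Hypothetical helper (name is an assumption): writes `content` to `path`,
    // overwriting any existing file, using an explicit buffer size.
    public static void WriteAll(string path, string content, int bufferSize)
    {
        // StreamWriter(path, append, encoding, bufferSize):
        // append = false replaces the file instead of adding to it.
        using (var sw = new StreamWriter(path, false, Encoding.UTF8, bufferSize))
        {
            sw.Write(content);
        } // Dispose flushes the remaining buffered data and closes the file.
    }
}
```

With this in place, the write step would look like BufferedTextWriter.WriteAll(destfile, matchsb.ToString(), 5242880). Timing a few different buffer sizes on a real file is the only reliable way to tell which one pays off.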

I found this resource really helped me to understand how to write to files more efficiently: https://www.jeremyshanks.com/fastest-way-to-write-text-files-to-disk-in-c/

Hopefully this will help the next person along as well.

CodePudding user response:

If you can load the whole string data into a single string variable, there is no need to first match and then append text to matches in a loop. You can use a single Regex.Replace operation:

string text = File.ReadAllText(srcFile);
using (StreamWriter sw = new StreamWriter(destfile, false, Encoding.UTF8))
{
     sw.Write(myregex.Replace(text, "$&\f\f"));
}

Details:

  • string text = File.ReadAllText(srcFile); - reads the srcFile file into the text variable (naming it match, as in the other answer, would be confusing)
  • myregex.Replace(text, "$&\f\f") - replaces all occurrences of myregex matches with themselves ($& is a backreference to the whole match value) while appending two \f chars right after each match.
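To see what the $& substitution does on a small input, here is a sketch that runs entirely in memory. The names FormFeedAppender and AppendToMatches are invented for this example, and it assumes the suffix contains no $ character, since $ has a special meaning in replacement patterns.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FormFeedAppender
{
    // Hypothetical helper (name is an assumption): returns `text` with
    // `suffix` inserted right after every match of `pattern`.
    // Assumes `suffix` contains no '$', which Replace would treat specially.
    public static string AppendToMatches(string text, Regex pattern, string suffix)
    {
        // "$&" stands for the entire matched text, so each match is replaced
        // by itself plus the suffix. Unmatched text is kept unchanged.
        return pattern.Replace(text, "$&" + suffix);
    }
}
```

Unlike collecting the matches into a StringBuilder, this keeps the text between matches, so the output is the whole input with the extra characters added.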