Home > OS >  Debugging .NET Memory leak calling StringBuilder.ToString() inside while loop
Debugging .NET Memory leak calling StringBuilder.ToString() inside while loop

Time:11-09

Background:

I am downloading a large (>500mb) text file which contains lots of SQL statements which I need to run across a database. To achieve this, I am processing the file line by line until I find a full query and then execute it. When running this application, the logic inside the while loop uses more memory than anticipated.

I have removed the code which is running the query against the database - after debugging it doesn't seem the be what is causing the issue.

Code:

Below is some example code to demonstrate - obviously this is not the full program, but this is where I've narrowed the problem down to. Note that sr is a StreamReader which has been initialised to read from my MemoryStream.

StringBuilder query = new StringBuilder();

while (!sr.EndOfStream)
{
    query.AppendLine(await sr.ReadLineAsync());

    string currentQueryString = query.ToString();

    if (currentQueryString.EndsWith($"{Environment.NewLine}GO{Environment.NewLine}"))
    {
        // Run query against database

        // Clean up StringBuilder so it can be used again
        query = new StringBuilder();
        currentQueryString = "";
    }
}

For this example, let's say that every new line in the file could be between 1 and 300 characters long. Also 99% of the queries are INSERT statements containing 1,000 records (each record on a new line).

When I run the application:

I can see in my Windows Task Manager that as the application runs, the memory allocated to the app increases what looks like almost every iteration of the while loop. I placed a break point on currentQueryString = ""; and every time it gets hit (knowing that I've just read another 1,000 lines of the file into memory) I can see that memory used by the application increases (this time from using the Diagnostic Tools inside Visual Studio) anywhere from 100mb to 200mb roughly, however from taking snapshots each time the breakpoint is hit I can see that the Heap Size is barely changing, maybe a few hundred kb either way.

What I think is causing the issue:

My best guess at the moment is that the string currentQueryString = query.ToString(); line is somehow initialising a variable possibly in unmanaged memory which is not being released. One reason for this is I tested with the following code which removes calling toString() on the StringBuilder and the memory usage is drastically lower as it only increases by about 1-2mb or so for every 1,000 lines processed:

while (timer.Elapsed.TotalMinutes < 14 && !sr.EndOfStream && !killSwitch)
{
    query.AppendLine(await sr.ReadLineAsync());

    currentExeQueryCounter  = 1;

    if (currentExeQueryCounter > 1000)
    {
        query = new StringBuilder();
        currentExeQueryCounter = 0;
    }
}

For debugging purposes only I added in GC.Collect() underneath currentQueryString = ""; in the first code snippet which completely resolved the issue (observed in both Visual Studio Diagnostic Tools and Task Manager) and I am trying to understand why this is and how I can best address this issue as I aim to run this as a serverless application which will be allocated a limited amount of memory.

CodePudding user response:

Just increasing memory usage does not indicate a memory leak, the Garbage collector will run according to its own rules, for example when there is not sufficient memory. If inserting a GC.Collect resolves it there was probably never a leak to begin with.

Whenever there is a potential memory problem I would recommend using a memory profiler. This should allow you to trigger GCs at will, collect snapshots of all the allocated objects, and compare them to see if some kind of object count is steadily increasing.

That said, I would suggest changing query = new StringBuilder(); to query.Clear(). No need to re-allocate a bunch of memory when you already have a buffer available.

You could perhaps further reduce allocation rate by using Span<char>/Memory<char> as much as possible rather than strings. This should let you refer to a specific sequence of characters within a larger buffer without doing any copying or allocation at all. This was a major reason for Span<>, since copying strings is a bit inefficient when doing lots of xml/json deserialization & html processing.

CodePudding user response:

To add to the very sensible answer from JonasH: The call to query.ToString() is probably costing you quite a bit. It is also an unnecessarily convoluted way to check for a line that says "GO". If you instead just compare the most recently read line to "GO", you can cut down on the ToString() calls. For example, something like this:

string line = await sr.ReadLineAsync();
query.AppendLine(line);

if (line == "GO")
{
    string currentQueryString = query.ToString();

    // Run query against database

    query.Clear(); // Clean up StringBuilder so it can be used again
}

CodePudding user response:

If you draw on a piece of paper how memory allocations occure for some execution cycles:

1. query sb initial allocation
2. ReadLineAsync retval allocation
3. ReadLineAsync retval marked free
4. currentQueryString allocation
5. $"..." allocation (maybe compiler optimize this out from cycle)
6. new query sb initial allocation
7. "old" query sb marked free
8. new currentQueryString allocation
9. "old" currentQueryString marked free
99. (here starts the Nth cycle)
100. new ReadLineAsync retval allocation
101. at some point query sb preallocation become too small, so here comes a bigger query sb allocation and the "old" area marked free
102. new ReadLineAsync retval marked free
103. new currentQueryString allocation
104. $"..." allocation (maybe compiler optimize this out from cycle)
105. new query sb initial allocation
106. "old" query sb marked free
107. new currentQueryString allocation
108. "old" currentQueryString marked free
(etc)

Memory allocation logic will each time look for next available free block large enough to hold the requested size. If it finds no such big area it will ask from OS a new block. If it cannot get then it will call GC, which will really free all blocks marked free and if there is still no big enough block it will rearrange the occupied blocks to merge free pieces. GC won't merge blocks on Large Object Heap where allocations bigger than 88kb (as I remember) occure. These GC steps are hitting your performance.

So if the size of allocations occuring during steps 1-108 have a growing pattern (sb allocation is a potential suspect) then you will see a continously growing memory usage (because eating memory much faster than making GC steps above). If this growth results in moving to Large Object Heap then you risk getting OutOfMemoryException at some point. If you take a dump of this situation you may see gigabytes of free inprocess memory but still receiving OOM! (this happened to me)

This is just a technical explanation, other's solutions are correct. Always try to reuse already allocated memory, especially if you do something in cycles , especially if those cycles have undefined repeat count:

  • use ReadLineAsync retval for "GO" check instead of a new currentQueryString allocation
  • don't count on compiler optimizes $"..." string out of your cycle.
  • preallocate StringBuilder large enough for longest expected content
  • reuse StringBuilder (clear it) instead of instantiating new one
  • Related