I have a class that generates DNA sequences, that are represented by long strings. This class implements the IEnumerable<string>
interface, and it can produce an infinite number of DNA sequences. Below is a simplified version of my class:
class DnaGenerator : IEnumerable<string>
{
private readonly IEnumerable<string> _enumerable;
public DnaGenerator() => _enumerable = Iterator();
private IEnumerable<string> Iterator()
{
while (true)
foreach (char c in new char[] { 'A', 'C', 'G', 'T' })
yield return new String(c, 10_000_000);
}
public IEnumerator<string> GetEnumerator() => _enumerable.GetEnumerator();
IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
This class generates the DNA sequences by using an iterator. Instead of invoking the iterator again and again, an IEnumerable<string>
instance is created during the construction and is cached as a private field. The problem is that using this class results in a sizable chunk of memory being constantly allocated, with the garbage collector being unable to recycle this chunk. Here is a minimal demonstration of this behavior:
var dnaGenerator = new DnaGenerator();
Console.WriteLine($"TotalMemory: {GC.GetTotalMemory(true):#,0} bytes");
DoWork(dnaGenerator);
GC.Collect();
Console.WriteLine($"TotalMemory: {GC.GetTotalMemory(true):#,0} bytes");
GC.KeepAlive(dnaGenerator);
static void DoWork(DnaGenerator dnaGenerator)
{
foreach (string dna in dnaGenerator.Take(5))
{
Console.WriteLine($"Processing DNA of {dna.Length:#,0} nucleotides"
$", starting from {dna[0]}");
}
}
Output:
TotalMemory: 84,704 bytes
Processing DNA of 10,000,000 nucleotides, starting from A
Processing DNA of 10,000,000 nucleotides, starting from C
Processing DNA of 10,000,000 nucleotides, starting from G
Processing DNA of 10,000,000 nucleotides, starting from T
Processing DNA of 10,000,000 nucleotides, starting from A
TotalMemory: 20,112,680 bytes
My expectation was that all generated DNA sequences would be eligible for garbage collection, since they are not referenced by my program. The only reference that I hold is the reference to the DnaGenerator
instance itself, which is not meant to contain any sequences. This component just generates the sequences. Nevertheless, no matter how many or how few sequences my program generates, there are always around 20 MB of memory allocated after a full garbage collection.
My question is: Why is this happening? And how can I prevent this from happening?
.NET 6.0, Windows 10, 64-bit operating system, x64-based processor, Release built.
Update: The problem disappears if I replace this:
public IEnumerator<string> GetEnumerator() => _enumerable.GetEnumerator();
...with this:
public IEnumerator<string> GetEnumerator() => Iterator().GetEnumerator();
But I am not a fan of creating a new enumerable each time an enumerator is needed. My understanding is that a single IEnumerable<T>
can create many IEnumerator<T>
s. AFAIK these two interfaces are not meant to have an one-to-one relationship.
CodePudding user response:
The problem is caused by the auto-generated implementation for the code using yield
.
You can mitigate this somewhat by explicitly implementing the enumerator.
You have to fiddle it a bit by calling .Reset()
from public IEnumerator<string> GetEnumerator()
to ensure the enumeration restarts at each call:
class DnaGenerator : IEnumerable<string>
{
private readonly IEnumerator<string> _enumerable;
public DnaGenerator() => _enumerable = new IteratorImpl();
sealed class IteratorImpl : IEnumerator<string>
{
public bool MoveNext()
{
return true; // Infinite sequence.
}
public void Reset()
{
_index = 0;
}
public string Current
{
get
{
var result = new String(_data[_index], 10_000_000);
if ( _index >= _data.Length)
_index = 0;
return result;
}
}
public void Dispose()
{
// Nothing to do.
}
readonly char[] _data = { 'A', 'C', 'G', 'T' };
int _index;
object IEnumerator.Current => Current;
}
public IEnumerator<string> GetEnumerator()
{
_enumerable.Reset();
return _enumerable;
}
IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
CodePudding user response:
Note that 10_000_000 of chars (which are 16 bit) will take approximately 20 MB. If you will take a look at the decompilation
you will notice that yeild return
results in internal <Iterator>
class generated which in turn has a current
field to store the string (to implement IEnumerator<string>.Current
):
[CompilerGenerated]
private sealed class <Iterator>d__2 : IEnumerable<string>, IEnumerable, IEnumerator<string>, IEnumerator, IDisposable
{
...
private string <>2__current;
...
}
And Iterator
method internally will be compiled to something like this:
[IteratorStateMachine(typeof(<Iterator>d__2))]
private IEnumerable<string> Iterator()
{
return new <Iterator>d__2(-2);
}
Which leads to the current string always being stored in memory for _enumerable.GetEnumerator();
implementation (after iteration start) while DnaGenerator
instance is not GCed itself.