Consider this example:
List<int> GetLengths(ICollection<string> strings) =>
strings
.Select(s => s.Length)
.ToList();
If Select()
checked that the input collection implements ICollection<T>
and that its length is known in advance, it could have returned ICollection<T>
too, so that ToList()
would initialize the resulting list with capacity. Instead, the list adds each value one by one, expanding its storage log(N) times.
Is there any reason it's not done so in LINQ?
Update: since there are many questions about my suggestion, here are some justifications for my concept:
- LINQ already returns many different implementation of
IEnumerable<T>
. IMO there is nothing wrong with adding another iterator that implements one more interface. - A read-only
ICollection<T>
doesn't have to be materialized in memory, it only has to haveCount
. Here is an example of a bare implementation ofICollection<T>
that behaves similar toEnumerable.Repeat<T>()
except it calls a delegate to generate each element. It throws exceptions left and right, but so the stockReadOnlyCollection<T>
does. - The
List<T>(IEnumerable<T> collection)
constructor already checks ifcollection
also implementsICollection<T>
, so that the list can allocate its storage in advance. It doesn't violate any interfaces or conventions. - From the architectural point of view, implementing
IReadOnlyCollection<T>
would make more sense, but unfortunately, it's often neglected in the BCL itself, and theList<T>
constructor doesn't check it.
CodePudding user response:
Optimizing the source.Select().ToList()
, as well as the source.Select().ToDictionary()
, was discussed a few years ago (2015) in this GitHub issue: System.Linq performance improvement suggestions
The suggestions were reviewed, and the decision was to not implement the optimizations. Here is an excerpt from the last post in the thread, by Stephen Toub, Sep 15, 2015.
Through this thread as well as a bunch of PRs that have happened in the interim, we did agree that we don't want to expose existing collection interfaces out of LINQ, e.g. the object returned by Select shouldn't implement
ICollection<T>
, but we are ok with and have introduced additional internal interfaces to help curry certain information through the built-in operators, e.g. the iterator implemented bySelect
implementsIArrayProvider
so that itsToArray
implementation can be optimized to work on the original array with complete knowledge of its size, using its indexer rather than going throughIEnumerable
, etc.At this point, given the original tasks called out on this thread, it looks like the only concrete remaining items would be around specializing methods like
ToDictionary
to have more knowledge of the original source. I suggest that if someone wants to implement such optimizations and do detailed measurements, such optimizations with associated measurements could be submitted as individual PRs, and an issue like this isn't needed to track them. So I'm going to leave this closed and not open new issues. Obviously folks should feel free to open subsequent issues / PRs for individual topics to be discussed / solutions implemented.
IMHO these internal LINQ optimizations are very much desirable.
Update: Apparently the source.Select().ToList()
and source.Select().ToArray()
operations have been optimized at a later stage of .NET's evolution, as can be demonstrated experimentally with this demo:
byte[] source = new byte[1_000_000];
long mem0 = GC.GetTotalAllocatedBytes(true);
List<byte> result = source.Select(n => (byte)1).ToList();
long mem1 = GC.GetTotalAllocatedBytes(true);
Console.WriteLine($"Allocated: {mem1 - mem0:#,0} bytes");
Output on .NET 6, Release mode:
Allocated: 1,000,232 bytes
So the byte
array was projected and then materialized to a
List<byte>
in one go, without log(N) allocations and expansions.
CodePudding user response:
If
Select()
checked that the input collection implementsICollection<T>
and that its length is known in advance, it could have returnedICollection<T>
too
Whilst it could, that would mean Select
would have to immediately return a materialised collection.
Select
returns an iterator, which can be passed through other LINQ methods (e.g. Where
), whilst still requiring only a single iteration of the source.
Returning a collection at each stage would likely be less efficient in many queries.