I am just considering which provides me the best performance when I use both OrderBy()
and Distinct()
inside a LINQ Query. It seems to me they're both equal in speed as the Distinct()
method will use a hash table while in-memory and I assume that any SQL query would be optimized first by .NET before it gets executed.
Am I correct in assuming this or does the order of these two commands still affect the performance of LINQ in general?
CodePudding user response:
First, seq.OrderBy(...).Distinct()
and seq.Distinct().OrderBy(...)
are not guaranteed to return the same result, because Distinct()
may return an unordered enumeration. MS implementation conveniently preserves the order, but if you pass a LINQ query to the database, the results may come back in any order the DB engine sees fit.
Second, in the extreme case when you have lots of duplication (say, five values repeated randomly 1,000,000 times) you would be better off doing a Distinct
before OrderBy()
.
Long story short, if you want your results to be ordered, use Distinct().OrderBy(...)
regardless of the performance.
CodePudding user response:
For LINQ to objects even if we assume that that OrderBy(...).Distinct()
and Distinct().OrderBy(...)
will return the same result (which is not guaranteed) the performance will depend on the data.
If you have a lot of duplication in data - running Distinct
first should be faster. Next benchmark shows that (at least on my machine):
public class LinqBench
{
private static List<int> test = Enumerable.Range(1, 100)
.SelectMany(i => Enumerable.Repeat(i, 10))
.Select((i, index) => (i, index))
.OrderBy(t => t.index % 10)
.Select(t => t.i)
.ToList();
[Benchmark]
public List<int> OrderByThenDistinct() => test.OrderBy(i => i).Distinct().ToList();
[Benchmark]
public List<int> DistinctThenOrderBy()=> test.Distinct().OrderBy(i => i).ToList();
}
On my machine for .Net Core 3.1 it gives:
Method | Mean | Error | StdDev |
---|---|---|---|
OrderByThenDistinct | 129.74 us | 2.120 us | 1.879 us |
DistinctThenOrderBy | 19.58 us | 0.384 us | 0.794 us |
CodePudding user response:
I assume that any SQL query would be optimized first by .NET before it gets > executed.
And how do you think that would work, given that:
- Only the SQL executing side (the server) has the knowledge for this (i.e. which indices to use) AND has a query optimizer that is supposed to optimize the executed query based on the statistics of the table.
- You have to be VERY sure that you do not change the result in any way.
Sorry, this makes no sense - there are pretty much no optimizations that you CAN safely do in C# without having all the internal details of the database, so the query is sent to the database for analysis.
As such, an OrderBy or a Distinct (ESPECIALLY a distinct) WILL impact performance - how much depends on i.e. whether the OrderBy can rely on an index.
or does the order of these two commands still affect the performance of LINQ in general?
Here it gets funny (and you give no example).
DISTINCT and ORDERBY are in SQL in a specific order, regardless how you formulated it in LINQ. There is only ONE allowed syntax as per SQL definition. LINQ puts the query together and optimizes that out. If you look at the syntax, there is a specific place for the DISTINCT (which is a SQL term for at least SQL Server) and the OrderBy.
On the other side...
.Distinct().OrderBy() and .OrderBy().Distinct()
have DIFFERENT RESULTS. They CAN be done in SQL (you can use the output of the Distinct as a virtual table that you then order), but they have a different semantic. Unless you think that LINQ will magically read your mind, there is no context for the compiler other than to assume you are competent in writing what you do (as long as it is legal) and execute these steps in the order you gave.
Except: The DOCUMENTATION for Distinct in Queryable is clear this is not done:
says that Distinct returns an unordered list.
So, there is a fundamental difference and they are not the same.