I have to create a HashSet with the elements from 1 to N + 1, where N is a large number (1M).
For example, if N = 5, the HashSet will then have the integers {1, 2, 3, 4, 5, 6}.
The only way I have found is:
HashSet<int> numbers = new HashSet<int>(N);
for (int i = 1; i <= (N + 1); i++)
{
numbers.Add(i);
}
Is there another, faster (more efficient) way to do it?
CodePudding user response:
6 is a tiny number of items, so I suspect the real problem is adding a few thousand items. The delays in this case are caused by buffer reallocations, not by the speed of Add itself.
The solution is to specify a capacity, even an approximate one, when constructing the HashSet:
var set = new HashSet<int>(1000);
If, and only if, the input implements ICollection<T>, the HashSet<T>(IEnumerable<T>) constructor will check the size of the input collection and use it as its capacity:
if (collection is ICollection<T> coll)
{
int count = coll.Count;
if (count > 0)
{
Initialize(count);
}
}
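As a rough illustration of that check (a sketch, not the exact behavior of every runtime version; in particular, the sequence returned by Enumerable.Range is not guaranteed to implement ICollection<int>), passing a materialized collection lets the constructor size itself up front, while a purely lazy sequence may force it to grow while enumerating:
using System;
using System.Collections.Generic;
using System.Linq;
int N = 1_000_000;
// int[] implements ICollection<int>, so the constructor can read Count
// and allocate an appropriately sized buffer before copying the items.
int[] materialized = Enumerable.Range(1, N + 1).ToArray();
var fromArray = new HashSet<int>(materialized);
// A lazy sequence may not expose a count, in which case the set starts
// small and reallocates while it enumerates.
var fromLazy = new HashSet<int>(Enumerable.Range(1, N + 1));
Whether materializing first is a net win depends on the workload, since the intermediate array itself costs an allocation of roughly the same size; the point is only that Count is available up front.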
Explanation
Most containers in .NET use buffers internally to store data. This is far faster than implementing containers with pointers and nodes, due to CPU caches and RAM access latency: accessing the next item from the CPU's cache is far faster than chasing a pointer into RAM.
The downside is that each time the buffer fills up, a new one has to be allocated, typically twice the size of the original. Adding items one by one can therefore result in log2(N) reallocations. That works fine for a moderate number of items, but adding e.g. 1000 items one by one can leave a lot of orphaned buffers behind, all of which have to be garbage collected at some point, causing additional delays.
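To make that concrete, here is a minimal sketch contrasting the two patterns (the exact resize policy is an implementation detail, so treat the log2(N) figure as an approximation); HashSet<T>.EnsureCapacity is also shown, since it lets you pre-size a set you did not construct yourself:
using System.Collections.Generic;
int itemCount = 1_000_000;
// Growing from the default capacity: roughly log2(itemCount) reallocations,
// each one allocating a bigger buffer, rehashing, and orphaning the old buffer.
var grown = new HashSet<int>();
for (int i = 1; i <= itemCount; i++)
    grown.Add(i);
// Pre-sizing allocates once up front; EnsureCapacity does the same for a set
// that already exists (it is a no-op here because the capacity is already sufficient).
var presized = new HashSet<int>(itemCount);
presized.EnsureCapacity(itemCount);
for (int i = 1; i <= itemCount; i++)
    presized.Add(i);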
CodePudding user response:
Here's the code to test the three options:
var N = 1000000;
var trials = new List<(int method, TimeSpan duration)>();
for (var i = 0; i < 100; i++)
{
var sw = Stopwatch.StartNew();
HashSet<int> numbers1 = new HashSet<int>(Enumerable.Range(1, N + 1));
sw.Stop();
trials.Add((1, sw.Elapsed));
sw = Stopwatch.StartNew();
HashSet<int> numbers2 = new HashSet<int>(N);
for (int n = 1; n < N + 1; n++)
numbers2.Add(n);
sw.Stop();
trials.Add((2, sw.Elapsed));
sw = Stopwatch.StartNew();
HashSet<int> numbers3 = new HashSet<int>(N);
foreach (int n in Enumerable.Range(1, N + 1))
numbers3.Add(n);
sw.Stop();
trials.Add((3, sw.Elapsed));
}
for (int j = 1; j <= 3; j++)
Console.WriteLine(trials.Where(x => x.method == j).Average(x => x.duration.TotalMilliseconds));
Typical output is this:
31.314788
16.493208
16.493208
It is nearly twice as fast to preallocate the capacity of the HashSet<int>.
There is no difference between the traditional loop and the LINQ foreach option.
CodePudding user response:
To build on @Enigmativity's answer, here's a proper benchmark using BenchmarkDotNet:
public class Benchmark
{
private const int N = 1000000;
[Benchmark]
public HashSet<int> EnumerableRange() => new HashSet<int>(Enumerable.Range(1, N + 1));
[Benchmark]
public HashSet<int> NoPreallocation()
{
var result = new HashSet<int>();
for (int n = 1; n < N + 1; n++)
{
result.Add(n);
}
return result;
}
[Benchmark]
public HashSet<int> Preallocation()
{
var result = new HashSet<int>(N);
for (int n = 1; n < N + 1; n++)
{
result.Add(n);
}
return result;
}
}
public class Program
{
public static void Main(string[] args)
{
BenchmarkRunner.Run(typeof(Program).Assembly);
}
}
With the results:
| Method          | Mean     | Error    | StdDev   |
|-----------------|---------:|---------:|---------:|
| EnumerableRange | 29.17 ms | 0.743 ms | 2.179 ms |
| NoPreallocation | 23.96 ms | 0.471 ms | 0.775 ms |
| Preallocation   | 11.68 ms | 0.233 ms | 0.665 ms |
As we can see, using LINQ is a bit slower than not using LINQ (as expected), and pre-allocating saves a significant amount of time.