Create Hashset with a large number of elements (1M)-CodePudding

I have to create a HashSet with the elements from 1 to N 1, where N is a large number (1M).

For example, if N = 5, the HashSet will have then integers {1, 2, 3, 4, 5, 6 }.

The only way I have found is:

HashSet<int> numbers = new HashSet<int>(N);

for (int i = 1; i <= (N   1) ; i  )
{
    numbers.Add(i);
}

Are there another faster (more efficient) ways to do it?

CodePudding user response：

6 is a tiny number of items so I suspect the real problem is adding a few thousand items. The delays in this case are caused by buffer reallocations, not the speed of Add itself.

The solution to this is to specify even an approximate capacity when constructing the HashSet :

var set=new HashSet<int>(1000);

If, and only if, the input implements ICollection<T>, the HashSet<T>(IEnumerable<T>) constructor will check the size of input collection and use it as its capacity:

if (collection is ICollection<T> coll)
{
    int count = coll.Count;
    if (count > 0)
    {
        Initialize(count);
    }
}

Explanation

Most containers in .NET use buffers internally to store data. This is far faster than implementing containers using pointers, nodes etc due to CPU cache and RAM access delays. Accessing the next item in the CPU's cache is far faster than chasing a pointer in RAM in all CPUs.

The downside is that each time the buffer is full a new one will have to be allocated. Typically, this buffer will have twice the size of the original buffer. Adding items one by one can result in log2(N) reallocations. This works fine for a moderate number of items but can result in a lot of orphaned buffers when adding eg 1000 items one by one. All those temporary buffers will have to be garbage collected at some point, causing additional delays.

CodePudding user response：

Here's the code to test the three options:

var N = 1000000;
var trials = new List<(int method, TimeSpan duration)>();

for (var i = 0; i < 100; i  )
{
    var sw = Stopwatch.StartNew();
    HashSet<int> numbers1 = new HashSet<int>(Enumerable.Range(1, N   1));
    sw.Stop();
    trials.Add((1, sw.Elapsed));
    sw = Stopwatch.StartNew();
    HashSet<int> numbers2 = new HashSet<int>(N);
    for (int n = 1; n < N   1; n  )
        numbers2.Add(n);
    sw.Stop();
    trials.Add((2, sw.Elapsed));
    HashSet<int> numbers3 = new HashSet<int>(N);
    foreach (int n in Enumerable.Range(1, N   1))
        numbers3.Add(n);
    sw.Stop();
    trials.Add((3, sw.Elapsed));
}

for (int j = 1; j <= 3; j  )
    Console.WriteLine(trials.Where(x => x.method == j).Average(x => x.duration.TotalMilliseconds));

Typical output is this:

31.314788
16.493208
16.493208

It is nearly twice as fast to preallocate the capacity of the HashSet<int>.

There is no difference between the traditional loop and a LINQ foreach option.

CodePudding user response：

To build on @Enigmativity's answer, here's a proper benchmark using BenchmarkDotNet:

public class Benchmark
{
    private const int N = 1000000;

    [Benchmark]
    public HashSet<int> EnumerableRange() => new HashSet<int>(Enumerable.Range(1, N   1));

    [Benchmark]
    public HashSet<int> NoPreallocation()
    {
        var result = new HashSet<int>();
        for (int n = 1; n < N   1; n  )
        {
            result.Add(n);
        }
        return result;
    }

    [Benchmark]
    public HashSet<int> Preallocation()
    {
        var result = new HashSet<int>(N);
        for (int n = 1; n < N   1; n  )
        {
            result.Add(n);
        }
        return result;
    }
}

public class Program
{
    public static void Main(string[] args)
    {
        BenchmarkRunner.Run(typeof(Program).Assembly);
    }
}

With the results:

Method	Mean	Error	StdDev
EnumerableRange	29.17 ms	0.743 ms	2.179 ms
NoPreallocation	23.96 ms	0.471 ms	0.775 ms
Preallocation	11.68 ms	0.233 ms	0.665 ms

As we can see, using linq is a bit slower than not using linq (as expected), and pre-allocating saves a significant amount of time.