I need the fastest way to process math on numerical array data


I apologize if this is in the incorrect forum. I have found plenty of array-manipulation questions on this site, but most of them average or sum an array of numerics as a single set using LINQ, which works well when aggregating all the values of one array. I need something different: to process each index across multiple arrays (all of the same size).

My routine receives array data from devices, typically double[512] or ushort[512]. A single device will always produce arrays of the same size, but the size can range from 256 to 2048 elements depending on the device. I need to hold CountToAverage arrays at a time to average. Each time an array is received, it must be pushed onto the queue and the oldest array popped off, so the number of arrays in the averaging process stays constant (for this benchmark, that part of the process is fixed in Setup()). For comparison purposes, the benchmark results are shown after the code.

  1. What I am looking for is the fastest, most efficient way to average the values at each index across all the arrays, and return a new array (of the same size) where each index is the average of that index over the set of arrays. The count of arrays to be averaged can range from 3 to 25 (the code below sets the benchmark parameter to 10). I have 2 different averaging methods in the test; the 2nd is significantly faster, 6-7 times faster than the first. My first question is: is there any way to achieve this faster, ideally at O(1) or O(log n) time complexity?

  2. Secondarily, I am using a Queue (which may be changed to a ConcurrentQueue in the implementation) as a holder for the arrays to be processed. My primary reasoning for using a queue is that it guarantees FIFO processing of the feed of arrays, which is critical. It also lets me iterate over the values in the Queue with a foreach loop (just like a List) without having to dequeue until I am ready. I would be interested to know whether this hinders performance, as I haven't benchmarked it. Keep in mind it must be thread-safe. If you have an alternative way to process multiple sets of array data in a thread-safe manner, I am "all ears". A sketch of the push/pop pattern I mean follows.
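To make that pattern concrete, here is a minimal sketch of the bounded push/pop described above using ConcurrentQueue. OnArrayReceived is a placeholder name, not my actual handler:

using System.Collections.Concurrent;

public class ArrayFeed
{
    // Thread-safe FIFO holder; enumerating a ConcurrentQueue works on a
    // snapshot, so readers never see a half-updated window.
    private readonly ConcurrentQueue<double[]> calcRepo = new();
    private readonly int countToAverage;

    public ArrayFeed(int countToAverage) => this.countToAverage = countToAverage;

    // Push the newest array, pop the oldest, so the averaging window
    // always holds exactly countToAverage arrays.
    public void OnArrayReceived(double[] data)
    {
        calcRepo.Enqueue(data);
        while (calcRepo.Count > countToAverage)
            calcRepo.TryDequeue(out _);
    }
}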

The reason for the performance requirement is that this is not the only processing going on. I have multiple devices "streaming" array results at a rate of roughly one every 1-5 milliseconds per device, each arriving from different threads/processes/connections, and the data still has to pass through several other, much more intensive algorithms, so this step cannot become a bottleneck.

Any insights on optimizations and performance are appreciated.

using System;
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

namespace ArrayAverage
{
    public class ArrayAverage
    {
        [Params(10)]
        public int CountToAverage;

        [Params(512, 2048)]
        public int PixelSize;

        static Queue<double[]> calcRepo = new Queue<double[]>();
        
        [Benchmark]
        public double[] CalculateIndexAverages()
        {
            // This is too slow
            var avg = new double[PixelSize];
            for (int i = 0; i < PixelSize; i++)
            {
                foreach (var arrayData in calcRepo)
                {
                    avg[i] += arrayData[i];
                }
                avg[i] /= calcRepo.Count;
            }
            return avg;
        }
        
        [Benchmark]
        public double[] CalculateIndexAverages2()
        {
            // this is faster, but is it the fastest?
            var sum = new double[PixelSize];
            int cnt = calcRepo.Count;
            foreach (var arrayData in calcRepo)
            {
                for (int i = 0; i < PixelSize; i++)
                {
                    sum[i] += arrayData[i];
                }
            }

            var avg = new double[PixelSize];
            for (int i = 0; i < PixelSize; i++)
            {
                avg[i] = sum[i] / cnt;
            }

            return avg;
        }
        
        [GlobalSetup]
        public void Setup()
        {
            // Just generating some data as simple Triangular curve simulating a range of spectra
            for (double offset = 0; offset < CountToAverage; offset++)
            {
                var values = new double[PixelSize];
                var decrement = 0;
                for (int i = 0; i < PixelSize; i++)
                {
                    if (i > (PixelSize / 2))
                        decrement--;
                    values[i] = (offset / 7) + i + (decrement * 2);
                }
                calcRepo.Enqueue(values);
            }
        }        
    }
    
    public class App
    {
        public static void Main()
        {
            BenchmarkRunner.Run<ArrayAverage>();
        }
    }
}

Benchmark results:


BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1348 (21H1/May2021Update)
Intel Core i7-6700HQ CPU 2.60GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.7.21379.14
  [Host]     : .NET 5.0.12 (5.0.1221.52207), X64 RyuJIT  [AttachedDebugger]
  DefaultJob : .NET 5.0.12 (5.0.1221.52207), X64 RyuJIT
| Method                  | Arrays To Average | Array Size |       Mean |     Error |    StdDev |
|------------------------ |------------------ |----------- |-----------:|----------:|----------:|
| CalculateIndexAverages  | 10                | 512        |  32.164 μs | 0.5485 μs | 0.5130 μs |
| CalculateIndexAverages2 | 10                | 512        |   5.792 μs | 0.1135 μs | 0.2241 μs |
| CalculateIndexAverages  | 10                | 2048       | 123.628 μs | 2.3394 μs | 1.9535 μs |
| CalculateIndexAverages2 | 10                | 2048       |  22.311 μs | 0.4366 μs | 0.8093 μs |

CodePudding user response:

When dealing with simple operations on a large amount of data, you'd be very interested in SIMD:

SIMD stands for "single instruction, multiple data". It’s a set of processor instructions that ... allows mathematical operations to execute over a set of values in parallel.

In your particular case, the Vector<T> example would give you a quick win. Naively converting your fastest method to use Vector<T> already gives a ~2x speed-up on my PC.

public double[] CalculateIndexAverages4() {
    // Assumption: PixelSize is a round multiple of Vector<>.Count
    // If not, you'll have to add in the 'remainder' from the example.
    var batch = Vector<double>.Count;
    
    var sum = new double[PixelSize];
    foreach (var arrayData in calcRepo) {
        // Vectorised summing:
        for (int i = 0; i <= PixelSize - batch; i += batch) {
            var vSum = new Vector<double>(sum, i);
            var vData = new Vector<double>(arrayData, i);
            (vSum + vData).CopyTo(sum, i);
        }
    }

    var vCnt = Vector<double>.One * calcRepo.Count;
    // Reuse sum[] for averaging, so we don't incur memory allocation cost
    for (int i = 0; i <= PixelSize - batch; i += batch) {
        var vSum = new Vector<double>(sum, i);
        (vSum / vCnt).CopyTo(sum, i);
    }
    return sum;
}

Vector<T>.Count tells you how many items are processed in a single instruction. In the case of double, it is likely to be 4 on most modern CPUs supporting AVX2.
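If PixelSize is not a multiple of Vector<double>.Count, a scalar tail can cover the leftover elements. A minimal sketch of what could be inserted just before the return in CalculateIndexAverages4 (same variable names as above):

    // Scalar tail: the last PixelSize % batch elements that the
    // vectorised loops above skipped.
    int tail = PixelSize - (PixelSize % batch);
    foreach (var arrayData in calcRepo) {
        for (int i = tail; i < PixelSize; i++)
            sum[i] += arrayData[i];
    }
    for (int i = tail; i < PixelSize; i++)
        sum[i] /= calcRepo.Count;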

If you're okay with losing precision and can move from double to float, you'll get a much bigger win by again doubling the amount of data processed in a single CPU op. All of this without even changing your algorithm; a sketch of the float variant follows.
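For example (my own sketch, assuming the device data tolerates single precision, and ignoring tail elements as above; Vector<T> lives in System.Numerics):

using System.Collections.Generic;
using System.Numerics;

public static float[] AverageFloat(IReadOnlyCollection<float[]> repo, int pixelSize) {
    int batch = Vector<float>.Count;   // e.g. 8 floats vs 4 doubles with AVX2
    var sum = new float[pixelSize];
    foreach (var arrayData in repo) {
        // Vectorised summing, identical in shape to the double version:
        for (int i = 0; i <= pixelSize - batch; i += batch) {
            var vSum = new Vector<float>(sum, i);
            var vData = new Vector<float>(arrayData, i);
            (vSum + vData).CopyTo(sum, i);
        }
    }
    var vCnt = Vector<float>.One * repo.Count;
    for (int i = 0; i <= pixelSize - batch; i += batch) {
        var vSum = new Vector<float>(sum, i);
        (vSum / vCnt).CopyTo(sum, i);
    }
    return sum;
}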

CodePudding user response:

In my opinion this is already fairly well-optimized code. A major reason the second option is faster is that it accesses memory linearly, instead of jumping between multiple different arrays. Another factor is that foreach loops have some overhead, so placing the foreach in the outer loop also helps a bit.

You might gain a little performance by switching the queue and foreach loop to a list/array and a for loop, but since PixelSize is much larger than CountToAverage I would expect the benefit to be fairly small. A sketch of that swap follows.
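For instance, snapshotting the queue once per calculation (Queue<T>.ToArray() copies only the references, in FIFO order) and then indexing:

double[][] snapshot = calcRepo.ToArray();
var sum = new double[PixelSize];
for (int a = 0; a < snapshot.Length; a++)
{
    double[] arrayData = snapshot[a];   // linear, index-based access
    for (int i = 0; i < PixelSize; i++)
        sum[i] += arrayData[i];
}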

Unrolling the loop to process, say, 4 values at a time might help a bit. The C# JIT can apply such an optimization automatically, but it is often difficult to tell which optimizations are applied, so it may be easier just to test; a sketch follows.
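What a manual 4-way unroll of the inner summing loop from CalculateIndexAverages2 could look like (assuming PixelSize is a multiple of 4; benchmark to see whether it beats the JIT):

foreach (var arrayData in calcRepo)
{
    for (int i = 0; i < PixelSize; i += 4)
    {
        sum[i]     += arrayData[i];
        sum[i + 1] += arrayData[i + 1];
        sum[i + 2] += arrayData[i + 2];
        sum[i + 3] += arrayData[i + 3];
    }
}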

The next step would be to look at parallelization. Simple summing code like this might benefit from SIMD, processing multiple values at a time. The link shows that using processor-specific intrinsics has a much larger benefit than the more general Vector<T>, but it may require separate code paths for each platform you are targeting. The link also has performance examples of summing values at various levels of optimization, with example code, so it is well worth a read.
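To illustrate (this is my own minimal sketch, not the linked article's code), here is the inner sum written with x86 AVX intrinsics. AddInto is a hypothetical helper and requires compiling with unsafe code enabled:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Adds data[] into sum[] element-wise, 4 doubles per AVX instruction,
// with a scalar tail that also serves as the fallback when AVX is absent.
static unsafe void AddInto(double[] sum, double[] data)
{
    int i = 0;
    if (Avx.IsSupported)
    {
        fixed (double* pSum = sum, pData = data)
        {
            for (; i <= sum.Length - 4; i += 4)
                Avx.Store(pSum + i, Avx.Add(Avx.LoadVector256(pSum + i),
                                            Avx.LoadVector256(pData + i)));
        }
    }
    for (; i < sum.Length; i++)   // scalar tail, or full loop without AVX
        sum[i] += data[i];
}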

Another option would be to use multiple threads with Parallel.For/ForEach, but at ~6 μs per calculation the threading overhead is likely to exceed any gains unless the size of the data is significantly larger.
