Performance issue when using multithreading in C#

I have a file that contains 125 lines like this:

blue
black
yellow
...
purple

I want to create 5 threads, and those 5 threads will each take 25 different lines of the file and print them to the console window. It doesn't matter if they aren't printed in ascending order, as long as every line is printed.

The code that I tried looks like this:

string[] colors = File.ReadAllLines("colors.txt");
Thread[] threads = new Thread[5];
Console.WriteLine(threads.Length); // 5

for (int i = 0; i < threads.Length; i++)
{
    int indexStart = colors.Length * i / threads.Length;
    int indexStop = colors.Length * (i + 1) / threads.Length;
    new Thread(() =>
    {
        for (int j = indexStart; j < indexStop; j++)
        {
            Console.WriteLine(colors[j]);
        }
    }).Start();
}

Console.ReadLine();

When I run the program, it seems to be no faster than a single-threaded program. What am I doing wrong?

CodePudding user response:

I guess it happens because only one thread can write to the console at a time. The console is thread safe, so the threads won't interfere with one another, but you can't actually write from 2 or more threads at the same moment; the writes end up serialized.
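You can see the serialization directly by timing the same output workload with one thread and with five. Here is a minimal sketch (the fabricated lines and the counts are placeholders, and absolute timings will vary with the terminal); the two figures typically come out about the same:

using System;
using System.Diagnostics;
using System.Threading;

class ConsoleContentionDemo
{
    static void Main()
    {
        // Fabricate some lines so the demo is self-contained.
        string[] lines = new string[10_000];
        for (int i = 0; i < lines.Length; i++) lines[i] = "color " + i;

        // Single-threaded baseline.
        var sw = Stopwatch.StartNew();
        foreach (string line in lines) Console.WriteLine(line);
        long single = sw.ElapsedMilliseconds;

        // Five threads, each printing its own fifth of the lines.
        sw.Restart();
        Thread[] threads = new Thread[5];
        for (int t = 0; t < threads.Length; t++)
        {
            int start = lines.Length * t / threads.Length;
            int stop = lines.Length * (t + 1) / threads.Length;
            threads[t] = new Thread(() =>
            {
                for (int j = start; j < stop; j++) Console.WriteLine(lines[j]);
            });
            threads[t].Start();
        }
        foreach (Thread t in threads) t.Join(); // wait, so the timing is fair

        Console.WriteLine($"single: {single} ms, five threads: {sw.ElapsedMilliseconds} ms");
    }
}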

CodePudding user response:

In general, in any language, multiple threads do not lead to a performance increase when the work is dominated by input/output. I/O of all types tends to be serial: there's only one SATA channel on the SSD, the NIC has only one Ethernet cable attached, there's only one GPU rendering the console window, and so on. Every piece of data going in or out has to pass through that single interface.

Almost by definition, I/O is slow in comparison to the CPU and its memory, so having multiple threads access an I/O resource is not going to make that resource work any quicker. An SSD might seem fast, for example, but it is crawling along in comparison to the CPU and its memory.

Also, in such situations, having multiple threads contending for a limited I/O resource will generally slow things down fractionally; time is lost switching between the threads being served by the I/O device.

When to Thread, Or Not?

The art of multithreaded programming is to know when an application is going to be I/O-limited, or compute-limited. If limited by the I/O performance, there's no point adding threads. If limited by the speed of a single CPU core, there is value in adding threads.
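For contrast, here is the kind of workload where threads do pay off: pure computation with no shared I/O. A rough sketch, where the deliberately naive prime counting is just an arbitrary stand-in for CPU-bound work (the limit is a placeholder); on a multi-core machine the parallel version should finish noticeably sooner:

using System;
using System.Diagnostics;
using System.Linq;

class CpuBoundDemo
{
    // Deliberately naive primality test: pure CPU work, no I/O.
    static bool IsPrime(int n)
    {
        if (n < 2) return false;
        for (int d = 2; d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }

    static void Main()
    {
        const int limit = 2_000_000;

        // Count primes among the first `limit` integers starting at 2, sequentially.
        var sw = Stopwatch.StartNew();
        int sequential = Enumerable.Range(2, limit).Count(IsPrime);
        Console.WriteLine($"sequential: {sequential} primes in {sw.ElapsedMilliseconds} ms");

        // Same count, spread across cores with PLINQ.
        sw.Restart();
        int parallel = Enumerable.Range(2, limit).AsParallel().Count(IsPrime);
        Console.WriteLine($"parallel:   {parallel} primes in {sw.ElapsedMilliseconds} ms");
    }
}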

However, it doesn't end there, as the CPU's main memory is in itself an I/O device (RAM chips on the end of a memory bus), as are the CPU's caches. If your "work" is trivial per data item, e.g. simply adding up numbers in a large array, then you're likely to end up I/O bound on the main memory; the data cannot be shuffled from memory into the CPU fast enough, and the CPU ends up waiting for data to arrive. There is no point in adding more threads, as they're not going to result in main memory working any faster.
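A minimal sketch of that memory-bound case, summing a large array with one thread and then with one worker per core; the array size is a placeholder, and on real hardware the speed-up typically flattens out well below the core count once memory bandwidth saturates:

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class MemoryBoundDemo
{
    static void Main()
    {
        // ~400 MB of doubles: far larger than any cache, so the sum is
        // limited by memory bandwidth, not by arithmetic.
        var data = new double[50_000_000];
        for (int i = 0; i < data.Length; i++) data[i] = 1.0;

        var sw = Stopwatch.StartNew();
        double total = 0;
        for (int i = 0; i < data.Length; i++) total += data[i];
        Console.WriteLine($"1 thread: {total} in {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        int cores = Environment.ProcessorCount;
        double parallelTotal = 0;
        object gate = new object();
        Parallel.For(0, cores, core =>
        {
            // Each worker sums its own slice, then the partial sums are combined.
            int start = (int)((long)data.Length * core / cores);
            int stop = (int)((long)data.Length * (core + 1) / cores);
            double local = 0;
            for (int i = start; i < stop; i++) local += data[i];
            lock (gate) parallelTotal += local;
        });
        Console.WriteLine($"{cores} threads: {parallelTotal} in {sw.ElapsedMilliseconds} ms");
    }
}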

Things can get very complex in the interaction between CPU, cache and memory; on the PowerPC, for instance, you could (in the right circumstances) eke out a bit more performance by multiplying data by 1. Nugatory work, you might think, but the extra operation could make the cache and memory interact better, leading to a faster result.

So knowing when to thread and when not to (for the sake of improving algorithm throughput) means having a good idea of how much work your algorithm can do on a small chunk of data before a result has to be written back to main memory.

What you're aiming for is a quantity of data, and the code that works on it, that both fit inside the L1 cache, and that do as many useful operations as possible on that data before a result has to be written back to somewhere else in memory or fresh data is required from some other part of memory. The more work the code can usefully do on the data while it sits in L1 cache, the fewer L1 cache misses (and associated delays). That way, extra threads will add performance, because on a multi-core CPU you will then have all cores humming along nicely, rarely contending for access to L3 cache or main memory.
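One concrete way to apply this idea: if an algorithm makes several passes over a large array, processing it in cache-sized blocks lets the later passes hit data that is still in L1. A sketch, assuming a typical 32 KB L1 data cache (the block size and the array size are guesses about the target machine, not measurements):

using System;
using System.Diagnostics;

class CacheBlockingSketch
{
    // Assume a ~32 KB L1 data cache; this is a guess about the target CPU.
    const int BlockDoubles = 32 * 1024 / sizeof(double);

    static void TwoPass(double[] a)
    {
        for (int i = 0; i < a.Length; i++) a[i] *= 2.0; // pass 1: whole array through the cache
        for (int i = 0; i < a.Length; i++) a[i] += 1.0; // pass 2: pull it all through again
    }

    static void Blocked(double[] a)
    {
        for (int start = 0; start < a.Length; start += BlockDoubles)
        {
            int end = Math.Min(start + BlockDoubles, a.Length);
            for (int i = start; i < end; i++) a[i] *= 2.0; // block is cache-hot...
            for (int i = start; i < end; i++) a[i] += 1.0; // ...so this pass hits L1
        }
    }

    static void Main()
    {
        var a = new double[10_000_000];
        var b = new double[10_000_000];

        var sw = Stopwatch.StartNew();
        TwoPass(a);
        Console.WriteLine($"two-pass: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        Blocked(b);
        Console.WriteLine($"blocked:  {sw.ElapsedMilliseconds} ms");
    }
}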

Where C# makes this harder is that, as a garbage collected language, there's more code running as part of your process than the source code you've written (things like the GC). It's also hard to know exactly what's going to happen (in terms of memory actions) for any one given line of code. All this can upset the apple cart.

A nice thing about doing this in C is that it's very clear from the code what it is doing to memory. Very little gets hidden by the syntax of the language.

Cell Processor

OK, so if you're developing such a piece of code, and you're having to estimate the work done in order to break the algorithm down into threads, what helps and what gets in the way? Well, actually, today's SMP implementations in almost all CPUs (ARM, Intel, AMD, etc.) are a hindrance, because it's very difficult to fully understand everything the core, cache and memory subsystem will do. Intel provides a software tool for this kind of thing, VTune Profiler, to help developers understand how a piece of code interacts with the CPU, caches, and so on.

This is where the Cell processor (as found in the Sony PlayStation 3) was good. It had multiple maths cores, but no cache and no SMP. The silicon that might otherwise have been cache was instead on-chip RAM for 8 separate SPE cores (the maths cores). The developer had to write code that explicitly loaded data and code from main memory into a core; it had to fit and run in that limited amount of on-chip memory, and results had to be written back to main memory (or into the memory of another core). Or the results could stay put and new code could be loaded into the core.

It sounds complicated, but it was good because you didn't have to understand what the chip was going to do by way of caches or memory; it did nothing behind your back. So, actually, this simplified the task of breaking an algorithm down into core-sized chunks, because your break-down either fitted inside the fixed amount of per-core memory or it didn't. If it did fit, it would be lightning fast and scalable. If it didn't, it wouldn't run at all. OK, it was hard to get to that point, but when you did the results were superb.

For outright performance, it took Intel a long time to produce an x64 CPU that could match the Cell processor.
