IO Bound Operation and Task.Run()

I am quite new to concurrency (and to C#, actually). I have a bunch of CSV files in two separate directories to read, and I want to do some processing after each file is read. The processing of each file is independent of the other read and process operations. After all the processing is done, I want to update the UI. The UI needs to stay responsive in the meantime too, because I need to display a progress bar. Currently I have something like this:

private string _directoryA;
private string _directoryB;

// The user clicks the button
private void ButtonPressed()
{
    Task.Run(() => DoJob());
} 

private void DoJob()
{
    var tasks = new List<Task>();
    var watch = Stopwatch.StartNew();

    tasks.Add(Task.Run(() => DoJobForDirectory(_directoryA)).ContinueWith(t => Console.WriteLine("First Half")));
    tasks.Add(Task.Run(() => DoJobForDirectory(_directoryB)).ContinueWith(t => Console.WriteLine("Second Half")));
    
    Task.WaitAll(tasks.ToArray());

    watch.Stop();
    Console.WriteLine($"Time Taken : {watch.ElapsedMilliseconds} ms.");
    UpdateUI();
}

private void DoJobForDirectory(string directory)
{
    var files = Directory.EnumerateFiles(directory, "*.csv");
    var tasks = new List<Task>();
    foreach (var file in files)
    {
        // Update the progress bar in the UI when a file has finished processing
        tasks.Add(Task.Run(() => DoJobForFile(file)).ContinueWith(t => UpdateCounter++));
    }

    Task.WaitAll(tasks.ToArray());
}

private void DoJobForFile(string filePath)
{
    ReadCSV();
    ProcessData();
    ...
}

I feel like I am missing something here. From my reading, this operation should be I/O bound, as the processing afterwards is pretty lightweight (some for loops and assignments). So I really should be using just async/await, and not Task.Run()...? However, I couldn't think of a better way to do this. ReadCSV() comes from a library that does not have an async version. Using Parallel.ForEach does not boost the performance either. Is there a better way to do this (to be efficient on resources and also achieve better performance)?

Also, when I tried running on only one directory, the elapsed time was nearly half of the time required for both directories. Since the operations are all independent, I want to run them all in parallel, so processing both directories should take roughly the same (or only slightly more) time as processing a single directory, not twice as long. It seems like no matter how many Task.Run() calls I make, only a limited number of threads run at the same time (some bottleneck). I tried changing all the Task.Run() calls to new Thread(), and observed many more threads active at the same time, but in the end that resulted in worse performance. Why is that?

CodePudding user response：

Calling Task.Run(() => DoJob()) and then blocking inside it with Task.WaitAll() wastes a thread: a thread-pool thread sits idle while it waits for the other tasks to finish.

I would change it to this:

private string _directoryA;
private string _directoryB;

// The user clicks the button
private async void ButtonPressed()
{
    // disable UI controls
    try
    {
        await DoJob();
    }
    finally
    {
        // enable UI controls.
    }
} 

private async Task DoJob()
{
    var tasks = new List<Task>();
    var watch = Stopwatch.StartNew();

    tasks.Add(Task.Run(() => DoJobForDirectory(_directoryA)).ContinueWith(t => Console.WriteLine("First Half")));
    tasks.Add(Task.Run(() => DoJobForDirectory(_directoryB)).ContinueWith(t => Console.WriteLine("Second Half")));
    
    await Task.WhenAll(tasks.ToArray());

    watch.Stop();
    Console.WriteLine($"Time Taken : {watch.ElapsedMilliseconds} ms.");
    UpdateUI();
}

private async Task DoJobForDirectory(string directory)
{
    var files = Directory.EnumerateFiles(directory, "*.csv");
    var tasks = new List<Task>();
    foreach (var file in files)
    {
        // Update the progress bar in the UI when a file has finished processing
        tasks.Add(Task.Run(() => DoJobForFile(file)).ContinueWith(t => UpdateCounter++));
    }
    
    await Task.WhenAll(tasks.ToArray());
}

private void DoJobForFile(string filePath)
{
    ReadCSV();
    ProcessData();
    ...
}

If you want to limit the number of tasks running at once, you could use a SemaphoreSlim. The accepted answer to this question has a good example:

How to limit the Maximum number of parallel tasks in c#
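
As a rough illustration of the SemaphoreSlim idea (a minimal sketch, not the code from the linked answer), the helper below starts one task per item but lets only a fixed number run at once. The names ThrottledRunner and ForEachThrottledAsync are hypothetical and exist only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class ThrottledRunner
{
    // Runs `work` for every item on the thread pool, but lets at most
    // `maxConcurrency` items execute at the same time.
    public static async Task ForEachThrottledAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Action<T> work)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);

        var tasks = items.Select(async item =>
        {
            await gate.WaitAsync();      // wait for a free slot without blocking a thread
            try
            {
                await Task.Run(() => work(item));
            }
            finally
            {
                gate.Release();          // free the slot even if `work` throws
            }
        }).ToList();                     // ToList forces the lazy query, starting every task

        await Task.WhenAll(tasks);
    }
}
```

Inside DoJobForDirectory this could be called as await ThrottledRunner.ForEachThrottledAsync(files, 4, DoJobForFile); where the limit of 4 is an arbitrary value to tune by measurement.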

CodePudding user response：

Task.Run schedules work on the ThreadPool, which is conservative both about how many threads it creates immediately on demand (as many as the machine has cores) and about how frequently it creates new threads when the demand for work is high (one new thread every second). You could try experimenting with the ThreadPool.SetMinThreads method, which affects this behavior. For example:

ThreadPool.SetMinThreads(100, 100);

This way the ThreadPool will create 100 threads immediately on demand, before switching to the conservative algorithm.
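
If you experiment with this, it may be worth reading the current minimums first and checking the return value, since SetMinThreads returns false when it rejects the request. A small sketch follows; ThreadPoolTuning and RaiseMinThreads are hypothetical names used only for this example.

```csharp
using System;
using System.Threading;

// Hypothetical helper, for illustration only.
public static class ThreadPoolTuning
{
    // Raises both thread-pool minimums to at least `desired`, never lowering them.
    // Returns false if the runtime rejected the request.
    public static bool RaiseMinThreads(int desired)
    {
        ThreadPool.GetMinThreads(out int worker, out int io);
        return ThreadPool.SetMinThreads(Math.Max(worker, desired), Math.Max(io, desired));
    }
}
```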

Chances are that you'll see no improvement in the performance of your directory-processing application. That's because your I/O bound workload is throttled by the capabilities of your storage device. No matter what you do in code, the hardware has a limit on how much data it can store or retrieve per unit of time. When you reach this limit, the only way to boost the performance is to upgrade your hardware.

Regarding the suitability of using Task.Run and synchronous APIs for I/O bound work: surprisingly, in many cases it's the most performant way of getting the job done. The synchronous file-system APIs in particular are significantly faster than their asynchronous counterparts. What you lose with the synchronous APIs is memory efficiency. Each thread requires at least 1 MB of memory for its stack, so if you start 1,000 threads at once you'll deprive your system of 1 GB of memory or more, which can indirectly hurt the performance of your application.

Manually starting tasks with Task.Run for the purpose of parallelization is a low-level approach to parallelizing your work. The TPL offers higher-level Task-based tools, like the Parallel class, the PLINQ library (.AsParallel) and the TPL Dataflow library.
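
To give an idea of the higher-level style, here is a minimal sketch using the Parallel class with a capped degree of parallelism. ParallelFiles and ProcessAll are hypothetical names, and processFile stands in for something like DoJobForFile from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class ParallelFiles
{
    // Processes every path with Parallel.ForEach, using at most `maxParallel`
    // concurrent workers, and returns how many items completed.
    public static int ProcessAll(
        IEnumerable<string> paths, int maxParallel, Action<string> processFile)
    {
        int completed = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        Parallel.ForEach(paths, options, path =>
        {
            processFile(path);
            Interlocked.Increment(ref completed);   // thread-safe counter
        });

        return completed;
    }
}
```

Parallel.ForEach blocks the calling thread until all items are done, so from a UI event handler it would need to be offloaded, for example with await Task.Run(() => ParallelFiles.ProcessAll(files, 4, DoJobForFile)); the limit of 4 is arbitrary.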

For updating the UI with progress information during background work, the modern approach is the IProgress<T> interface and the Progress<T> class. You can find an example here, as part of a comparison between Task.Run and the BackgroundWorker class.
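
A minimal sketch of the IProgress<T> pattern, independent of the linked comparison: the worker only knows about IProgress<int>, and the caller decides what to do with each report. ProgressDemo, ProcessAsync and the progressBar control mentioned in the comment are hypothetical names for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class ProgressDemo
{
    // Processes the files on a thread-pool thread and reports how many
    // have finished so far through `progress` (which may be null).
    public static Task ProcessAsync(
        IReadOnlyList<string> paths, Action<string> processFile, IProgress<int> progress)
    {
        return Task.Run(() =>
        {
            for (int i = 0; i < paths.Count; i++)
            {
                processFile(paths[i]);
                progress?.Report(i + 1);
            }
        });
    }

    // In a UI event handler (progressBar is an assumed control):
    //   var progress = new Progress<int>(done => progressBar.Value = done);
    //   await ProgressDemo.ProcessAsync(files, DoJobForFile, progress);
}
```

A Progress<T> created on the UI thread captures its SynchronizationContext, so the callback runs on the UI thread and can touch controls directly. Without a SynchronizationContext, as in a console application, the callbacks are posted to the thread pool instead.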