I am writing a program which scours the entire filesystem of a computer in order to destroy any files which fall within certain parameters. I want the program to run as fast as possible and utilize as many resources as necessary to achieve this (it's worth noting that the user is not expected to be doing any other work while this process takes place). To that end, I've written a method which takes a target directory, searches all the files in it, then queues up a new task for each child directory. This is currently done by passing the directories' paths into a queue which the main thread monitors and uses to actually initialize the new tasks, like so:
static class DriveHandler
{
    internal static readonly List<string> fixedDrives = GetFixedDrives();

    private static readonly ConcurrentQueue<string> _targetPathQueue = new ConcurrentQueue<string>();
    private static int _threadCounter = 0;

    internal static void WipeDrives()
    {
        foreach (string driveLetter in fixedDrives)
        {
            Interlocked.Increment(ref _threadCounter);
            Task.Run(() => WalkDrive(driveLetter));
        }
        while (Volatile.Read(ref _threadCounter) > 0 || !_targetPathQueue.IsEmpty)
        {
            if (_targetPathQueue.TryDequeue(out string path))
            {
                Interlocked.Increment(ref _threadCounter);
                Task.Run(() => WalkDrive(path));
            }
        }
    }

    private static void WalkDrive(string directory)
    {
        foreach (string file in Directory.GetFiles(directory))
        {
            //If file meets conditions, delete
        }
        string[] subDirectories = Directory.GetDirectories(directory);
        if (subDirectories.Length != 0)
        {
            foreach (string subDirectory in subDirectories)
            {
                _targetPathQueue.Enqueue(subDirectory);
            }
        }
        else { } //do other stuff
        Interlocked.Decrement(ref _threadCounter);
    }
}
My question is: is it safe/worth it to just initialize the new tasks from within the already-running tasks, to avoid wasting processor time monitoring the queue? Something that looks like this:
static class DriveHandler
{
    internal static readonly List<string> fixedDrives = GetFixedDrives();

    private static int _threadCounter = 0;

    internal static void WipeDrives()
    {
        foreach (string driveLetter in fixedDrives)
        {
            Interlocked.Increment(ref _threadCounter);
            Task.Run(() => WalkDrive(driveLetter));
        }
        while (Volatile.Read(ref _threadCounter) > 0)
        {
            Thread.Sleep(5000);
        }
    }

    private static void WalkDrive(string directory)
    {
        foreach (string file in Directory.GetFiles(directory))
        {
            //If file meets conditions, delete
        }
        string[] subDirectories = Directory.GetDirectories(directory);
        if (subDirectories.Length != 0)
        {
            foreach (string subDirectory in subDirectories)
            {
                Interlocked.Increment(ref _threadCounter);
                Task.Run(() => WalkDrive(subDirectory));
            }
        }
        else { } //do other stuff
        Interlocked.Decrement(ref _threadCounter);
    }
}
I of course need every task to die once it's done. Will doing things this way make the old tasks parents of the new ones, and keep them alive until all their children have finished?
Many thanks!
CodePudding user response:
First problem:
Task.Run(() => WalkDrive(subDirectory));
This is fire-and-forget, which is not a good thing to do in this context. Why? Because chances are you have far more files and paths on disk than the machine has CPU and memory capacity for (tasks consume memory as well, not just CPU). Fire-and-forget means, as the name implies, that you keep spawning tasks without ever awaiting them.
My question is, is it safe/worth it to just initialize the new tasks from within the already running tasks to avoid wasting processor time monitoring the queue?
It's valid, nothing prevents you from doing that, but you are already wasting resources. Why spawn a new task each time? You've already got one running; just make it a long-running background task and keep it alive, so you have just two threads: one (I assume) the UI/user-facing thread, and one doing the work. All this interlocking and task spawning is going to hurt your performance and waste resources on CPU and memory allocations.
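As a minimal sketch of that idea (the names SingleWorkerSketch and StartWorker, and the queue field, are mine for illustration, not code from the question):

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

static class SingleWorkerSketch
{
    private static readonly ConcurrentQueue<string> _targetPathQueue = new ConcurrentQueue<string>();

    internal static Task StartWorker(IEnumerable<string> roots, CancellationToken cancellationToken)
    {
        foreach (string root in roots)
        {
            _targetPathQueue.Enqueue(root);
        }
        //LongRunning hints the scheduler to give this loop its own thread
        //instead of occupying a thread-pool thread indefinitely.
        return Task.Factory.StartNew(() =>
        {
            while (!cancellationToken.IsCancellationRequested
                   && _targetPathQueue.TryDequeue(out string path))
            {
                foreach (string file in Directory.EnumerateFiles(path))
                {
                    //If file meets conditions, delete
                }
                foreach (string subDirectory in Directory.EnumerateDirectories(path))
                {
                    _targetPathQueue.Enqueue(subDirectory);
                }
            }
            //With a single worker there are no other producers, so an empty
            //queue here means the whole walk is complete.
        }, cancellationToken, TaskCreationOptions.LongRunning, TaskScheduler.Default);
    }
}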
If you want to speed things up with parallel execution, you could keep the concurrent queue of paths but run only, say, 10-100 concurrent tasks max, or whatever limit suits you. At least then you have an upper bound: you control how much the code does in parallel.
While the concurrent queue is not empty and no one has requested cancellation:
- Start from the base path
- Enumerate all sub-paths and enqueue them into the concurrent queue
- Process the files inside the current path
- Set the current base path to the next item dequeued from the queue
- Start all over again.
You just start the maximum number of concurrent tasks and that's it.
Your main loop/while condition would look something like:
private async Task StartAsync(CancellationToken cancellationToken)
{
    var tasks = new List<Task>();
    for (int i = 0; i < MaxConcurrentTasks; i++)
    {
        tasks.Add(Task.Run(() => ProcessPath(initialPathHere, cancellationToken), cancellationToken));
    }
    await Task.WhenAll(tasks);
}
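A hypothetical call site, just to show the wiring (the CancellationTokenSource and the Ctrl+C handler are my own example, not part of the question's code):

using var cts = new CancellationTokenSource();
Console.CancelKeyPress += (_, e) => { e.Cancel = true; cts.Cancel(); };
await StartAsync(cts.Token);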
And then something along these lines:
private static readonly ConcurrentQueue<string> concurrentQueue = new ConcurrentQueue<string>();

private static async Task ProcessPath(string path, CancellationToken cancellationToken)
{
    while (!cancellationToken.IsCancellationRequested)
    {
        foreach (var subPath in System.IO.Directory.EnumerateDirectories(path))
        {
            concurrentQueue.Enqueue(subPath);
        }
        //Once finished, process files in the current path
        foreach (var file in System.IO.Directory.EnumerateFiles(path))
        {
            //If file meets conditions, delete
        }
        //The queue may be momentarily empty even though other workers
        //are still producing paths; see the note below.
        if (!concurrentQueue.TryDequeue(out path))
        {
            break;
        }
    }
}
I haven't run this through a compiler, but that's roughly how a good algorithm would do it in my opinion. Also keep in mind that a task can find the queue momentarily empty while other tasks are still enqueuing new paths, which is why the dequeue uses TryDequeue instead of assuming an item is always available; a real implementation would want a smarter termination condition than simply breaking out on the first empty read:
if (!concurrentQueue.TryDequeue(out path)) { break; }
Final notes:
- Consider the trade-offs between tasks and Parallel.Invoke/Parallel.ForEach (see the sketch after this list).
- Consider using BackgroundServices; they are fine-tuned for long-running work. Whether they fit depends on your code and requirements.
- If you are after a performance gain, remember the golden rule: measure early. Start by measuring and have some metrics at hand, so that when you later want to speed things up you at least know what the code can do right now; then you refactor, measure again, and compare, and you will know whether you are getting closer to or further from your target.
- Make sure you get concurrency/parallel processing right, otherwise it's going to work against you, not with you.
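On the first point, here is a minimal sketch of the Parallel.ForEach alternative, processing the directory tree one level at a time; WipeDrivesInParallel and the level-draining logic are my own illustration, not code from the question:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

static void WipeDrivesInParallel(IEnumerable<string> fixedDrives)
{
    var pending = new ConcurrentQueue<string>(fixedDrives);
    var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };

    while (!pending.IsEmpty)
    {
        //Drain the current frontier of directories; one Parallel pass per tree level.
        var level = new List<string>();
        while (pending.TryDequeue(out string dir))
        {
            level.Add(dir);
        }

        Parallel.ForEach(level, options, directory =>
        {
            foreach (string file in Directory.EnumerateFiles(directory))
            {
                //If file meets conditions, delete
            }
            foreach (string subDirectory in Directory.EnumerateDirectories(directory))
            {
                pending.Enqueue(subDirectory); //becomes the next level
            }
        });
    }
}

The upside of this shape is that Parallel.ForEach owns the thread management, so there is no hand-rolled counter to get wrong; the trade-off is that each level of the tree must finish before the next one starts.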