I have some data that needs to be processed. The data is a tree. The processing goes like this: Take a node N. Check if all of its children have already been processed. If not, process them first. If yes, process N. So we go from top to bottom (recursively) to the leaves, then process leaves, then the leaves' parent nodes and so on, upwards until we arrive at the root again.
I know how to write a program that runs on ONE computer that takes the data (i.e. the root node) and processes it as described above. Here is a sketch in C#:
// We assume data is already there, so I do not provide constructor/setters.
public class Data
{
    public object OwnData { get; }
    public IList<Data> Children { get; }
}

// The main class. We just need to call Process once and wait for it to finish.
public class DataManager
{
    internal ISet<Data> ProcessedData { get; init; }

    public DataManager()
    {
        ProcessedData = new HashSet<Data>();
    }

    public void Process(Data rootData)
    {
        new DataHandler(this).Process(rootData);
    }
}

// The handler class that processes data recursively by spawning new instances.
// It informs the manager about data processed.
internal class DataHandler
{
    private readonly DataManager Manager;

    internal DataHandler(DataManager manager)
    {
        Manager = manager;
    }

    internal void Process(Data data)
    {
        if (Manager.ProcessedData.Contains(data))
            return;
        foreach (var subData in data.Children)
            new DataHandler(Manager).Process(subData);
        // ... do some processing of OwnData ...
        Manager.ProcessedData.Add(data);
    }
}
But how can I write the program so that I can distribute the work to a pool of computers (that are all in the same network, either some local one or the internet)? What do I need to do for that?
Some thoughts/ideas:
- The DataManager should run on one computer (the main one / the server?); the DataHandlers should run on all the others (the clients?).
- The DataManager needs to know the computers by some id (what id would that be?), which are set during construction of the DataManager.
- The DataManager must be able to create new instances of DataHandler (or kill them if something goes wrong) on these computers. How?
- The DataManager must know which computers currently have a running instance of DataHandler and which do not, so that it can decide on which computer it can spawn the next DataHandler (or, if none is free, wait).
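To make the dispatch idea in these thoughts concrete, here is a minimal manager-side sketch. It is written under assumptions: `IRemoteWorker` is a hypothetical abstraction for "spawn a handler on another machine and call me back when it is done" (how it actually reaches that machine, e.g. via a message queue or RPC, is left open), and the thread-safety of the callbacks is ignored. The key point is that a node becomes ready only after all of its children have been processed, so the manager tracks a pending-children count per node.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical abstraction for a handler running on another computer.
public interface IRemoteWorker
{
    bool IsBusy { get; }
    void Process(Data data, Action<Data> onDone); // calls back when finished
}

public class DistributedDataManager
{
    private readonly IList<IRemoteWorker> Workers;
    private readonly Queue<Data> Ready = new();                  // nodes whose children are all done
    private readonly Dictionary<Data, int> PendingChildren = new();
    private readonly Dictionary<Data, Data> Parent = new();

    public DistributedDataManager(IList<IRemoteWorker> workers) => Workers = workers;

    public void Process(Data root)
    {
        // Walk the tree once: leaves are immediately ready; inner nodes
        // wait until their pending-children count reaches zero.
        var stack = new Stack<Data>();
        stack.Push(root);
        while (stack.Count > 0)
        {
            var node = stack.Pop();
            PendingChildren[node] = node.Children.Count;
            if (node.Children.Count == 0)
                Ready.Enqueue(node);
            foreach (var child in node.Children)
            {
                Parent[child] = node;
                stack.Push(child);
            }
        }
        Dispatch();
    }

    private void Dispatch()
    {
        // Hand ready nodes to idle workers; if none is free, simply return
        // and wait for the next completion callback to call Dispatch again.
        while (Ready.Count > 0)
        {
            var worker = Workers.FirstOrDefault(w => !w.IsBusy);
            if (worker == null)
                return;
            worker.Process(Ready.Dequeue(), OnDone);
        }
    }

    private void OnDone(Data data)
    {
        // A parent becomes ready once its last child has been processed.
        if (Parent.TryGetValue(data, out var parent) && --PendingChildren[parent] == 0)
            Ready.Enqueue(parent);
        Dispatch();
    }
}
```

In a real implementation the `OnDone` callbacks would arrive on other threads (or over the network), so `Ready` and `PendingChildren` would need locking; that detail is omitted here to keep the shape of the algorithm visible.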
These are not requirements! I do not know if these ideas are viable.
In the above thoughts I assumed that each computer can have just one instance of DataHandler. I know this is not necessarily so (because of CPU cores and threads...), but in my use case it might actually be that way: the real DataManager and DataHandler are not standalone but run in a SolidWorks context. So in order to run any of that code, I need a running SolidWorks instance, and from my experience, more than one SolidWorks instance on the same Windows machine does not work (reliably).
From my half-knowledge it looks like what I need is a kind of multi-computer OS: in a single-computer setting, points 2, 3 and 4 are usually taken care of by the OS, and point 1 kind of is the OS (the OS=DataManager spawns processes=DataHandlers; the OS keeps track of data=ProcessedData, and the processes report back).
What exactly do I want to know?
- Pointers to terms, phrases or introductory articles that let me dive into the topic (so I can implement this myself). Possibly language-agnostic.
- Pointers to C# libraries/frameworks that fit this situation.
- Tips on what I should or shouldn't do (typical beginner issues). Possibly language-agnostic.
- Links to example/demonstration C# projects, e.g. on GitHub. (If not C#, VB is also fine.)
CodePudding user response:
You should read up on microservices and message queues, such as RabbitMQ, and the producer/consumer approach.
https://www.rabbitmq.com/getstarted.html
If you package your microservices with Docker, you can do some pretty nifty stuff.
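A minimal sketch of that producer/consumer approach in C#, roughly following the RabbitMQ .NET "work queues" tutorial (it assumes the `RabbitMQ.Client` NuGet package and a broker on `localhost`; the queue name "work" and the node-id payload are placeholders):

```csharp
using System.Text;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

var factory = new ConnectionFactory { HostName = "localhost" };
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();
channel.QueueDeclare(queue: "work", durable: false, exclusive: false,
                     autoDelete: false, arguments: null);

// Producer side (the DataManager machine): publish one message per node.
var body = Encoding.UTF8.GetBytes("node-id-42");
channel.BasicPublish(exchange: "", routingKey: "work",
                     basicProperties: null, body: body);

// Consumer side (a DataHandler machine): take one job at a time.
channel.BasicQos(prefetchSize: 0, prefetchCount: 1, global: false);
var consumer = new EventingBasicConsumer(channel);
consumer.Received += (sender, ea) =>
{
    var nodeId = Encoding.UTF8.GetString(ea.Body.ToArray());
    // ... process the node (e.g. in the local SolidWorks instance) ...
    channel.BasicAck(deliveryTag: ea.DeliveryTag, multiple: false);
};
channel.BasicConsume(queue: "work", autoAck: false, consumer: consumer);
```

Note that a plain queue only distributes independent jobs; the child-before-parent ordering in the question would still have to be enforced by the producer, e.g. by publishing a node only after all of its children have been acknowledged.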