Creating plots with multiprocessing and time.strftime() doesn't work properly

I am trying to create plots with a script that runs in parallel using multiprocessing. I created two example scripts for this question, because the actual main script with the computing part would be too long. script0.py contains the multiprocessing part, which starts script1.py four times in parallel. In this example, script1.py just creates some random scatter plots.

script0.py:

import multiprocessing as mp
import os

def execute(process):
    os.system(f"python {process}")



if __name__ == "__main__":

    proc_num = 4
    process= []

    for _ in range(proc_num):
        process.append("script1.py")

    process_pool = mp.Pool(processes= proc_num)
    process_pool.map(execute, process)

script1.py:

# just a random scatterplot, but works for my example
import time
import numpy as np
import matplotlib.pyplot as plt
import os

dir_name = "stackoverflow_question"
plot_name = time.strftime("Plot %Hh%Mm%Ss")      # note the time.strftime() function

if not os.path.exists(f"{dir_name}"):
    os.mkdir(f"{dir_name}")

N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)

area = (30 * np.random.rand(N))**2

plt.scatter(x, y, s=area, c=colors, alpha=0.5)
# plt.show()
plt.savefig(f"{dir_name}/{plot_name}", dpi=300)

The important thing is that I name each plot after its creation time:

plot_name = time.strftime("Plot %Hh%Mm%Ss")

This creates a string like "Plot 16h39m22s". So far so good. Now to my actual problem: when the processes start in parallel, the plot names are sometimes identical, because time.strftime() returns the same timestamp in each of them. As a result, one instance of script1.py can overwrite a plot that another instance has already created.

In my working script, where I have this exact problem, I generate a lot of data, so I need to name my plots and CSVs according to the date and time they were generated.

I already thought of passing a variable to script1.py when it gets called, but I don't know how to do that, since I only just learned about the multiprocessing library. This variable would have to vary as well; otherwise I would run into the same problem.

Does anybody have a better idea of how I could do this? Thank you so much in advance.

CodePudding user response：

I propose these approaches:

Approach 1: (simple and recommended) If you can change the name, use Unix time (e.g. time.time() or time.time_ns()) instead of the date, or add decimals to the seconds. This makes a collision almost impossible.
Approach 2: Add the process ID to the filename (e.g. <filename_timestamp_processid>). Even if two processes write at the same time, the process ID distinguishes the files. If you want to remove the ID from the names at the end of execution, read the filenames and merge them; if there are collisions, adjust the filenames accordingly.
Approach 3: Like approach 2, but instead of changing the name, create a folder named after the process ID and put the outputs of that process in it. At the end of execution, merge the folders and correct any collisions.
Approach 4: (not recommended, difficult to manage and affects performance) Shared memory. Keep the last timestamp in a variable in shared memory and check against it before writing.
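
A minimal sketch of approaches 1 and 2 combined, assuming it is acceptable for the file name to grow longer. The helper name unique_plot_name and the exact name layout are made up for illustration; time.strftime(), time.time_ns() and os.getpid() are standard library calls.

```python
import os
import time


def unique_plot_name(prefix="Plot"):
    """Build a plot name from the clock time, a nanosecond timestamp and the process ID.

    Hypothetical helper for illustration; the name layout is an assumption.
    """
    stamp = time.strftime("%Hh%Mm%Ss")  # same human-readable part as in the question
    nanos = time.time_ns()              # approach 1: much finer than one-second resolution
    pid = os.getpid()                   # approach 2: differs between processes running at once
    return f"{prefix} {stamp}_{nanos}_pid{pid}"


if __name__ == "__main__":
    print(unique_plot_name())
```

In script1.py this would replace the plain time.strftime("Plot %Hh%Mm%Ss") call. The process ID is the part that separates instances started within the same second; the nanosecond value mainly separates repeated saves inside one process.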

CodePudding user response：

Welcome to the site. A couple ideas...

First, you are not following the guidelines in the multiprocessing module on how to use Pool. You should use it in a context manager, with(...)...

There are many examples out there. See the warning in red in the docs:

https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool

Also, using os.system calls is a little odd/unsafe. Why don't you just put your plotting routine into a standard function in the same module or a different module and import it? That would allow you to pass additional info (like a good label) to the function. I would expect something like this, where source is a data file or external source...

def make_plot(source, output_file_name, plot_label):
    # read the data source
    # make the plot
    # save it to the output path...
    pass  # placeholder so the stub is valid Python

As far as the label is concerned, of course there is going to be overlap if you start these processes within the same second. You can either append the process number to the label, add some other piece of info such as something from the data source, or use the same timestamp but put the output in unique folders, as suggested in the other answer.
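
A minimal sketch of the first option, assuming a plain task index is a good enough "process number". The helper make_labels is hypothetical; it builds all labels once in the parent process, so every task gets the same timestamp plus its own suffix.

```python
import time


def make_labels(count, timestamp=None):
    """Return `count` labels that share one timestamp and end in a distinct task index.

    Hypothetical helper for illustration; the "_<index>" suffix is an assumption.
    """
    if timestamp is None:
        timestamp = time.strftime("Plot %Hh%Mm%Ss")
    return [f"{timestamp}_{index}" for index in range(count)]


if __name__ == "__main__":
    for label in make_labels(4):
        print(label)
```

Because the labels are created before any worker starts, they cannot collide no matter how closely together the tasks run.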

I would think something like this...

Code:

from multiprocessing import Pool
import time

def f(data, output_folder, label):
    # here data is just an integer, in yours, it would be the source of the graph data...
    val = data * data
    # the below is just example...  you could just use your folder making/saving routine...
    return f'now we can save {label} in folder {output_folder} with value: {val}'

if __name__ == '__main__':
    with Pool(5) as p:
        folders = ['data1', 'data2', 'data3']
        labels = [time.strftime("Plot %Hh%Mm%Ss")]*3
        x_s = [1, 2, 3]
        output = p.starmap(f, zip(x_s, folders, labels))
        for result in output:
            print(result)

Output:

now we can save Plot 08h55m17s in folder data1 with value: 1
now we can save Plot 08h55m17s in folder data2 with value: 4
now we can save Plot 08h55m17s in folder data3 with value: 9
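
To turn the folder and label from that output into an actual file path, something like the following could be used inside the worker function. This is only a sketch: the helper output_path and the ".png" default are assumptions. os.makedirs(..., exist_ok=True) is used instead of the os.path.exists()/os.mkdir() pair from the question, since that pair can fail when two processes try to create the same folder at the same moment.

```python
import os
import tempfile


def output_path(base_dir, folder, label, extension=".png"):
    """Create base_dir/folder if needed and return the full file path for a plot.

    Hypothetical helper for illustration; the extension default is an assumption.
    """
    target_dir = os.path.join(base_dir, folder)
    # exist_ok=True: no error if another process created the folder first
    os.makedirs(target_dir, exist_ok=True)
    return os.path.join(target_dir, label + extension)


if __name__ == "__main__":
    # Use a throwaway directory so the demo leaves nothing behind.
    with tempfile.TemporaryDirectory() as base:
        print(output_path(base, "data1", "Plot 08h55m17s"))
```

The returned path can then be handed to plt.savefig() in the plotting function.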