Create Multiple Matplotlib Plots in Parallel

Consider the following script to generate plots in Matplotlib:

from matplotlib import pyplot as plt, gridspec
import cartopy.crs as crs
import cartopy.feature as cfeature
from cartopy.io.shapereader import Reader
from cartopy.feature import ShapelyFeature

states = '/some/path/usstates.shp'
states_feature = ShapelyFeature(Reader(states).geometries(), crs.LambertConformal(),
                                facecolor='none', edgecolor='black')

fig = plt.figure(figsize=(14,9))
gs = gridspec.GridSpec(ncols=1, nrows=2, width_ratios=[1], height_ratios=[0.15, 3.00])
gs.update(wspace=0.00, hspace=0.00)

ax1 = fig.add_subplot(gs[0, :])
plt.text(0.00, 0.50, "Some Fun Text", fontsize=15)
plt.text(1.00, 0.50, "Some Other Fun Text", fontsize=15, ha='right')

ax2 = fig.add_subplot(gs[1, :], projection=crs.LambertConformal())
ax2.set_extent(['region coords, not required for question'], crs=crs.LambertConformal())
ax2.add_feature(states_feature, linewidth=1.25)

plt.show()

I need to generate plots over multiple domains, i.e., ax2.set_extent() will be passed many different sets of lat/lon bounds. The number of sets is large enough that generating these plots one at a time is grossly inefficient.

My current solution is to run this script multiple times in parallel, passing the bounds in pre-compiled groups. However, this is inefficient and consumes a large amount of memory, particularly when the shapefile is several MB in size, because each script execution loads the shapefile again.

Is there an effective way to generate these plots in parallel, where components like shapefiles are only loaded into memory once?

CodePudding user response:

Following the link provided by Jody Klymak, this can be accomplished with Python's multiprocessing library. The referenced question offers a useful starting point, but some clarification may help other users. First, define two functions in your script: one for plotting and one for parallelization. Using the code from my initial question, this looks as follows:

import multiprocessing as mpr

# assumes the imports, shapefile path, and states_feature from the
# script above are already defined at module level
boundslist = ['some list of plotting boundaries, grouped as lists']

def plots(bounds):
    fig = plt.figure(figsize=(14,9))
    gs = gridspec.GridSpec(ncols=1, nrows=2, width_ratios=[1], height_ratios=[0.15, 3.00])
    gs.update(wspace=0.00, hspace=0.00)

    ax1 = fig.add_subplot(gs[0, :])
    plt.text(0.00, 0.50, "Some Fun Text", fontsize=15)
    plt.text(1.00, 0.50, "Some Other Fun Text", fontsize=15, ha='right')

    ax2 = fig.add_subplot(gs[1, :], projection=crs.LambertConformal())
    ax2.set_extent(bounds, crs=crs.LambertConformal())  # bounds is already a list
    ax2.add_feature(states_feature, linewidth=1.25)
    # give each process a unique output file so they don't overwrite one another
    fname = 'funplot_{}.png'.format('_'.join(str(b) for b in bounds))
    plt.savefig('/save/to/somepath/' + fname, dpi=125, bbox_inches="tight", pad_inches=0.05)
    plt.close()

def parallel(domains):
    processes = []
    for domain in domains:
        # args must be a tuple, hence the trailing comma
        pro = mpr.Process(target=plots, args=(domain,))
        processes.append(pro)
    for p in processes:
        p.start()
    for p in processes:
        p.join()

def main():
    parallel(boundslist)

if __name__ == "__main__":
    main()

Note that, going beyond the referenced question, I collect the Process objects into a list (processes), start them once the list has been built (p.start()), and then tell multiprocessing to wait for all of them to finish (p.join()) before proceeding. This is similar to how a pooled workflow is run with Python's threading module. Because the shapefile and states_feature are created once at module level, on platforms where multiprocessing forks its workers (the longtime default on Linux) the child processes inherit them copy-on-write rather than reloading the shapefile, which addresses the original memory concern; under the spawn start method (the default on Windows and macOS), each worker re-imports the module and reloads the shapefile. You may also want to govern how many plots are generated at once, either within the loop in the parallel() function or at initial list construction; one option is sketched below.
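A simple way to cap the number of simultaneous workers is a multiprocessing Pool, which distributes the bounds list over a fixed number of processes. This is a minimal sketch, not part of my original script; the pool size of 4 and the name parallel_pooled are arbitrary examples:

def parallel_pooled(domains, workers=4):
    # each worker process handles one set of bounds at a time,
    # so at most `workers` plots are being drawn simultaneously
    with mpr.Pool(processes=workers) as pool:
        pool.map(plots, domains)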

An important note: plotting gridded data with multiprocessing can get interesting, as I discovered while using rioxarray. If you plan only to read data and never write it, I highly suggest turning off file locks with rioxarray.open_rasterio(file, lock=False).
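For instance, a minimal sketch (the GeoTIFF path is a placeholder, and rioxarray must be installed):

import rioxarray

# open the raster read-only with file locking disabled so that
# parallel workers can read from it concurrently
grid = rioxarray.open_rasterio('/some/path/grid.tif', lock=False)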

What about speed? After running some timed tests, I found a 33-45% increase in script speed and a roughly 50% decrease in memory use. Your mileage will vary depending on the workload, but the gains are most noticeable with larger datasets.
