Same code, multithreading is 900x faster than multiprocessing

Time:12-31

You will need to download fonts.zip and extract it in the same folder as the code for the examples to run.

The purpose of this code is to generate random text, then render and save each piece of text as an image. The code accepts letters and numbers, which are the populations of letters and numbers to generate text from, as well as character_frequency, which determines how many instances of each character will be generated. It then builds a long string and splits it into random-size substrings, which are stored in the TextGenerator.dataset attribute by TextGenerator.initialize_dataset.

Ex: for letters = 'abc', numbers = '123', and character_frequency = 3, the string 'aaabbbccc111222333' is generated, shuffled, and split into random-size substrings, e.g. ['a312c', '1b1', 'bba32c3a2c'].
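The initialization described above can be sketched as a standalone snippet; the names pool and dataset here are illustrative stand-ins, not the actual TextGenerator internals:

```python
import random

# Illustrative sketch of the dataset initialization described above.
letters, numbers, character_frequency = 'abc', '123', 3

# Build the long string ('aaabbbccc111222333') and shuffle its characters.
pool = list(''.join(c * character_frequency for c in letters + numbers))
random.shuffle(pool)

# Split the shuffled characters into random-size substrings.
dataset, i = [], 0
while i < len(pool):
    size = random.randint(1, 10)  # random substring length
    dataset.append(''.join(pool[i:i + size]))
    i += size

print(dataset)  # e.g. ['a312c', '1b1', 'bba32c3a2c'] (varies per run)
```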

Each word is then rendered and saved as an image by TextGenerator.write_images, which is the subject of this question.

There is an executor parameter, which is set to concurrent.futures.ThreadPoolExecutor or concurrent.futures.ProcessPoolExecutor when TextGenerator is constructed in the examples shown below, for demonstration purposes.

What is the issue?

The more character_frequency is increased, the longer the dataset stored in TextGenerator.dataset will be; however, that should not affect rendering performance. What actually happens: the higher character_frequency, the more time TextGenerator.write_images requires to finish with concurrent.futures.ProcessPoolExecutor. On the other hand, with everything else remaining the same, passing concurrent.futures.ThreadPoolExecutor instead keeps the time required constant, unaffected by character_frequency.

import random
import string
import tempfile
import textwrap
from concurrent.futures import (ProcessPoolExecutor, ThreadPoolExecutor,
                                as_completed)
from pathlib import Path
from time import perf_counter

import numpy as np
import pandas as pd
import cv2
from PIL import Image, ImageDraw, ImageFont


class TextGenerator:
    def __init__(
        self,
        fonts,
        character_frequency,
        executor,
        max_images=None,
        background_colors=((255, 255, 255),),
        font_colors=((0, 0, 0),),
        font_sizes=(25,),
        max_example_size=25,
        min_example_size=1,
        max_chars_per_line=80,
        output_dir='data',
        workers=1,
        split_letters=False,
    ):
        assert (
            min_example_size > 0
        ), f'`min_example_size` should be > 0`, got {min_example_size}'
        assert (
            max_example_size > 0
        ), f'`max_example_size` should be > 0`, got {max_example_size}'
        self.fonts = fonts
        self.character_frequency = character_frequency
        self.executor = executor
        self.max_images = max_images
        self.background_colors = background_colors
        self.font_colors = font_colors
        self.font_sizes = font_sizes
        self.max_example_size = max_example_size
        self.min_example_size = min_example_size
        self.max_chars_per_line = max_chars_per_line
        self.output_dir = Path(output_dir)
        self.workers = workers
        self.split_letters = split_letters
        self.digits = len(f'{character_frequency}')
        self.max_font = max(font_sizes)
        self.generated_labels = []
        self.dataset = []
        self.dataset_size = 0

    def render_text(self, text_lines):
        font = random.choice(self.fonts)
        font_size = random.choice(self.font_sizes)
        background_color = random.choice(self.background_colors)
        font_color = random.choice(self.font_colors)
        max_width, total_height = 0, 0
        font = ImageFont.truetype(font, font_size)
        line_sizes = {}
        for line in text_lines:
            width, height = font.getsize(line)
            line_sizes[line] = width, height
            max_width = max(width, max_width)
            total_height += height
        image = Image.new('RGB', (max_width, total_height), background_color)
        draw = ImageDraw.Draw(image)
        current_height = 0
        for line_text, dimensions in line_sizes.items():
            draw.text((0, current_height), line_text, font_color, font=font)
            current_height += dimensions[1]
        return np.array(image)

    def display_progress(self, example_idx):
        print(
            f'\rGenerating example {example_idx + 1}/{self.dataset_size}',
            end='',
        )

    def generate_example(self, text_lines, example_idx):
        text_box = self.render_text(text_lines)
        filename = (self.output_dir / f'{example_idx:0{self.digits}d}.jpg').as_posix()
        cv2.imwrite(filename, text_box)
        return filename, text_lines

    def create_dataset_pool(self, executor, example_idx):
        future_items = []
        for j in range(self.workers):
            if not self.dataset:
                break
            text = self.dataset.pop()
            if text.strip():
                text_lines = textwrap.wrap(text, self.max_chars_per_line)
                future_items.append(
                    executor.submit(
                        self.generate_example,
                        text_lines,
                        j + example_idx,
                    )
                )
        return future_items

    def write_images(self):
        i = 0
        with self.executor(self.workers) as executor:
            while i < self.dataset_size:
                future_items = self.create_dataset_pool(executor, i)
                for future_item in as_completed(future_items):
                    filename, text_lines = future_item.result()
                    if filename:
                        self.generated_labels.append(
                            {'filename': filename, 'label': '\n'.join(text_lines)}
                        )
                    self.display_progress(i)
                i += min(self.workers, self.dataset_size - i)
                if self.max_images and i >= self.max_images:
                    break

    def initialize_dataset(self, letters, numbers, space_freq):
        for characters in letters, numbers:
            dataset = list(
                ''.join(
                    letter * self.character_frequency
                    for letter in characters + ' ' * space_freq
                )
            )
            random.shuffle(dataset)
            self.dataset.extend(dataset)
        i = 0
        temp_dataset = []
        min_split_example_size = min(self.max_example_size, self.max_chars_per_line)
        total_letters = len(self.dataset)
        while i < total_letters - self.min_example_size:
            example_size = random.randint(self.min_example_size, self.max_example_size)
            example = ''.join(self.dataset[i : i + example_size])
            temp_dataset.append(example)
            i += example_size
            if self.split_letters:
                split_example = ' '.join(list(example))
                for sub_example in textwrap.wrap(split_example, min_split_example_size):
                    if (sub_example_size := len(sub_example)) >= self.min_example_size:
                        temp_dataset.append(sub_example)
                        i += sub_example_size
        self.dataset = temp_dataset
        self.dataset_size = len(self.dataset)

    def generate(self, letters, numbers, space_freq, fp='labels.csv'):
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.initialize_dataset(letters, numbers, space_freq)
        t1 = perf_counter()
        self.write_images()
        t2 = perf_counter()
        print(
            f'\ntotal time: {t2 - t1} seconds, character frequency '
            f'specified: {self.character_frequency}, type: {self.executor.__name__}'
        )
        pd.DataFrame(self.generated_labels).to_csv(self.output_dir / fp, index=False)


if __name__ == '__main__':
    out = Path(tempfile.mkdtemp())
    total_images = 15
    for char_freq in [100, 1000, 1000000]:
        for ex in [ThreadPoolExecutor, ProcessPoolExecutor]:
            g = TextGenerator(
                [
                    p.as_posix()
                    for p in Path('fonts').glob('*.ttf')
                ],
                char_freq,
                ex,
                max_images=total_images,
                output_dir=out,
                max_example_size=15,
                min_example_size=5,
            )
            g.generate(string.ascii_letters, '0123456789', 1)

Which produces the following results on my i5 mbp:

Generating example 15/649
total time: 0.0652076720000001 seconds, character frequency specified: 100, type: ThreadPoolExecutor
Generating example 15/656
total time: 1.1637316500000001 seconds, character frequency specified: 100, type: ProcessPoolExecutor
Generating example 15/6442
total time: 0.06430166800000015 seconds, character frequency specified: 1000, type: ThreadPoolExecutor
Generating example 15/6395
total time: 1.2626316840000005 seconds, character frequency specified: 1000, type: ProcessPoolExecutor
Generating example 15/6399805
total time: 0.05754961300000616 seconds, character frequency specified: 1000000, type: ThreadPoolExecutor
Generating example 15/6399726
total time: 45.18768219699999 seconds, character frequency specified: 1000000, type: ProcessPoolExecutor

0.05 seconds (threads) vs 45 seconds (processes) to save 15 images with character_frequency = 1000000. Why is it taking so long, and why is it affected by the character_frequency value? That value is independent of the rendering work and should only affect initialization time, which is exactly what happens with threads.

CodePudding user response:

Assuming I interpreted your code correctly, you are generating sample text whose size is controlled by the character_frequency value. The greater the value, the longer the text.

The text is generated in the main loop of your program. Then you schedule a set of tasks which receive said text and generate an image based on it.

As processes live in separate memory address spaces, the data needs to be sent to them through a pipe. This pipe is the bottleneck affecting your performance. Note also that because each task is submitted as the bound method self.generate_example, the whole TextGenerator instance, including the large dataset list, gets pickled and sent along for every task. The reason you see performance deteriorating as character_frequency grows is that more data needs to be serialized and sent through said pipe sequentially. Your workers are starving as they wait for the data to arrive.
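The serialization cost is visible with pickle alone; this is a hedged sketch in which the made-up Holder class stands in for TextGenerator:

```python
import pickle

# Hypothetical stand-in for TextGenerator: an object holding a large dataset.
class Holder:
    def __init__(self, n):
        self.dataset = ['x' * 10 for _ in range(n)]

    def work(self, item):
        return item

small, big = Holder(10), Holder(100_000)

# Pickling a bound method also pickles the instance it is bound to, dataset
# and all, which is what happens for every task submitted to a process pool.
print(len(pickle.dumps(small.work)), len(pickle.dumps(big.work)))
```

The second number is orders of magnitude larger, and that entire payload crosses the pipe per submitted task.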

This issue does not affect your pool of threads, as threads live in the same memory address space as the main process. Hence, the data does not need to be serialized and sent through the operating system.
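That threads receive arguments by reference, with no copy or serialization, can be checked directly with an identity test:

```python
from concurrent.futures import ThreadPoolExecutor

# A thread worker sees the very same object the main thread submitted
# (verified by identity), so argument size carries no transfer cost.
payload = ['x'] * 1_000_000

def worker(data):
    return data is payload

with ThreadPoolExecutor(max_workers=1) as executor:
    print(executor.submit(worker, payload).result())  # → True
```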

To speed up your program while using processes, you can either move the text generation logic into the workers themselves or write said text to one or more files, then let the worker processes open those files so you can leverage I/O parallelization. All your main process does then is point the workers to the right file position or file name.
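A minimal sketch of the first suggestion, assuming a hypothetical make_and_render worker: ship only a small seed to each process and let the worker regenerate its own text, instead of piping large objects from the main process.

```python
import random
from concurrent.futures import ProcessPoolExecutor

def make_and_render(seed, example_size=10):
    # Regenerate the text inside the worker from a tiny seed, so only the
    # seed (an int) crosses the process boundary, not the text itself.
    rng = random.Random(seed)
    text = ''.join(rng.choice('abc123 ') for _ in range(example_size))
    # ... the real worker would render `text` and save the image here ...
    return seed, text

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(make_and_render, range(15)))
    print(len(results))  # → 15
```

The same idea applies to the file-based variant: pass a file name or offset instead of a seed.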
