You will need to download fonts.zip and extract it in the same folder as the code for the examples to run.
The purpose of this code is to generate random text, render it, and save it as images. The code accepts letters and numbers, which are the populations of letters and digits to generate text from. It also accepts character_frequency, which determines how many instances of each character are generated. It then builds one long string and splits it into random-size substrings, which TextGenerator.initialize_dataset stores in the TextGenerator.dataset attribute.
Ex: for letters = 'abc', numbers = '123', character_frequency = 3, the string 'aaabbbccc111222333' is generated, shuffled, and split into random-size substrings, e.g. ['a312c', '1b1', 'bba32c3a2c'].
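A minimal sketch of that generation step, simplified for illustration: make_substrings and its seed parameter are hypothetical names, and unlike the full code below it shuffles letters and digits together and keeps the final short piece.

```python
import random


def make_substrings(letters, numbers, frequency, min_size, max_size, seed=None):
    """Repeat each character `frequency` times, shuffle, split at random sizes."""
    rng = random.Random(seed)
    chars = [c for c in letters + numbers for _ in range(frequency)]
    rng.shuffle(chars)
    pieces, i = [], 0
    while i < len(chars):
        size = rng.randint(min_size, max_size)
        # The last piece may be shorter than `size` if few characters remain.
        pieces.append(''.join(chars[i:i + size]))
        i += size
    return pieces


pieces = make_substrings('abc', '123', 3, 1, 10, seed=0)
print(pieces)
```

Every generated character ends up in exactly one substring, so joining the pieces gives back a permutation of the long string.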
Each substring is then rendered and saved as an image by TextGenerator.write_images, which is the subject of this question.
There is an executor parameter; in the examples shown below, concurrent.futures.ThreadPoolExecutor and concurrent.futures.ProcessPoolExecutor are each passed to TextGenerator for demonstration purposes.
What is the issue?
Increasing character_frequency makes the dataset stored in TextGenerator.dataset longer, but that shouldn't affect performance, because the number of images saved stays the same. What actually happens: the higher character_frequency is, the more time TextGenerator.write_images needs to finish with concurrent.futures.ProcessPoolExecutor. With everything else unchanged, passing concurrent.futures.ThreadPoolExecutor instead gives a constant running time that is not affected by character_frequency.
import random
import string
import tempfile
import textwrap
from concurrent.futures import (ProcessPoolExecutor, ThreadPoolExecutor,
                                as_completed)
from pathlib import Path
from time import perf_counter

import cv2
import numpy as np
import pandas as pd
from PIL import Image, ImageDraw, ImageFont


class TextGenerator:
    def __init__(
        self,
        fonts,
        character_frequency,
        executor,
        max_images=None,
        background_colors=((255, 255, 255),),
        font_colors=((0, 0, 0),),
        font_sizes=(25,),
        max_example_size=25,
        min_example_size=1,
        max_chars_per_line=80,
        output_dir='data',
        workers=1,
        split_letters=False,
    ):
        assert (
            min_example_size > 0
        ), f'`min_example_size` should be > 0, got {min_example_size}'
        assert (
            max_example_size > 0
        ), f'`max_example_size` should be > 0, got {max_example_size}'
        self.fonts = fonts
        self.character_frequency = character_frequency
        # Executor class (not an instance), e.g. ThreadPoolExecutor.
        self.executor = executor
        self.max_images = max_images
        self.background_colors = background_colors
        self.font_colors = font_colors
        self.font_sizes = font_sizes
        self.max_example_size = max_example_size
        self.min_example_size = min_example_size
        self.max_chars_per_line = max_chars_per_line
        self.output_dir = Path(output_dir)
        self.workers = workers
        self.split_letters = split_letters
        # Zero-padding width used for the output file names.
        self.digits = len(f'{character_frequency}')
        self.max_font = max(font_sizes)
        self.generated_labels = []
        self.dataset = []
        self.dataset_size = 0

    def render_text(self, text_lines):
        font = random.choice(self.fonts)
        font_size = random.choice(self.font_sizes)
        background_color = random.choice(self.background_colors)
        font_color = random.choice(self.font_colors)
        max_width, total_height = 0, 0
        font = ImageFont.truetype(font, font_size)
        line_sizes = {}
        # First pass: measure every line to size the canvas.
        for line in text_lines:
            # Note: `getsize` was removed in Pillow 10; this runs on older versions.
            width, height = font.getsize(line)
            line_sizes[line] = width, height
            max_width = max(width, max_width)
            total_height += height
        image = Image.new('RGB', (max_width, total_height), background_color)
        draw = ImageDraw.Draw(image)
        current_height = 0
        # Second pass: draw the lines one below the other.
        for line_text, dimensions in line_sizes.items():
            draw.text((0, current_height), line_text, font_color, font=font)
            current_height += dimensions[1]
        return np.array(image)

    def display_progress(self, example_idx):
        print(
            f'\rGenerating example {example_idx + 1}/{self.dataset_size}',
            end='',
        )

    def generate_example(self, text_lines, example_idx):
        text_box = self.render_text(text_lines)
        filename = (self.output_dir / f'{example_idx:0{self.digits}d}.jpg').as_posix()
        cv2.imwrite(filename, text_box)
        return filename, text_lines

    def create_dataset_pool(self, executor, example_idx):
        # Submit up to `workers` tasks, one per text popped from the dataset.
        future_items = []
        for j in range(self.workers):
            if not self.dataset:
                break
            text = self.dataset.pop()
            if text.strip():
                text_lines = textwrap.wrap(text, self.max_chars_per_line)
                future_items.append(
                    executor.submit(
                        self.generate_example,
                        text_lines,
                        j + example_idx,
                    )
                )
        return future_items

    def write_images(self):
        i = 0
        with self.executor(self.workers) as executor:
            while i < self.dataset_size:
                future_items = self.create_dataset_pool(executor, i)
                for future_item in as_completed(future_items):
                    filename, text_lines = future_item.result()
                    if filename:
                        self.generated_labels.append(
                            {'filename': filename, 'label': '\n'.join(text_lines)}
                        )
                    self.display_progress(i)
                i += min(self.workers, self.dataset_size - i)
                if self.max_images and i >= self.max_images:
                    break

    def initialize_dataset(self, letters, numbers, space_freq):
        # Repeat every character `character_frequency` times and shuffle,
        # separately for letters and for numbers.
        for characters in letters, numbers:
            dataset = list(
                ''.join(
                    letter * self.character_frequency
                    for letter in characters + ' ' * space_freq
                )
            )
            random.shuffle(dataset)
            self.dataset.extend(dataset)
        i = 0
        temp_dataset = []
        min_split_example_size = min(self.max_example_size, self.max_chars_per_line)
        total_letters = len(self.dataset)
        # Split the shuffled characters into random-size examples.
        while i < total_letters - self.min_example_size:
            example_size = random.randint(self.min_example_size, self.max_example_size)
            example = ''.join(self.dataset[i : i + example_size])
            temp_dataset.append(example)
            i += example_size
            if self.split_letters:
                split_example = ' '.join(list(example))
                for sub_example in textwrap.wrap(split_example, min_split_example_size):
                    if (sub_example_size := len(sub_example)) >= self.min_example_size:
                        temp_dataset.append(sub_example)
                        i += sub_example_size
        self.dataset = temp_dataset
        self.dataset_size = len(self.dataset)

    def generate(self, letters, numbers, space_freq, fp='labels.csv'):
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.initialize_dataset(letters, numbers, space_freq)
        # Only the image-writing step is timed, not the initialization.
        t1 = perf_counter()
        self.write_images()
        t2 = perf_counter()
        print(
            f'\ntotal time: {t2 - t1} seconds, character frequency '
            f'specified: {self.character_frequency}, type: {self.executor.__name__}'
        )
        pd.DataFrame(self.generated_labels).to_csv(self.output_dir / fp, index=False)


if __name__ == '__main__':
    out = Path(tempfile.mkdtemp())
    total_images = 15
    for char_freq in [100, 1000, 1000000]:
        for ex in [ThreadPoolExecutor, ProcessPoolExecutor]:
            g = TextGenerator(
                [
                    p.as_posix()
                    for p in Path('fonts').glob('*.ttf')
                ],
                char_freq,
                ex,
                max_images=total_images,
                output_dir=out,
                max_example_size=15,
                min_example_size=5,
            )
            g.generate(string.ascii_letters, '0123456789', 1)

Which produces the following results on my i5 mbp:
Generating example 15/649
total time: 0.0652076720000001 seconds, character frequency specified: 100, type: ThreadPoolExecutor
Generating example 15/656
total time: 1.1637316500000001 seconds, character frequency specified: 100, type: ProcessPoolExecutor
Generating example 15/6442
total time: 0.06430166800000015 seconds, character frequency specified: 1000, type: ThreadPoolExecutor
Generating example 15/6395
total time: 1.2626316840000005 seconds, character frequency specified: 1000, type: ProcessPoolExecutor
Generating example 15/6399805
total time: 0.05754961300000616 seconds, character frequency specified: 1000000, type: ThreadPoolExecutor
Generating example 15/6399726
total time: 45.18768219699999 seconds, character frequency specified: 1000000, type: ProcessPoolExecutor
That is 0.05 seconds (threads) vs 45 seconds (processes) to save 15 images with character_frequency = 1000000. Why does it take so long, and why is it affected by the character_frequency value, which is independent and should only affect the initialization time (which is exactly what happens with threads)?
Answer:
Assuming I interpreted your code correctly, you are generating sample text whose size is controlled by the character_frequency value. The greater the value, the longer the text.
The text is generated in the main loop of your program. Then you schedule a set of tasks which receive said text and generate an image based on it.
As processes live in separate memory address spaces, the text needs to be sent to them through a pipe. This pipe is the bottleneck affecting your performance. Performance deteriorates as character_frequency grows because more text needs to be serialized and sent through that pipe sequentially. Your workers are starving as they wait for the data to arrive.
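One way to see how much data goes through the pipe is to pickle a task payload by hand. In the posted code the submitted callable is the bound method self.generate_example, and a bound method is pickled together with its instance, so self.dataset travels with every task. The sketch below assumes a hypothetical Holder class standing in for TextGenerator, and payload_size is an illustrative helper, not part of any library.

```python
import pickle


class Holder:
    """Hypothetical stand-in for TextGenerator: it just owns a big list."""

    def __init__(self, n_items):
        self.dataset = [f'text-{i}' for i in range(n_items)]

    def work(self, text):
        return text.upper()


def payload_size(holder):
    # A process pool has to pickle the callable and its arguments per task.
    # The bound method `holder.work` drags the whole instance along,
    # including `holder.dataset`, even though the argument itself is tiny.
    return len(pickle.dumps((holder.work, ('abc',))))


small = payload_size(Holder(10))
large = payload_size(Holder(100_000))
print(small, large)
```

The argument is the same three-character string in both cases; only the size of the instance attribute changes, and the payload grows with it.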
This issue does not affect your pool of threads, as threads live in the same memory address space as the main process. Hence, the data does not need to be serialized and sent across your operating system.
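A small check of that claim, as a sketch: same_object is a hypothetical helper that runs in a worker thread and tests whether the object it received is the very object the main thread created.

```python
from concurrent.futures import ThreadPoolExecutor

dataset = [f'text-{i}' for i in range(100_000)]


def same_object(obj):
    # Runs in a worker thread; `obj` is a reference, not a copy.
    return obj is dataset


with ThreadPoolExecutor(max_workers=2) as executor:
    shared = executor.submit(same_object, dataset).result()

print(shared)  # True
```

No copy is made however large the list is, which matches the constant timings reported for ThreadPoolExecutor.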
To speed up your program while using processes, you can either move the text generation logic into the worker itself or write said text to one or more files. You then let the worker processes open these files themselves so you can leverage I/O parallelization. All your main process does is point the workers to the right file position or file name.
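The file-based variant could look roughly like this. It is a sketch under assumptions: write_texts and load_text are hypothetical helpers, the file name texts.bin is arbitrary, and the reading is done in-process here to keep the example short.

```python
import tempfile
from pathlib import Path


def write_texts(texts, path):
    """Write all texts to one file; return (offset, length) in bytes per text."""
    index = []
    with open(path, 'wb') as f:
        for text in texts:
            data = text.encode('utf-8')
            index.append((f.tell(), len(data)))
            f.write(data)
    return index


def load_text(path, offset, length):
    """Worker-side helper: read a single text by its position in the file."""
    with open(path, 'rb') as f:
        f.seek(offset)
        return f.read(length).decode('utf-8')


texts = ['a312c', '1b1', 'bba32c3a2c']
path = Path(tempfile.mkdtemp()) / 'texts.bin'
index = write_texts(texts, path)
restored = [load_text(str(path), offset, length) for offset, length in index]
print(restored)
```

Because load_text is a module-level function, submitting it to a ProcessPoolExecutor would pickle only a path and two integers per task, instead of an object that holds the whole dataset.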