Unable to modify or reproduce a ragged numpy array

Note: for convenience, I prepared a notebook that downloads the gt.mat and sample_images.tar.gz files needed to reproduce the problem.

I'm trying to reproduce examples from a text detection repo which depend on the following dataset. The current code expects 3 numpy arrays of object dtype, one of which holds ragged sequences.

I read the file using:

>>> from scipy.io import loadmat

>>> labels = loadmat('gt.mat')

The resulting dict contains a few keys. Three of them matter for this issue: charBB, imnames, and txt, which contain character bounding box coordinates, image paths, and words respectively, with 858750 entries each.

>>> labels.keys()
dict_keys(['__header__', '__version__', '__globals__', 'charBB', 'wordBB', 'imnames', 'txt'])
>>> {key: item.shape for (key, item) in labels.items() if isinstance(item, np.ndarray)}
{'charBB': (1, 858750),
 'wordBB': (1, 858750),
 'imnames': (1, 858750),
 'txt': (1, 858750)}
>>> labels['charBB'][0].shape
(858750,)
>>> labels['charBB'][0][0].shape  # first entry's character bboxes
(2, 4, 54)
>>> labels['charBB'][0][1].shape  # second entry (different shape from the first)
(2, 4, 60)

Let's say I need to select 100 images, download them, and use them to run some examples locally. I then have to extract the corresponding 100 items from each of the 3 arrays and discard the rest, which I do using:

import numpy as np
from scipy.io import loadmat
from pathlib import Path


def load_synth_data(data_dir, labels_file, keys):
    char_boxes, words, image_paths = [], [], []
    labels = {
        key: np.squeeze(value) if isinstance(value, np.ndarray) else value
        for (key, value) in loadmat(labels_file).items()
    }
    total = labels['charBB'].shape[0]
    for i, (image_char_boxes, image_words, image_path) in enumerate(
        zip(*(labels[key] for key in keys)), 1
    ):
        found = len(char_boxes)
        display = [
            f'Loading {i}/{total}',
            f'{np.around((i / total) * 100, 2)}%',
            f'found: {found}',
            f'{np.around((found / total) * 100, 2)}%',
        ]
        print(f"\r{' | '.join(display)}", end='')
        if (image_path := (Path(data_dir) / image_path.item())).exists():
            char_boxes.append(image_char_boxes)
            words.append(image_words)
            image_paths.append(image_path.as_posix())
    print()
    return char_boxes, words, image_paths

Then I use it to select the existing images which are also provided in the notebook:

subset_boxes, subset_words, subset_paths = load_synth_data('sample_images/SynthText', 'gt.mat', ['charBB', 'txt', 'imnames'])

It works fine for subset_words and subset_paths (despite them having ragged shapes):

>>> np.array(subset_words, dtype='O')
array([array(['that the\n  had   \nIraq and\nwhat the'], dtype='<U35'),
       array(['the              ', '  beep  \nwhen the', 'and              '],
             dtype='<U17')                                                    ,
       array(['to test', 'the\nthe', 'far    '], dtype='<U7'), ...,
       array(['>In  ', 'of an'], dtype='<U5'),
       array(['the'], dtype='<U3'), array(['are', 'new'], dtype='<U3')],
      dtype=object)

But it fails for subset_boxes:

>>> np.array(subset_boxes, dtype='O')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3444, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-15-6275cb854275>", line 1, in <module>
    np.array(subset_boxes, 'O')
ValueError: could not broadcast input array from shape (2,4,24) into shape (2,4)

The torch repo code expects a numpy array similar to the one produced by scipy.io.loadmat; otherwise I get an incompatible batch shape problem, which is outside the scope of my question. How can I create an array out of the selected examples that has the same structure as the original one of shape (858750,), i.e. a 1-D object array?

One workaround is to store the indices of the found items and use them to select the ones needed, as in labels['charBB'][[..., ...]]. That would solve the immediate problem, but it won't help recreate the array from a list or any other container.
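A minimal sketch of that index-based workaround, using a small made-up object array in place of labels['charBB'] (the shapes and the `found` indices are assumptions for illustration only):

```python
import numpy as np

# Stand-in for np.squeeze(labels['charBB']): a 1-D object array whose
# elements are (2, 4, n) arrays with a different n per entry.
char_boxes = np.empty(4, dtype=object)
for i, n in enumerate([3, 2, 5, 1]):
    char_boxes[i] = np.zeros((2, 4, n))

# Hypothetical indices of the entries whose image files exist locally.
found = [0, 2]

# Indexing with a list of integers keeps the result a 1-D object array.
subset = char_boxes[found]
print(subset.shape, subset.dtype)
print([item.shape for item in subset])
```

This keeps the original array's structure, but it only works while the full array is still available to index into.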

CodePudding user response：

Skipping over most of your description, the last error is produced by an action like

In [407]: np.array([np.ones([2,4,3]),np.zeros([2,4,2])],'O')
Traceback (most recent call last):
  File "<ipython-input-407-6868cb2349dc>", line 1, in <module>
    np.array([np.ones([2,4,3]),np.zeros([2,4,2])],'O')
ValueError: could not broadcast input array from shape (2,4,3) into shape (2,4)

Creating a "ragged array" this way fails when the leading dimension(s) of the elements are the same. np.array(...) first tries to make a multidimensional array. Failing that, the fallback is an object dtype array. But a combination of dimensions like this takes it down a dead end, possibly because it makes the wrong "guess" as to the desired return shape.
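For contrast, here is a small sketch that tries both kinds of input. `try_object_array` is a hypothetical helper written for this illustration, and the exact error text (or whether the second case raises at all) may depend on the NumPy version, so treat the second result as something to check locally:

```python
import numpy as np

def try_object_array(items):
    """Return (array, None) on success or (None, error message) on failure."""
    try:
        return np.array(items, dtype=object), None
    except ValueError as exc:
        return None, str(exc)

# Leading dimensions differ (2 vs 3): NumPy stops at the outer level
# and returns a 1-D object array holding the two arrays.
mixed, mixed_err = try_object_array([np.ones((2, 3)), np.ones((3, 2))])
print(mixed.shape, mixed.dtype)

# Leading dimensions match (2, 4) and only the last one differs:
# this is the situation from the traceback above.
same_lead, same_lead_err = try_object_array(
    [np.ones((2, 4, 3)), np.zeros((2, 4, 2))]
)
print(same_lead_err)
```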

The most reliable way to make an object dtype array from any mix of inputs is to initialize and assign:

In [408]: arr = np.empty(2,'O')
In [409]: arr
Out[409]: array([None, None], dtype=object)
In [410]: arr[:] = [np.ones([2,4,3]),np.zeros([2,4,2])]
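The same idea can be wrapped in a small helper. This is only a sketch: `to_object_array` is a hypothetical name, not a NumPy function. It assigns one element at a time, so it does not rely on how NumPy interprets a list on the right-hand side of a slice assignment, which is a cautious choice for the case where all items happen to share one shape.

```python
import numpy as np

def to_object_array(items):
    """Pack arbitrary items into a 1-D object array, one item per slot."""
    items = list(items)
    arr = np.empty(len(items), dtype=object)
    for i, item in enumerate(items):
        arr[i] = item  # stores the item itself; nothing is broadcast
    return arr

boxes = to_object_array([np.ones((2, 4, 3)), np.zeros((2, 4, 2))])
print(boxes.shape, boxes.dtype)    # (2,) object
print([b.shape for b in boxes])    # [(2, 4, 3), (2, 4, 2)]
```

Applied to the question, `to_object_array(subset_boxes)` should give a 1-D object array with one entry per selected image.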