Is there a simple way to remove "padding" fields from numpy.dtype.descr?

Context

Since NumPy 1.16, indexing a structured array with multiple fields returns an array whose dtype keeps the item size of the original, which leads to extra "padding". From the NumPy documentation:

The new behavior as of Numpy 1.16 leads to extra “padding” bytes at the location of unindexed fields compared to 1.15. You will need to update any code which depends on the data having a “packed” layout.

This can lead to issues, e.g. if you want to add fields to the array in question later on:

import numpy as np
import numpy.lib.recfunctions


a = np.array(
    [
        (10.0, 13.5, 1248, -2),
        (20.0, 0.0, 0, 0),
        (30.0, 0.0, 0, 0),
        (40.0, 0.0, 0, 0),
        (50.0, 0.0, 0, 999)
    ], dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')]
)  # some array stolen from here: https://stackoverflow.com/a/37081693/5472354
print(a.shape, a.dtype, a.dtype.names, a.dtype.descr)
# all good so far

b = a[['x', 'i']]  # for further processing I only need certain fields
print(b.shape, b.dtype, b.dtype.names, b.dtype.descr)
# you will only notice the extra padding in the descr

# b = np.lib.recfunctions.repack_fields(b)
# workaround

# now when I add fields, this becomes an issue
c = np.empty(b.shape, dtype=b.dtype.descr + [('c', 'i4')])
c[list(b.dtype.names)] = b
c['c'] = 1

print(c.dtype.names)
print(c['f1'])
# the void fields are filled with raw data and were given proper names
# that can be accessed

One workaround is numpy.lib.recfunctions.repack_fields, which removes the padding. I will use it from now on, but I still need a fix for my existing code. (There can be issues with recfunctions, as the module may not be found through np.lib alone; that is the case for me, hence the additional import numpy.lib.recfunctions statement.)
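For reference, a minimal sketch of that workaround applied to the add-a-field pattern. It uses a shortened copy of the array above; the alias rfn and the variable name packed are choices made for this illustration only.

```python
import numpy as np
from numpy.lib import recfunctions as rfn  # explicit import, as discussed above

a = np.array(
    [(10.0, 13.5, 1248, -2), (20.0, 0.0, 0, 0)],
    dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')],
)
b = a[['x', 'i']]              # padded: keeps the item size of a
packed = rfn.repack_fields(b)  # packed copy without the padding bytes

# The packed descr contains no void entries, so it can be extended directly.
c = np.empty(packed.shape, dtype=packed.dtype.descr + [('c', 'i4')])
c[list(packed.dtype.names)] = packed
c['c'] = 1

print(b.dtype.itemsize, packed.dtype.itemsize)
print(c.dtype.names)
```

With the padding gone, c should only carry the fields x, i and c.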

Question

This part of the code is what I used to add fields to an array (based on this):

c = np.empty(b.shape, dtype=b.dtype.descr + [('c', 'i4')])
c[list(b.dtype.names)] = b
c['c'] = 1

Now that I know of it, numpy.lib.recfunctions.require_fields may be more appropriate for adding the fields. However, I would still need a way to remove the empty fields from b.dtype.descr:

[('x', '<f8'), ('', '|V8'), ('i', '<i8'), ('', '|V8')]

This is just a list of tuples, so I guess I could construct a more or less awkward way (along the lines of descr.remove(('', '|V8'))) to deal with this, but I was wondering if there is a better way, especially since the size of the voids depends on the number of left-out fields, e.g. from V8 to V16 if there are two in a row and so on (instead of a new void for each left-out field). So the code would become real clunky real fast.
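To make the varying void sizes concrete, here is a small sketch that prints the descr for a few field selections. It assumes NumPy 1.16 or later and uses a zero-filled array with the same dtype as above.

```python
import numpy as np

dt = np.dtype([('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')])
a = np.zeros(2, dtype=dt)

# Adjacent left-out fields are merged into one void entry,
# and a left-out leading field shows up as a void at the start.
for names in (['x', 'i'], ['x', 'j'], ['y', 'i']):
    print(names, a[names].dtype.descr)
```

Selecting x and j skips two neighbouring fields, so the descr should contain a single V16 entry instead of two V8 entries.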

Answer

In [237]: a = np.array(
     ...:     [
     ...:         (10.0, 13.5, 1248, -2),
     ...:         (20.0, 0.0, 0, 0),
     ...:         (30.0, 0.0, 0, 0),
     ...:         (40.0, 0.0, 0, 0),
     ...:         (50.0, 0.0, 0, 999)
     ...:     ], dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')]
     ...:     )
In [238]: a
Out[238]: 
array([(10., 13.5, 1248,  -2), (20.,  0. ,    0,   0),
       (30.,  0. ,    0,   0), (40.,  0. ,    0,   0),
       (50.,  0. ,    0, 999)],
      dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')])

the b view:

In [240]: b = a[['x','i']]
In [241]: b
Out[241]: 
array([(10., 1248), (20.,    0), (30.,    0), (40.,    0), (50.,    0)],
      dtype={'names':['x','i'], 'formats':['<f8','<i8'], 'offsets':[0,16], 'itemsize':32})

the repacked copy (rf is numpy.lib.recfunctions, e.g. via import numpy.lib.recfunctions as rf):

In [243]: c = rf.repack_fields(b)
In [244]: c
Out[244]: 
array([(10., 1248), (20.,    0), (30.,    0), (40.,    0), (50.,    0)],
      dtype=[('x', '<f8'), ('i', '<i8')])
In [245]: c.dtype
Out[245]: dtype([('x', '<f8'), ('i', '<i8')])

your overly padded attempt at adding a field:

In [247]: d = np.empty(b.shape, dtype=b.dtype.descr + [('c', 'i4')])
     ...: d[list(b.dtype.names)] = b
     ...: d['c'] = 1
In [248]: d
Out[248]: 
array([(10., b'\x00\x00\x00\x00\x00\x00\x00\x00', 1248, b'\x00\x00\x00\x00\x00\x00\x00\x00', 1),
       (20., b'\x00\x00\x00\x00\x00\x00\x00\x00',    0, b'\x00\x00\x00\x00\x00\x00\x00\x00', 1),
       ...],
      dtype=[('x', '<f8'), ('f1', 'V8'), ('i', '<i8'), ('f3', 'V8'), ('c', '<i4')])

My first attempt at making a dtype that does not include the void fields. I don't know whether simply testing for V is robust enough:

In [253]: [des for des in b.dtype.descr if not 'V' in des[1]]
Out[253]: [('x', '<f8'), ('i', '<i8')]
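On the robustness question, one case where the V test may be too aggressive is a dtype that holds a genuine, named void field. The sketch below uses a made-up dtype with a field called raw (not part of the original example) and compares the V test with a test on the empty name that padding entries carry.

```python
import numpy as np

# Hypothetical dtype: 'raw' is a real void field, not padding.
dt = np.dtype([('x', '<f8'), ('raw', 'V4')])

by_kind = [d for d in dt.descr if 'V' not in d[1]]  # tests the format string
by_name = [d for d in dt.descr if d[0] != '']       # tests the field name

print(by_kind)
print(by_name)
```

The format-string test drops raw along with any padding, while the name test keeps it, since only padding entries have an empty name.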

And make a new dtype from that:

In [254]: d_dtype = _ + [('c','i4')]

All of this is normal python list and tuple manipulation. I've seen that in other recfunctions. I suspect repack_fields does something like this.

Now we make a new array with the simpler dtype:

In [255]: d = np.empty(b.shape, dtype=d_dtype)
In [256]: d[list(b.dtype.names)] = b
     ...: d['c'] = 1
In [257]: d
Out[257]: 
array([(10., 1248, 1), (20.,    0, 1), (30.,    0, 1), (40.,    0, 1),
       (50.,    0, 1)], dtype=[('x', '<f8'), ('i', '<i8'), ('c', '<i4')])
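The steps above can be wrapped into a small helper. This is only a sketch: the function name add_field_packed and its parameters are made up for this illustration, and it filters padding by its empty name rather than by the V test.

```python
import numpy as np

def add_field_packed(arr, name, fmt, fill):
    """Return a packed copy of arr with one extra field set to fill."""
    # Padding entries in descr have an empty name; keep everything else.
    packed_descr = [d for d in arr.dtype.descr if d[0] != '']
    out = np.empty(arr.shape, dtype=packed_descr + [(name, fmt)])
    out[list(arr.dtype.names)] = arr  # copy the existing fields
    out[name] = fill
    return out

a = np.array(
    [(10.0, 13.5, 1248, -2), (20.0, 0.0, 0, 0)],
    dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')],
)
b = a[['x', 'i']]
d = add_field_packed(b, 'c', 'i4', 1)
print(d.dtype)
print(d)
```

It does not try to handle nested or titled fields specially; for those, repack_fields is probably the safer route.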

I've extracted from repack_fields the code that constructs a new, un-padded, dtype:

In [262]: def foo(a):
     ...:     fieldinfo = []
     ...:     for name in a.names:
     ...:         tup = a.fields[name]
     ...:         fmt = tup[0]
     ...:         if len(tup) == 3:
     ...:             name = (tup[2], name)
     ...:         fieldinfo.append((name, fmt))
     ...:     print(fieldinfo)
     ...:     dt = np.dtype(fieldinfo)
     ...:     return dt
     ...: 
     ...: 
In [263]: foo(b.dtype)
[('x', dtype('float64')), ('i', dtype('int64'))]
Out[263]: dtype([('x', '<f8'), ('i', '<i8')])

This works from dtype.fields rather than dtype.descr. The former is a dict (a mappingproxy), the latter a list.

In [274]: b.dtype
Out[274]: dtype({'names':['x','i'], 'formats':['<f8','<i8'], 'offsets':[0,16], 'itemsize':32})
In [275]: b.dtype.descr
Out[275]: [('x', '<f8'), ('', '|V8'), ('i', '<i8'), ('', '|V8')]
In [276]: b.dtype.fields
Out[276]: mappingproxy({'x': (dtype('float64'), 0), 'i': (dtype('int64'), 16)})
In [277]: b.dtype.fields['x']
Out[277]: (dtype('float64'), 0)

another way of getting just the valid descr tuples from b.dtype:

In [278]: [des for des in b.dtype.descr if des[0] in b.dtype.names]
Out[278]: [('x', '<f8'), ('i', '<i8')]
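As a quick cross-check, the following sketch builds the packed dtype three ways (from descr, from fields, and with repack_fields applied to the dtype itself) and prints the results side by side. The variable names are chosen for this illustration, and the array is a zero-filled stand-in for a.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

a = np.zeros(2, dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')])
b = a[['x', 'i']]

# 1. Filter descr by the real field names.
from_descr = np.dtype([d for d in b.dtype.descr if d[0] in b.dtype.names])
# 2. Rebuild from the fields mapping, ignoring the stored offsets.
from_fields = np.dtype([(n, b.dtype.fields[n][0]) for n in b.dtype.names])
# 3. Let repack_fields do it; it accepts a dtype as well as an array.
from_repack = rfn.repack_fields(b.dtype)

for dt in (from_descr, from_fields, from_repack):
    print(dt.descr, dt.itemsize)
```

For this simple dtype all three should agree; they may differ once titles or nested structured fields are involved.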