I want to create a class that behaves like a numpy array but has additional methods/attributes. I have been reading numpy's guide on subclassing ndarray without fully understanding it. That page includes the following example:
import numpy as np

class RealisticInfoArray(np.ndarray):

    def __new__(cls, input_array, info=None):
        # Input array is an already formed ndarray instance
        # We first cast to be our class type
        obj = np.asarray(input_array).view(cls)
        # add the new attribute to the created instance
        obj.info = info
        # Finally, we must return the newly created object:
        return obj

    def __array_finalize__(self, obj):
        # see InfoArray.__array_finalize__ for comments
        if obj is None: return
        self.info = getattr(obj, 'info', None)
I am confused as to why the lines
    obj = np.asarray(input_array).view(cls)
    # add the new attribute to the created instance
    obj.info = info
do not raise
AttributeError: 'numpy.ndarray' object has no attribute 'info'
I have read in Add an attribute to a Numpy array in runtime that it is related to numpy arrays being implemented in C. Is that the end of the story? How does Python "know" np.array is implemented in C and not a Python class which you can easily add new attributes to?
Answer:
C-implemented classes have to go out of their way to have a __dict__ (which is where dynamically defined attributes are stored). They can do it, but they usually don't unless they're trying to simulate some other type that allows it (e.g. functools.partial allows you to assign arbitrary attributes because regular functions allow it, and it's trying to stay compatible). They usually don't need one, because they have more efficient ways to store their predefined set of attributes (usually as raw values or pointers in the instance's C struct, directly after the PyObject header).
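As a quick illustration of the difference described above, here is a minimal sketch using only the standard library. It assumes CPython, where both float and functools.partial are implemented in C; the variable names are made up for the example.

```python
import functools

# float is a C-implemented type whose instances carry no __dict__.
value = 1.5
has_dict = hasattr(value, "__dict__")

try:
    value.info = "metadata"
    float_accepts_attrs = True
except AttributeError:
    # No __dict__ means there is nowhere to store a new attribute.
    float_accepts_attrs = False

# functools.partial opts in to a __dict__, so arbitrary attributes
# work on it just as they do on a regular function.
parse_binary = functools.partial(int, base=2)
parse_binary.info = "parses base-2 strings"

print(has_dict, float_accepts_attrs, parse_binary.info)
```

The float rejects the assignment while the partial object accepts it, even though neither type is written in Python.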
Omitting the __dict__ saves a pointer's worth of memory overhead (4-8 bytes) per instance, plus the cost of the actual dict itself (104 bytes even for an empty __dict__ on 64 bit CPython 3.9.5). For simple types that you create many instances of, including the __dict__ when it's almost never used massively increases the overhead.

For example, a CPython 3.9.5 x64 float consumes 24 bytes to store 8 bytes of "real" data, meaning 16 bytes is overhead. If it allowed arbitrary attribute assignment, the overhead would jump from 16 to 24 bytes even if __dict__ was created lazily. If it wasn't created lazily (to speed up other code by removing a check for "allows __dict__ but it might not be initialized yet" that would have to be performed on every access), the overhead would jump from 24 to 128 bytes (plus twice the opportunities for allocator overhead to waste bytes that aren't strictly allocated, but get lost to round-off and fragmentation issues), all for just 8 bytes of "real" data.

Storing five million floats would go from the raw C cost of 40 MB, to the __dict__-less CPython cost of 120 MB (ignoring the container that actually holds them; that would add at least 40 MB), to 680 MB, all on the off-chance you might want to define an arbitrary attribute on one of them.
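The arithmetic behind those totals can be checked with a short sketch. The byte counts below are copied from the figures quoted above for 64-bit CPython 3.9.5 and are assumptions for any other build; sys.getsizeof is printed only so you can compare against your own interpreter.

```python
import sys

COUNT = 5_000_000        # number of floats being stored
RAW_DOUBLE = 8           # bytes of "real" data in a C double
FLOAT_OBJECT = 24        # quoted size of a CPython float object
DICT_POINTER = 8         # extra pointer slot needed to reference a __dict__
EMPTY_DICT = 104         # quoted size of an empty instance __dict__

# Total memory in MB (10**6 bytes) for each scenario.
raw_mb = COUNT * RAW_DOUBLE / 1e6
plain_mb = COUNT * FLOAT_OBJECT / 1e6
eager_dict_mb = COUNT * (FLOAT_OBJECT + DICT_POINTER + EMPTY_DICT) / 1e6

print(raw_mb, plain_mb, eager_dict_mb)

# Size of a float on the interpreter actually running this script.
print(sys.getsizeof(1.5))
```

Each float with an eagerly created __dict__ would cost 24 + 8 + 104 = 136 bytes, which is where the 680 MB total comes from.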
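User-defined classes on the other hand have __dict__ by default (it's the only place they store attributes by default, whether defined in __init__ or added by consumers of the class manually), and only omit it when the class, and all its parent classes, define a class-level __slots__ (and only if all of them omit '__dict__' from their __slots__). This applies to Python-level subclasses of C types too: RealisticInfoArray is defined with a class statement and no __slots__, so its instances get a __dict__, and obj.info = info succeeds because .view(cls) returns an instance of that subclass rather than a plain ndarray.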
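The __slots__ rules above can be seen in a small sketch. All class names here are hypothetical and exist only for illustration.

```python
class Plain:
    """Ordinary class: every instance gets a __dict__."""
    def __init__(self):
        self.x = 1


class Slotted:
    """Declares __slots__ without '__dict__': no per-instance __dict__."""
    __slots__ = ("x",)

    def __init__(self):
        self.x = 1


class SlottedWithDict:
    """Lists '__dict__' in __slots__, so arbitrary attributes still work."""
    __slots__ = ("x", "__dict__")


class Child(Slotted):
    """Subclass without its own __slots__: the __dict__ comes back."""


plain, slotted, child = Plain(), Slotted(), Child()
plain.extra = "ok"   # stored in plain.__dict__
child.extra = "ok"   # stored in child.__dict__

try:
    slotted.extra = "fails"
    slotted_accepts_extra = True
except AttributeError:
    slotted_accepts_extra = False

flexible = SlottedWithDict()
flexible.extra = "ok"

print(hasattr(slotted, "__dict__"), slotted_accepts_extra)
```

Only Slotted rejects the new attribute, because it is the only class in whose hierarchy every level defines __slots__ without '__dict__'.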
To answer your specific question "How does Python "know" np.array is implemented in C and not a Python class which you can easily add new attributes to?": at least for CPython, it tests whether tp_dictoffset on the instance's class is non-zero. If it's zero, that class's instances lack __dict__ and it's not legal to add arbitrary attributes. If it's non-zero, it tells the interpreter how many bytes from the start (or end, when negative) of the PyObject header it needs to look to find the __dict__ pointer. tp_dictoffset is initialized when the class is defined, manually in the case of C-implemented classes that want to support arbitrary attributes, and on your behalf by the interpreter machinery for user-defined classes.
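CPython exposes tp_dictoffset to Python code as the __dictoffset__ attribute of a type, so the check can be observed directly. This is a minimal sketch relying on that CPython implementation detail; the exact non-zero values vary between versions, so only zero versus non-zero is meaningful. TaggedFloat is a hypothetical subclass standing in for an ndarray subclass.

```python
class Plain:
    """Ordinary user-defined class."""


class TaggedFloat(float):
    """Python-level subclass of a C type, analogous to subclassing ndarray."""


offsets = {
    "float": float.__dictoffset__,              # C type without a __dict__
    "Plain": Plain.__dictoffset__,              # set up by the interpreter
    "TaggedFloat": TaggedFloat.__dictoffset__,  # subclass gains a __dict__
}
print(offsets)

# Because TaggedFloat has a non-zero dict offset, attribute assignment works
# while the object still behaves like a float.
tagged = TaggedFloat(2.5)
tagged.info = "units: metres"
print(tagged + 1, tagged.info)
```

The base float type reports zero, while both classes created with a class statement report a non-zero offset.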