Why does directly adding a new attribute to numpy arrays not work but doing so by subclassing does?

Time:11-13

I want to create a class that behaves like a numpy array but has additional methods/attributes. I have been reading numpy's guide on subclassing ndarray without fully understanding it. That page contains the following example:

import numpy as np

class RealisticInfoArray(np.ndarray):

    def __new__(cls, input_array, info=None):
        # Input array is an already formed ndarray instance
        # We first cast to be our class type
        obj = np.asarray(input_array).view(cls)
        # add the new attribute to the created instance
        obj.info = info
        # Finally, we must return the newly created object:
        return obj

    def __array_finalize__(self, obj):
        # see InfoArray.__array_finalize__ for comments
        if obj is None: return
        self.info = getattr(obj, 'info', None)

I am confused as to why the lines

        obj = np.asarray(input_array).view(cls)
        # add the new attribute to the created instance
        obj.info = info

do not raise

AttributeError: 'numpy.ndarray' object has no attribute 'info'

I have read in "Add an attribute to a Numpy array in runtime" that this is related to numpy arrays being implemented in C. Is that the end of the story? How does Python "know" np.array is implemented in C and not a Python class which you can easily add new attributes to?

CodePudding user response:

C-implemented classes have to go out of their way to have a __dict__ (which is where dynamically defined attributes are stored). They can do it, but they usually don't unless they're trying to mimic some other type that allows it (e.g. functools.partial lets you assign arbitrary attributes because regular functions do, and it's trying to stay compatible). They have more efficient ways to store their predefined set of attributes, usually as raw values or pointers at fixed offsets in the object's C struct.
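As a quick check of the above, here is a minimal sketch using only the standard library. float stands in for a typical C-implemented type, and the attribute name info is simply borrowed from the question:

```python
import functools

# float is implemented in C and reserves no room for a __dict__
x = 1.5
try:
    x.info = "metadata"
    float_accepts_attr = True
except AttributeError:
    float_accepts_attr = False

# functools.partial deliberately opts in to a __dict__
p = functools.partial(int, base=2)
p.info = "metadata"

print(float_accepts_attr)      # False
print(hasattr(x, "__dict__"))  # False
print(p.__dict__)              # {'info': 'metadata'}
```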

Omitting the __dict__ saves a pointer's worth of memory overhead (4-8 bytes) per instance, plus the cost of the actual dict itself (104 bytes even for an empty __dict__ on 64 bit CPython 3.9.5). For simple types that you create many instances of, including a __dict__ that is almost never used massively increases the overhead. For example, a CPython 3.9.5 x64 float consumes 24 bytes to store 8 bytes of "real" data, meaning 16 bytes are overhead. If it allowed arbitrary attribute assignment, the overhead would jump from 16 to 24 bytes even if the __dict__ were created lazily. If it weren't created lazily (to speed up other code by removing the "allows __dict__ but it might not be initialized yet" check that would otherwise have to be performed on every access), the overhead would jump from 24 to 128 bytes (plus twice the opportunities for the allocator to waste bytes that aren't strictly allocated but get lost to round-off and fragmentation), all for just 8 bytes of "real" data. Storing five million floats would go from the raw C cost of 40 MB, to the __dict__-less CPython cost of 120 MB (ignoring the container that actually holds them, which would add at least 40 MB), to 680 MB, all on the off-chance you might want to define an arbitrary attribute on one of them.
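The arithmetic behind those totals can be reproduced with a short sketch. The byte counts are the CPython 3.9.5 x64 figures quoted above, hard-coded as assumptions. The final sys.getsizeof call just reports what your own interpreter uses, which may differ:

```python
import sys

N = 5_000_000        # number of floats being stored
RAW = 8              # bytes of "real" data in a C double
FLOAT_SIZE = 24      # per-float size quoted above (CPython 3.9.5 x64)
DICT_POINTER = 8     # extra pointer needed to reference a __dict__
EMPTY_DICT = 104     # empty instance __dict__ size quoted above

raw_mb = N * RAW / 1e6
plain_mb = N * FLOAT_SIZE / 1e6
eager_dict_mb = N * (FLOAT_SIZE + DICT_POINTER + EMPTY_DICT) / 1e6

print(raw_mb, plain_mb, eager_dict_mb)  # 40.0 120.0 680.0

# Actual float size on the interpreter running this (implementation dependent)
print(sys.getsizeof(1.0))
```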

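User-defined classes, on the other hand, have a __dict__ by default (it's the only place they store attributes by default, whether defined in __init__ or added manually by consumers of the class). They only omit it when the class and all of its parent classes define a class-level __slots__ (and only if all of them omit '__dict__' from their __slots__). That is why obj.info = info works in your example: RealisticInfoArray is a user-defined class, so the object returned by .view(cls) is an instance that has a __dict__, even though its base class np.ndarray has none.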
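The __slots__ rule can be illustrated with a small sketch. All class names and the accepts_new_attr helper are hypothetical, made up for this example:

```python
class Plain:
    def __init__(self):
        self.x = 1

class Slotted:
    __slots__ = ("x",)  # no '__dict__' entry, so instances get no __dict__

    def __init__(self):
        self.x = 1

class SlottedChild(Slotted):
    __slots__ = ()      # every class in the chain defines __slots__

class DictChild(Slotted):
    pass                # no __slots__ here, so the __dict__ comes back

def accepts_new_attr(obj):
    """Return True if an attribute that was never declared can be assigned."""
    try:
        obj.info = "metadata"
        return True
    except AttributeError:
        return False

results = {cls.__name__: accepts_new_attr(cls())
           for cls in (Plain, Slotted, SlottedChild, DictChild)}
print(results)
# {'Plain': True, 'Slotted': False, 'SlottedChild': False, 'DictChild': True}
```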

To answer your specific question "How does Python "know" np.array is implemented in C and not a Python class which you can easily add new attributes to?": at least for CPython, it doesn't check the implementation language at all. It tests whether tp_dictoffset on the instance's class is non-zero. If it's zero, that class's instances lack a __dict__ and it's not legal to add arbitrary attributes. If it's non-zero, it tells the interpreter how many bytes from the start (or end, when negative) of the instance it needs to look to find the __dict__ pointer. tp_dictoffset is initialized when the class is defined: manually in the case of C-implemented classes that want to support arbitrary attributes, and on your behalf by the interpreter machinery for user-defined classes.
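CPython exposes tp_dictoffset on type objects as the read-only attribute __dictoffset__, so the check can be observed from Python. This is a sketch that relies on that CPython implementation detail. InfoFloat and Slotted are hypothetical names, and float stands in for np.ndarray so the example needs no third-party imports. The same comparison should hold for np.ndarray versus RealisticInfoArray:

```python
class InfoFloat(float):
    """User-defined subclass of a C-implemented type."""

class Slotted:
    __slots__ = ("x",)

# Zero means instances have no __dict__
print(float.__dictoffset__)           # 0
print(Slotted.__dictoffset__)         # 0

# Non-zero means the interpreter arranged a __dict__ for instances;
# the exact value varies between CPython versions
print(InfoFloat.__dictoffset__ != 0)  # True

f = InfoFloat(1.5)
f.info = "metadata"                   # works, unlike on a plain float
print(f + 1, f.info)                  # 2.5 metadata
```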
