Creating Python modules from an extension with extra C data structures

I'm working on a custom Python loader that creates Python modules from a particular kind of non-Python file, let's call it a "cheese file". I'm writing my project as a C extension module because these "cheese files" need to be processed by a fairly complex C library (and secondarily, as a way to learn about the Python/C API).

When processing a cheese file, the C library allocates some data structures which will need to be deallocated after the Python module object is deleted. My question is, how can/should I store those C data structures along with the Python module object?

Some ideas I had:

What I thought would be the cleanest option is to subclass Python's ModuleType from C and add a field to the subclass to store the data structures. The extension types tutorial is apparently quite clear about how to subclass a built-in Python type in C, so I expected to be able to do something like this:
```
typedef struct {
    PyModuleObject module;
    struct cheese_data* data;
} PyCheeseModule;

static PyTypeObject PyCheeseModule_Type = {
    PyVarObject_HEAD_INIT(NULL, 0)
    .tp_basicsize = sizeof(PyCheeseModule),
    .tp_flags = Py_TPFLAGS_DEFAULT,
    .tp_base = &PyModule_Type,
    /* other fields */
};
```

But PyModuleObject is not exposed as part of the Python/C API, so I can't use it in this way.

I also considered dynamically allocating a PyModuleDef for each cheese module, and using PyModule_FromDefAndSpec() along with PyModule_ExecDef() to create and execute the actual module. But I'm not really sure if PyModuleDef is supposed to be used this way, since the documentation only demonstrates it with statically-defined C extension modules, and anyway I would have to deallocate the PyModuleDef object itself which kind of brings me back to the same problem.

Yet another approach is to wrap the C data structure in a Python object and just add it to the module's dict. But then there's a risk that Python code would change or unset that attribute. I suppose I can find a way to work around that, but it seems pretty inelegant.

Maybe I should just be using Cython for this project. But even so, if this problem is solvable in Cython, it should also be possible to do it using the plain old Python/C API, and at least for educational value I'd like to know how.

If necessary, I can expand on these attempts with some additional code.

Answer:

A couple of options:

Use a heap type instead of a module?

It's reasonably likely that you don't actually need a module object. Module objects have two main features - they have a __name__ and they have a modifiable namespace (i.e. a __dict__) - but both of those can be provided by any extension type.

In most of the places in Python where you'd typically expect a module, one isn't actually required: you can return any object from the module initialization process, and you can add any object to sys.modules.

The main advantage (for you) of a generic heap type over a module is that a heap type doesn't need a PyModuleDef that stays alive for its whole lifetime - the spec that it is created with is copied across by PyType_FromSpec (and related functions), so it only needs to exist for that function call.

One option would be to create a generic struct just to handle the basics of the type:

```
typedef struct {
    PyObject_HEAD  /* no trailing semicolon: the macro supplies its own */
    const char* name;
    PyObject* dict;
    struct cheese_data* data;
} BaseStruct;
```

Another variation would be to use a flexible array member as the last member (char extra_space[];) and just request extra space when you provide PyType_Spec.basicsize.
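
The flexible-array-member variation can be sketched in plain C, without the Python headers. The names `cheese_obj`, `cheese_obj_size`, `cheese_obj_new` and `cheese_obj_store` are hypothetical and only illustrate the layout; with a real heap type the struct would start with `PyObject_HEAD` rather than the placeholder field, and the value computed by `cheese_obj_size` is the kind of number you would pass as `PyType_Spec.basicsize`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the object struct. A real heap type would
 * begin with PyObject_HEAD instead of the `refcount` placeholder. */
struct cheese_obj {
    long refcount;        /* placeholder for the PyObject header */
    size_t extra_len;     /* number of bytes available in extra_space */
    char extra_space[];   /* flexible array member: must be the last member */
};

/* Total size to request for an object with `extra` trailing bytes. */
static size_t cheese_obj_size(size_t extra)
{
    return sizeof(struct cheese_obj) + extra;
}

/* Allocate a zeroed object with `extra` bytes of trailing storage. */
static struct cheese_obj *cheese_obj_new(size_t extra)
{
    if (extra > SIZE_MAX - sizeof(struct cheese_obj))
        return NULL;                      /* requested size would overflow */
    struct cheese_obj *obj = calloc(1, cheese_obj_size(extra));
    if (obj == NULL)
        return NULL;
    obj->refcount = 1;
    obj->extra_len = extra;
    return obj;
}

/* Copy n bytes into the trailing storage; 0 on success, -1 if it won't fit. */
static int cheese_obj_store(struct cheese_obj *obj, const void *src, size_t n)
{
    assert(obj != NULL);
    if (n > obj->extra_len)
        return -1;
    memcpy(obj->extra_space, src, n);
    return 0;
}
```

The fixed fields and the extra space live in a single allocation, so one `free()` (or, for a real type, the normal dealloc path) releases both.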

You can handle the initialization in the Py_tp_new slot and the cleanup in the Py_tp_dealloc slot (as usual).

Finally, you'd provide PyMemberDefs for __name__ and for the location of __dict__ (through the special __dictoffset__ member) - something like:

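```
#include <stddef.h>        /* offsetof */
#include <structmember.h>  /* T_PYSSIZET, T_STRING, READONLY */

static PyMemberDef cheese_members[] = {
    /* special member: tells PyType_FromSpec where the instance dict lives */
    {"__dictoffset__", T_PYSSIZET, offsetof(BaseStruct, dict), READONLY},
    {"__name__", T_STRING, offsetof(BaseStruct, name), READONLY},
    {NULL}  /* Sentinel */
};
```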

This should have most of the behaviour of a module but few of the restrictions.

Use a module

The module state is designed for storing this kind of extra information. PEP 3121 gives an example of setting up a module state that I'll attempt to summarize here. I don't think you can use the example directly because I think the order of the fields has changed slightly since the PEP!

You create a struct that you use for the module state:

```
typedef struct {
    struct cheese_data* data;
    /* alternatively cheese_data might just be part of the struct
     * rather than dynamically allocated in addition to the struct
     */
} MyModState;
```

In the PyModuleDef you specify the m_size option, typically as sizeof(MyModState). The complication here is that you can't use a static PyModuleDef because you're aiming to create these modules dynamically:

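```
PyModuleDef *mod_def = malloc(sizeof(PyModuleDef));
/* handle allocation failure here, e.g. with PyErr_NoMemory() */
struct PyModuleDef mod_def_tmp = {
    PyModuleDef_HEAD_INIT,
    .m_name = "some name",
    .m_size = sizeof(MyModState),
    /* etc */
};
*mod_def = mod_def_tmp;  /* copy the initialized template into the heap copy */
```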

Initializing the state structure goes in the Py_mod_exec slot (assuming you're using multi-phase initialization):

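```
static int
my_module_exec(PyObject *m) {
    MyModState* state = (MyModState*)PyModule_GetState(m);
    state->data = malloc(sizeof(struct cheese_data));
    if (state->data == NULL) {
        PyErr_NoMemory();
        return -1;  /* failure, with an exception set */
    }
    /* etc */
    return 0;  /* success */
}
```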

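Cleanup of the state is handled by specifying an m_free function in the PyModuleDef. Note that m_free has the type freefunc, so it takes a void* rather than a PyObject*:

```
static void free_module(void* m) {
    MyModState* state = (MyModState*)PyModule_GetState((PyObject*)m);
    if (state != NULL) {  /* defensive: state may not have been allocated */
        free(state->data);
    }
}
```

```
/* in the PyModuleDef */
    .m_free = free_module,
/* ... */
```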

The final issue is to deallocate the PyModuleDef so that it doesn't leak memory. Looking into the Python source code you see that the last use for the PyModuleDef object is looking up m_free and calling that.

Therefore, I believe it should be safe to also deallocate the module def within m_free, by calling free() on the pointer returned by PyModule_GetDef() for the module (and also to deallocate any parts of it which were dynamically allocated, for example the name strings). However - this seems like a hacky solution so I wouldn't treat it as completely future-proof for all versions of Python.
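
The order of the teardown steps can be illustrated without the Python headers. This is only a sketch of the ownership pattern: `fake_def` and `fake_module` are hypothetical stand-ins for `PyModuleDef` and the module object, `fake_module_destroy` imitates the interpreter calling `m_free` as its last use of the definition, and `live_allocations` exists purely so the demo can check that nothing leaks. None of these names are part of CPython.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct cheese_data { int wheels; };

struct fake_module;
struct fake_def {                              /* stand-in for PyModuleDef */
    char *name;                                /* heap-allocated, owned by the def */
    void (*free_fn)(struct fake_module *);     /* plays the role of m_free */
};
struct fake_module {                           /* stand-in for the module object */
    struct fake_def *def;
    struct cheese_data *state;
};

static int live_allocations = 0;               /* outstanding blocks, demo only */

static void *demo_alloc(size_t n)
{
    void *p = malloc(n);
    if (p != NULL)
        live_allocations++;
    return p;
}

static void demo_free(void *p)
{
    if (p != NULL) {
        live_allocations--;
        free(p);
    }
}

/* The free callback: release the state, then the def's parts, then the def. */
static void cheese_free(struct fake_module *m)
{
    struct fake_def *def = m->def;
    demo_free(m->state);      /* 1. per-module C data */
    m->state = NULL;
    demo_free(def->name);     /* 2. dynamically allocated parts of the def */
    demo_free(def);           /* 3. the def itself, last */
    m->def = NULL;
}

static struct fake_module *fake_module_create(const char *name)
{
    struct fake_module *m = demo_alloc(sizeof *m);
    struct fake_def *def = demo_alloc(sizeof *def);
    char *name_copy = demo_alloc(strlen(name) + 1);
    struct cheese_data *state = demo_alloc(sizeof *state);
    if (!m || !def || !name_copy || !state) {
        demo_free(state);
        demo_free(name_copy);
        demo_free(def);
        demo_free(m);
        return NULL;
    }
    strcpy(name_copy, name);
    def->name = name_copy;
    def->free_fn = cheese_free;
    state->wheels = 0;
    m->def = def;
    m->state = state;
    return m;
}

/* Imitates module deallocation: the callback is the last use of the def. */
static void fake_module_destroy(struct fake_module *m)
{
    assert(m != NULL && m->def != NULL);
    m->def->free_fn(m);
    demo_free(m);
}
```

The point of the sketch is that the callback may free the definition it was reached through only because nothing touches the definition afterwards. In the real API that is an implementation detail of the interpreter rather than a documented guarantee.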