Canonical way to convert an array of strings in C to a Python list using Cython

I'm using Cython to interface a C library with Python. A library function returns an array of null-terminated strings with type char** and I want to convert this to a Python list of str. The following code works, but it seems fragile and clunky and I wonder if there is a simpler way to do it:

# myfile.pyx

from cython.operator import dereference

def results_from_c():
    cdef char** cstringsptr = my_c_function()

    strings = []

    string = dereference(cstringsptr)
    while string != NULL:
        strings.append(string.decode())
        cstringsptr += 1
        string = dereference(cstringsptr)

    return strings

In particular, is it OK to get the next string in the array with cstringsptr += 1, like one would do in C with e.g. cstringsptr++;? Is this in general a robust way to convert arrays to lists? What if, for example, memory allocation fails, or the array is not null-terminated and the loop runs forever? It seems to me like there should be a simpler way to do this with Cython.

CodePudding user response：

If you are working with a valid C data structure, the strings will be null-terminated. The question is, how is the array of string pointers terminated? Either the library (or my_c_function()) ensures that there is a NULL after the last string pointer, or it makes the array length available in some other way. Make sure you know which it is, and don't make your loop terminate on a null pointer unless you are guaranteed that there will be one.
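To make the two conventions concrete on the C side, here is a minimal sketch. The names count_strings, sentinel_terminated, counted and counted_len are hypothetical and are not part of the asker's library; they only illustrate what "terminated by NULL" versus "length supplied separately" looks like.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the entries of a NULL-terminated array of
 * strings. This is only safe if the producer guarantees a NULL pointer
 * after the last string pointer. */
size_t count_strings(char **strings)
{
    size_t n = 0;

    /* A failed call (e.g. out of memory) may return NULL for the whole
     * array, so check before dereferencing. */
    assert(strings != NULL);

    while (strings[n] != NULL)
        n++;
    return n;
}

/* Hypothetical data using the sentinel convention: the last slot is NULL. */
char *sentinel_terminated[] = { "alpha", "beta", "gamma", NULL };

/* Hypothetical data using the explicit-length convention: there is no
 * sentinel, so the length has to travel alongside the array. */
char *counted[] = { "alpha", "beta", "gamma" };
const size_t counted_len = sizeof counted / sizeof counted[0];
```

A Cython loop that stops at a NULL pointer is only valid for the first convention. With the second one, the loop has to run over the known length instead, because reading past the last element is undefined behaviour.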

CodePudding user response：

To complement @alexis's answer: in terms of performance, repeatedly calling append is relatively slow (the list is a dynamic array that has to grow as items are added), and it can be replaced by direct indexing into a preallocated list. The idea is to walk the array twice: a first pass counts the strings, and a second pass fills the list. Two passes may look expensive, but this should not be the case in practice, since the counting pass only reads pointers and the compiler should be able to optimize it. If the code is compiled with the highest optimization level (-O3), the first loop may even use fast SIMD instructions. Once the length is known, the list can be allocated once and filled by index, which is faster. String decoding is likely to take a significant part of the time. UTF-8 decoding is used by default, which is somewhat expensive; decoding as ASCII instead (string.decode('ascii')) should be a bit faster, provided the strings are known to contain only ASCII characters.

Here is an example of untested code:

from cython.operator import dereference

def results_from_c():
    cdef char** cstringsptr = my_c_function()
    cdef char** cstringsptr2 = cstringsptr
    cdef int length = 0
    cdef int i

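    # First pass: count the strings up to the terminating NULL pointer
    string = dereference(cstringsptr)
    while string != NULL:
        length += 1
        cstringsptr += 1
        string = dereference(cstringsptr)

    # Rewind to the start of the array
    cstringsptr -= length

    # Preallocate the list; every slot initially refers to None
    strings = [None] * length

    # Second pass: decode each C string and store it by index
    for i in range(length):
        string = dereference(cstringsptr + i)
        strings[i] = string.decode()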

    return strings

This makes the code more complex though.