How to handle Python set object operations in C?

Time:02-11

I want to call C code from Python. The C code contains the following:

  • test.h
#include<Python.h>

PyObject *getFeature(wchar_t *text,
                     PyObject *unigram);
// where the unigram is a Set Object with type 'PySetObject'
  • test.c
#include "test.h"

PyObject *getFeature(wchar_t *text,
                     PyObject *unigram)
{
    int ret = -1;
    PyObject *featureList = PyList_New(0);

    PyObject *curString = PyUnicode_FromWideChar(text, 2);
    ret = PySet_Contains(unigram, curString);
    printf("## res: `nc`, %d.\n", ret);
    ret = PyList_Append(featureList, curString);

    return featureList;
}

I then compiled it into a shared library called libtest.so, which I load from Python with ctypes as shown below:

  • test.py
import ctypes
import os

dir_path = 'path/to/the/libtest.so'
feature_extractor = ctypes.cdll.LoadLibrary(
    os.path.join(dir_path, 'libtest.so'))
get_feature_c = feature_extractor.getFeature
get_feature_c.argtypes = [
    ctypes.c_wchar_p, ctypes.py_object]
get_feature_c.restype = ctypes.py_object

unigram = {'据','nc', 'kls'}
print(hash('据'))
print(hash('nc'))
print(hash('kls'))
res = get_feature_c('nc', unigram)


Running test.py produces the following output, ending in a crash:

6875335301337518411
6875335301337518411
-5567445891360670268
Segmentation fault

I believe the bug is caused by a collision between the two different strings 'nc' and '据', which have the same hash value 6875335301337518411. Python uses a secondary-level hash table to handle strings with the same hash value.

So how can I solve this problem and make that secondary collision hash table available to the C code?

Answer:

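The hash match is a red herring. The real problem is that the library is loaded with ctypes.cdll (a CDLL), which releases the GIL for the duration of each foreign call, while CPython API functions such as PySet_Contains and PyList_New must be called with the GIL held. Load the library with ctypes.PyDLL instead, which keeps the GIL held during the call.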
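To illustrate why equal hashes are harmless on their own, here is a minimal sketch in pure Python. The `Key` class is hypothetical (it is not part of the question's code) and deliberately gives every instance the same hash, so all members of the set collide. The set still tells them apart, because lookups fall back on `__eq__` after the hash matches.

```python
class Key:
    """Hypothetical key type whose instances all share one hash value."""

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        # Force a collision between every pair of keys.
        return 42

    def __eq__(self, other):
        return isinstance(other, Key) and self.name == other.name


s = {Key('nc'), Key('据')}

# Both colliding keys are stored as distinct members.
print(len(s))
# Membership is decided by equality once the hashes match.
print(Key('nc') in s, Key('kls') in s)
```

Nothing extra has to be passed to C for this to work: PySet_Contains performs the same collision handling internally.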

test.c

#include <Python.h>

#ifdef _WIN32
#   define API __declspec(dllexport)
#else
#   define API
#endif

API PyObject *getFeature(wchar_t *text, PyObject *unigram)
{
    int ret = -1;
    PyObject *featureList = PyList_New(0);

    PyObject *curString = PyUnicode_FromWideChar(text, 2);
    ret = PySet_Contains(unigram, curString);
    printf("## res: `nc`, %d.\n", ret);
    ret = PyList_Append(featureList, curString);
    Py_DECREF(curString); // fix reference leak
    return featureList;
}

test.py

import ctypes as ct

dll = ct.PyDLL('./test') # Use PyDLL so GIL is held
dll.getFeature.argtypes = ct.c_wchar_p, ct.py_object
dll.getFeature.restype = ct.py_object

unigram = {'据','nc', 'kls'}
print(hash('据'))
print(hash('nc'))
print(hash('kls'))
print(dll.getFeature('nc', unigram))

Output:

5393181648594783828
5393181648594783828
-5015907635941537187
## res: `nc`, 1.
['nc']
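
The effect of PyDLL can also be tried without compiling anything. The sketch below assumes a standard CPython build in which `PySet_Contains` is reachable through `ctypes.pythonapi`, which the ctypes documentation describes as a PyDLL instance exposing the Python C API. The name `set_contains` is just a local alias chosen for illustration.

```python
import ctypes

# ctypes.pythonapi is a PyDLL, so the GIL stays held during these calls.
set_contains = ctypes.pythonapi.PySet_Contains
set_contains.argtypes = [ctypes.py_object, ctypes.py_object]
set_contains.restype = ctypes.c_int

unigram = {'据', 'nc', 'kls'}

found = set_contains(unigram, 'nc')     # 1 when the key is present
missing = set_contains(unigram, 'xyz')  # 0 when the key is absent
print(found, missing)
```

A library loaded through ctypes.cdll would run the same C API call with the GIL released, which is the situation that crashed in the question.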