Pandas to_records() dtype conversion to char / unicode issue

Pandas to_records() raises an error, while numpy.array behaves as expected.

import numpy
import pandas as pd

data = [('myID', 5), ('myID', 10)]
myDtype = numpy.dtype([('myID', numpy.str_, 4),
                       ('length', numpy.uint16)])

This works:

arr = numpy.array(data, dtype=myDtype)
output: [('myID',  5) ('myID', 10)]

This does not work:

df = pd.DataFrame(data)
df = df.to_records(index=False, column_dtypes=myDtype)

ValueError: invalid literal for int() with base 10: 'myID'

What am I doing wrong with pandas to_records()?

CodePudding user response：

Ok so from what I understand, the way you wrote your variable myDtype isn't compatible with the column names your dataframe has.

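Your DataFrame is created without column names, so its columns are labeled with the integers 0 and 1, and neither label matches the field names "myID" and "length" in your dtype. On top of that, a column_dtypes value that is not a dictionary is applied to every column as a whole, so pandas ends up trying to fit the string column into the uint16 field, which appears to be what raises the ValueError. (I'm not entirely sure about the internals, so someone might want to add to this; I'll edit the answer.)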
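
To see the mismatch concretely, here is a quick check (a sketch, assuming numpy and pandas are installed; the variable names are only for illustration) that compares the labels pandas assigns by default with the field names in the dtype:

```python
import numpy as np
import pandas as pd

data = [("myID", 5), ("myID", 10)]
my_dtype = np.dtype([("myID", np.str_, 4), ("length", np.uint16)])

df = pd.DataFrame(data)

# Without a columns= argument, pandas labels the columns 0, 1, ...
default_labels = list(df.columns)

# The structured dtype, by contrast, names its fields explicitly.
field_names = my_dtype.names

print(default_labels)  # [0, 1]
print(field_names)     # ('myID', 'length')
```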

I was able to remove the error by naming the columns and passing column_dtypes as a dictionary:

    data = [("myID", 5), ("myID", 10)]
    myDtype = numpy.dtype([('myID', numpy.str_, 4),
                           ('length', numpy.uint16)])
    # Name the columns so that the keys in column_dtypes can match them.
    df = pd.DataFrame(data, columns=["myID", "length"])
    df_records = df.to_records(index=False, column_dtypes={"myID": "<U4", "length": "<u2"})

With the following result:

rec.array([('myID',  5), ('myID', 10)],
          dtype=[('myID', '<U4'), ('length', '<u2')])
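
If you would rather not type the dtype strings by hand, the dictionary can probably be derived from the structured dtype itself. The following is a minimal sketch, not a definitive recipe: the helper name dtype_to_mapping is hypothetical, and it assumes the DataFrame's column names are the same as the dtype's field names.

```python
import numpy as np
import pandas as pd


def dtype_to_mapping(structured_dtype):
    """Turn a structured numpy dtype into a {field name: field dtype} dict."""
    # Indexing a structured dtype by field name returns that field's dtype.
    # (structured_dtype.fields maps names to (dtype, offset) tuples instead.)
    return {name: structured_dtype[name] for name in structured_dtype.names}


data = [("myID", 5), ("myID", 10)]
my_dtype = np.dtype([("myID", np.str_, 4), ("length", np.uint16)])

# Column names must match the dtype's field names for the mapping to apply.
df = pd.DataFrame(data, columns=list(my_dtype.names))
records = df.to_records(index=False, column_dtypes=dtype_to_mapping(my_dtype))
```

This keeps a single source of truth for the field types: changing my_dtype changes both the column names and the dtypes passed to to_records().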

CodePudding user response：

The column_dtypes argument of to_records() expects either a single type (applied to all columns) or a dict mapping column names or positions to types. You are passing myDtype, which is a structured numpy.dtype rather than a dict.

Try this; it runs without the error:

df = pd.DataFrame(data)
df_rec = df.to_records(index=False, column_dtypes=myDtype.fields)

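This no longer raises, but note the dtypes in the output: the columns are labeled 0 and 1, so none of the keys in myDtype.fields ('myID' and 'length') match, and pandas keeps the default dtypes: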

>>> df_rec
rec.array([('myID',  5), ('myID', 10)],
          dtype=[('0', 'O'), ('1', '<i8')])
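
The dtypes shown above are O and <i8, not the <U4 and <u2 that were requested. If you want to keep the unnamed DataFrame, one option is to key the mapping by the integer column labels, since the pandas documentation describes column_dtypes as accepting column names or zero-indexed positions. A minimal sketch, assuming numpy and pandas are installed:

```python
import pandas as pd

data = [("myID", 5), ("myID", 10)]

# No column names are given, so the columns are labeled 0 and 1.
df = pd.DataFrame(data)

# Keys are the integer column labels; values are numpy dtype strings.
records = df.to_records(index=False, column_dtypes={0: "<U4", 1: "<u2"})
```

The record fields are then named '0' and '1', so naming the DataFrame columns first is likely the better choice if you need the 'myID' and 'length' field names in the result.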