UnicodeEncodeError during NumPy structured array creation

Time:06-26

I need to store some followthemoney.proxy.EntityProxy objects (an EntityProxy is a legal-entity object, e.g. a company) together with their ids in a NumPy structured array. So I defined the following dtype (a company id looks like 0000012345):

company_dtype = [('cid', 'S10'), ('company', 'object')]

An EntityProxy object has a name attribute (a Unicode string of arbitrary size). As long as this attribute contains only Basic Latin letters, the array is created successfully. But if the company name contains Latin Extended-A characters (e.g. the name is "Letter Ł"), a UnicodeEncodeError is raised:

test1 = np.array(['0000098765', company1], dtype=company_dtype)
UnicodeEncodeError: 'ascii' codec can't encode character '\u0141' in position 7:
ordinal not in range(128)

The __repr__ method of an EntityProxy object returns a string that combines a hexadecimal id string (not the 'cid' from the dtype above) and the name:

# <E('4922bf8e06bf2cbc6a65d3a36b6dd8211a995a33','Letter Ł')>
def __repr__(self) -> str:
    return "<E(%r,%r)>" % (self.id, str(self))

I tried to change structured array dtype to:

 company_dtype = [('cid', 'S10'), ('company', 'U')]

but had the same error.

But when I try to create an ordinary (non-structured) NumPy array with the same object, it works!

# in the array, company1 is
# [<E('4922bf8e06bf2cbc6a65d3a36b6dd8211a995a33','Letter Ł')>]
test3 = np.array([company1])

Full code (variable values in comments taken from PyCharm IDE debug):

#!/usr/bin/env python
# coding: utf-8

import numpy as np
from followthemoney import model

schema = model.get("Company")

# creation of the followthemoney.proxy EntityProxy objects
# company with  Latin Extended A Unicode symbol in its name
company1 = model.make_entity(schema)
company1.make_id(['id 1'])
company1.add("name", "Letter Ł")
# Basic Latin
company2 = model.make_entity(schema)
company2.make_id(['id 2'])
company2.add("name", "Letter A")


if __name__ == "__main__":
    company_dtype = [('cid', 'S10'), ('company', 'O')]
    try:
        test1 = np.array(['0000098765', company1], dtype=company_dtype)
    except UnicodeEncodeError:
        # UnicodeEncodeError: 'ascii' codec can't encode character '\u0141' in
        # position 7: ordinal not in range(128)
        pass
    # test 2 is
    # [(b'0000012345', '0000012345'), (b'Letter A', <E('6352b36388511435d0395100145bde280663cfe2','Letter A')>)]
    test2 = np.array(['0000012345', company2], dtype=company_dtype)
    # for test3 and test4, its dtype is `object`
    test3 = np.array([company1])  # [<E('4922bf8e06bf2cbc6a65d3a36b6dd8211a995a33','Letter Ł')>]
    test4 = np.array([company2])  # [<E('6352b36388511435d0395100145bde280663cfe2','Letter A')>]

Questions:

  1. Why does creating the structured NumPy array with dtype="object" fail, while creating an ordinary NumPy array succeeds for the same object?
  2. What is the correct dtype for the structured NumPy array in this case?

CodePudding user response:

Try:

 company_dtype = [('cid', 'U10'), ('company', object)]

Now 'cid' can store a unicode string, while the 'company' field will still be the company object.

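The UnicodeEncodeError came from the 'S10' dtype. Because the values are passed as a flat list rather than a tuple, NumPy treats each list item as a separate record and assigns it to every field (the test2 output in the question shows exactly this). For company1, that means str(company1), which is "Letter Ł", has to be encoded as ASCII bytes for the 'S10' field, and that fails.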
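
If the intent is a single record holding both the id and the object, passing the pair as a tuple may be closer to what is wanted, since NumPy maps tuple items to fields one by one. A minimal sketch, assuming a hypothetical Company class as a stand-in for EntityProxy (it is not part of followthemoney, it only mimics an object whose str() is the name):

```python
import numpy as np


class Company:
    """Hypothetical stand-in for EntityProxy; only str() matters here."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name


company_dtype = [('cid', 'S10'), ('company', object)]
company1 = Company("Letter Ł")

# A tuple is one record: the first item goes to 'cid', the second to 'company'.
# The object is stored as-is, so its name is never encoded to ASCII.
records = np.array([('0000098765', company1)], dtype=company_dtype)

print(records.shape)
print(records['cid'][0])
print(records['company'][0] is company1)
```

With this form the 'S10' field only ever sees the id string, so it should work whether 'cid' is declared as 'S10' or 'U10'.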
