I need to store some followthemoney.proxy.EntityProxy objects (in other words, an EntityProxy is a legal entity object, e.g. a company) together with their ids in a NumPy structured array. So I defined the following dtype (a company id looks like 0000012345):
company_dtype = [('cid', 'S10'), ('company', 'object')]
An EntityProxy object has a name attribute (a Unicode string of arbitrary size). As long as this attribute contains only Basic Latin letters, the array is created successfully. But if the company name contains Latin Extended-A characters (e.g. the name is "Letter Ł"), a UnicodeEncodeError exception is raised:
test1 = np.array(['0000098765', company1], dtype=company_dtype)
UnicodeEncodeError: 'ascii' codec can't encode character '\u0141' in position 7:
ordinal not in range(128)
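The same message can be reproduced without NumPy. This is only a minimal sketch, assuming the failing step is an ASCII encoding of the entity's string form ("Letter Ł"), which is consistent with the reported position 7:

```python
# Assumption: the failing step encodes the string form of the entity,
# i.e. "Letter Ł", with the ASCII codec.
name = "Letter Ł"

try:
    name.encode("ascii")
    failed_at = None
except UnicodeEncodeError as err:
    # err.start is the index of the first character that cannot be encoded.
    failed_at = err.start

print(failed_at)  # 7, the index of 'Ł' in "Letter Ł"
```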
The __repr__ method of the EntityProxy object returns a string that combines a hexadecimal string (not the 'cid' from the dtype above) and the name:
# <E('4922bf8e06bf2cbc6a65d3a36b6dd8211a995a33','Letter Ł')>
def __repr__(self) -> str:
    return "<E(%r,%r)>" % (self.id, str(self))
I tried changing the structured array dtype to:
company_dtype = [('cid', 'S10'), ('company', 'U')]
but got the same error.
But when I create an ordinary (non-structured) NumPy array with the same object, it works!
# in the array, company1 is
# [<E('4922bf8e06bf2cbc6a65d3a36b6dd8211a995a33','Letter Ł')>]
test3 = np.array([company1])
Full code (variable values in comments taken from PyCharm IDE debug):
#!/usr/bin/env python
# coding: utf-8
import numpy as np
from followthemoney import model
schema = model.get("Company")
# creation of the followthemoney.proxy EntityProxy objects
# company with Latin Extended A Unicode symbol in its name
company1 = model.make_entity(schema)
company1.make_id(['id 1'])
company1.add("name", "Letter Ł")
# Basic Latin
company2 = model.make_entity(schema)
company2.make_id(['id 2'])
company2.add("name", "Letter A")
if __name__ == "__main__":
    company_dtype = [('cid', 'S10'), ('company', 'O')]
    try:
        test1 = np.array(['0000098765', company1], dtype=company_dtype)
    except UnicodeEncodeError:
        # UnicodeEncodeError: 'ascii' codec can't encode character '\u0141' in
        # position 7: ordinal not in range(128)
        pass
    # test2 is
    # [(b'0000012345', '0000012345'), (b'Letter A', <E('6352b36388511435d0395100145bde280663cfe2','Letter A')>)]
    test2 = np.array(['0000012345', company2], dtype=company_dtype)
    # for test3 and test4, the dtype is `object`
    test3 = np.array([company1])  # [<E('4922bf8e06bf2cbc6a65d3a36b6dd8211a995a33','Letter Ł')>]
    test4 = np.array([company2])  # [<E('6352b36388511435d0395100145bde280663cfe2','Letter A')>]
Questions:
- Why does creating the structured NumPy array with dtype="object" fail, while creating an ordinary NumPy array succeeds for the same object?
- Which dtype is correct for the structured NumPy array in this case?
Answer:
Try:
company_dtype = [('cid', 'U10'), ('company', object)]
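Now 'cid' can store a Unicode string, while the 'company' field still holds the company object.
The UnicodeEncodeError came from the 'S10' dtype: it is a byte string field, so NumPy has to encode any str value it receives as ASCII, and 'Ł' is outside the ASCII range.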
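One further point, offered with some caution: the test2 output in the question shows two records, each with one value copied into both fields, which suggests NumPy treated the list as two separate elements and not as a single record. Records of a structured array are normally written as tuples. A minimal sketch, using a hypothetical Company class as a stand-in for EntityProxy so that it runs without followthemoney:

```python
import numpy as np


class Company:
    """Hypothetical stand-in for EntityProxy (not the real followthemoney class)."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name

    def __repr__(self):
        return "<Company(%r)>" % self.name


company_dtype = [('cid', 'U10'), ('company', object)]
company = Company("Letter Ł")

# One record = one tuple; the outer list holds the records.
records = np.array([('0000098765', company)], dtype=company_dtype)

print(records.shape)
print(records['cid'][0])
print(records['company'][0] is company)
```

Here records has a single element, records['cid'][0] is the id string, and records['company'][0] is the original object and not a converted copy.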