''' Set up '''
s= open("Bilion_of_UTF-8_chars.txt",encoding="UTF-8").read()
'''
The following doesn't look like a cheap operation
because Python3 `str`-s are UTF-8 encoded (EDIT: they're not).
'''
my_char = s[453_452_345]
However, many people write loops like this:
for i in range(len(s)):
    do_something_with(s[i])
performing the indexing operation n times or more.
How does Python 3 resolve the problem of indexing UTF-8 characters in strings for both code snippets?
- Does it always perform a linear look-up for the nth char (a simple but expensive resolution)?
- Or does it store some additional C pointers to perform smart index calculations?
CodePudding user response:
What is the computational complexity of Python 3 str.__getitem__?
R: O(1)
Python strings are not UTF-8 internally: in Python 3, when getting text from any external source, the text is decoded according to a given codec. This decoding defaults to UTF-8 on most sources/platforms, but varies according to the OS's defaults; in any case, all relevant "text importing" APIs, like opening a file or connecting to a DB, allow you to specify the text encoding to use.
Internally, strings use one of "Latin-1", "UCS-2" or "UCS-4", according to the needs of the "widest" code point in the text string.
This is the case from Python 3.3 onwards (prior to that, the internal representation was a fixed-width UCS-2 or UCS-4, depending on the build, even for ASCII-only text). The spec is documented in PEP 393.
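A quick way to see this flexible representation (assuming CPython; the exact byte counts include a small per-object header and vary between versions, so treat them as illustrative):
import sys

ascii_s = "a" * 1000            # all code points fit in Latin-1: ~1 byte each
bmp_s = "\u0424" * 1000         # Cyrillic needs UCS-2: ~2 bytes each
astral_s = "\U0001F600" * 1000  # emoji needs UCS-4: ~4 bytes each

for name, s in [("latin-1", ascii_s), ("ucs-2", bmp_s), ("ucs-4", astral_s)]:
    print(name, len(s), sys.getsizeof(s))
The reported sizes grow roughly as 1000, 2000 and 4000 bytes plus a small fixed overhead, even though all three strings have the same length.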
Therefore, Python can just zero in on the correct character given its index.
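A rough (and deliberately unscientific) way to check that the access cost does not depend on the position, assuming CPython:
import timeit

s = "x" * 100_000_000  # a large ASCII-only string (1 byte per char internally)

# Indexing near the start and near the end takes about the same time,
# which is consistent with O(1) random access into a fixed-width array.
print(timeit.timeit("s[10]", globals={"s": s}, number=1_000_000))
print(timeit.timeit("s[99_999_990]", globals={"s": s}, number=1_000_000))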
As an anecdote, Luciano Ramalho (author of the Fluent Python book) wrote LeanStr, a learning-purpose implementation of a string class that holds UTF-8 internally. Of course, then your worries about __getitem__ complexity apply: https://github.com/ramalho/leanstr
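To make the trade-off concrete, here is a minimal sketch of my own (not LeanStr's actual code) of a UTF-8-backed string whose __getitem__ has to walk the bytes from the start, i.e. O(n) per access:
class Utf8Str:
    """Stores the text as UTF-8 bytes; indexing is O(n) because we must
    scan from the start to find where the n-th code point begins."""

    def __init__(self, text: str):
        self._data = text.encode("utf-8")

    def __len__(self):
        # Code points = bytes that are not UTF-8 continuation bytes (0b10xxxxxx).
        return sum(1 for b in self._data if b & 0xC0 != 0x80)

    def __getitem__(self, index: int) -> str:
        count = -1      # how many code point starts we have passed, minus one
        start = None    # byte offset where the target code point begins
        for pos, b in enumerate(self._data):
            if b & 0xC0 != 0x80:        # lead byte of a new code point
                if count == index:      # the previous code point was the target
                    return self._data[start:pos].decode("utf-8")
                count += 1
                if count == index:
                    start = pos
        if count == index and start is not None:
            return self._data[start:].decode("utf-8")
        raise IndexError("string index out of range")

print(Utf8Str("não")[2])  # -> "o"
Negative indices and slicing are left out for brevity; the point is only the linear scan per access.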
Unfortunately (or fortunately, in this case), a lot of the standard library and native code extensions to Python will not accept a class similar to str, even if it inherits from str and keeps its data separately, re-implementing all dunder methods. But if all str methods are in place, any pure-Python code dealing with strings should accept a LeanStr instance.
Other implementations: PyPy
So, it happens that how text is stored internally is an "implementation detail", and PyPy, from version 7.1 onwards, does use UTF-8 byte strings internally for its text objects.
Unlike Ramalho's naive LeanStr above, however, they do keep an index for every 4th UTF-8 char, so that char access by index can still be done in O(1). I did not find any docs about it, but the code for creating the index is here.
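The idea can be sketched like this (a toy illustration of the checkpoint scheme, not PyPy's actual code; the spacing of one checkpoint per 4 code points is taken from the description above):
K = 4  # assumed checkpoint spacing, in code points

def build_index(data: bytes) -> list[int]:
    # Byte offsets of code points 0, K, 2K, ... in a UTF-8 byte string.
    offsets = []
    count = 0
    for pos, b in enumerate(data):
        if b & 0xC0 != 0x80:        # lead byte of a code point
            if count % K == 0:
                offsets.append(pos)
            count += 1
    return offsets

def char_at(data: bytes, offsets: list[int], index: int) -> str:
    pos = offsets[index // K]       # O(1) jump to the nearest checkpoint
    for _ in range(index % K):      # then skip at most K-1 code points
        pos += 1
        while data[pos] & 0xC0 == 0x80:   # skip continuation bytes
            pos += 1
    end = pos + 1                   # find where this code point's bytes end
    while end < len(data) and data[end] & 0xC0 == 0x80:
        end += 1
    return data[pos:end].decode("utf-8")

text = "naïve café 😀 string"
data = text.encode("utf-8")
offsets = build_index(data)
print(char_at(data, offsets, 11), text[11])  # both print the emoji
Building the index is a one-time O(n) pass at construction time, after which each lookup scans a bounded number of bytes, i.e. O(1) per access.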