I'm using the built-in str.isidentifier() method to find the Unicode characters allowed in variable names (I know about XID_Start and XID_Continue characters, so I don't need an explanation of those). The following program gives inconsistent results on different systems. I'm confused and would like to understand why.
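chars = []
for code_point in range(0x110000):
    char = chr(code_point)
    # Accept characters valid at the start of an identifier or after its first character.
    if char.isidentifier() or ('a' + char).isidentifier():
        chars.append(char)
print(len(chars))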
Running the program in PyCharm gives me 134415, but running it on repl.it gives me 128770. My Python version is 3.9.7, while repl.it's is 3.8.12. All I was able to find was the isidentifier() documentation, which points to PEP 3131, the standard used in Python 3. But both environments use the same major Python version; only the minor version differs. Searching for a changelog for the function also turned up nothing. I hope you can help me resolve this!
CodePudding user response:
They're using different versions of the Unicode data.
Try adding this to your script:
import unicodedata
print(unicodedata.unidata_version)
For me, repl.it was using version 12.1.0, and my Python 3.9.9 on macOS 12.3 was using version 13.0.0.
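To see the count and the Unicode data version side by side in one run, here is a minimal sketch. The function name count_identifier_chars is my own, not anything from the standard library; it only repeats the check from the question.

```python
import sys
import unicodedata


def count_identifier_chars():
    """Count code points accepted at the start of or inside an identifier."""
    count = 0
    for code_point in range(sys.maxunicode + 1):
        char = chr(code_point)
        # Characters valid as a first character pass on their own; characters
        # valid only in later positions pass when placed after 'a'.
        if char.isidentifier() or ("a" + char).isidentifier():
            count += 1
    return count


if __name__ == "__main__":
    print("Python:", sys.version.split()[0])
    print("Unicode data:", unicodedata.unidata_version)
    print("Identifier characters:", count_identifier_chars())
```

If two environments report different Unicode data versions, a different count is expected.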
The PEP you link to says that the characters depend on the DerivedCoreProperties.txt file in the Unicode version used by Python:
"The exact specification of what characters have the XID_Start or XID_Continue properties can be found in the DerivedCoreProperties file of the Unicode data in use by Python."
This matches what the unicodedata module says in its docs.
For Python 3.8:
"The data contained in this database is compiled from the UCD version 12.1.0."
For Python 3.9:
"The data contained in this database is compiled from the UCD version 13.0.0."
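If you want to see exactly which characters account for the difference, one option is to dump the accepted code points to a file in each environment and compare the files. This is only a sketch: the helper names and the file format (one hexadecimal code point per line) are my own choices.

```python
import sys


def identifier_code_points():
    """Return the set of code points accepted somewhere in an identifier."""
    return {
        cp
        for cp in range(sys.maxunicode + 1)
        if chr(cp).isidentifier() or ("a" + chr(cp)).isidentifier()
    }


def save_code_points(path):
    """Write the accepted code points to path, one hex value per line."""
    with open(path, "w", encoding="ascii") as f:
        for cp in sorted(identifier_code_points()):
            f.write(f"{cp:04X}\n")


def load_code_points(path):
    """Read a file written by save_code_points back into a set of ints."""
    with open(path, encoding="ascii") as f:
        return {int(line, 16) for line in f if line.strip()}
```

Call save_code_points() with a different file name in each environment, copy both files to one machine, and then load_code_points(newer) - load_code_points(older) gives the code points that only the newer Unicode data accepts.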