How to generate all possible unicode characters?

Time:03-24

If we type letters in R, we get all lowercase letters of the English alphabet. However, there are many more possible characters, such as ä, é and so on, as well as symbols like $ or (. I found this table of Unicode characters, which is exactly what I need. Of course, I do not want to copy and paste hundreds of Unicode characters into one vector.

What I've tried so far: The table gives the decimal values for (some of) the Unicode characters. For example, see the following small table:

Glyph    Decimal    Unicode    Usage in R
!        33         U+0021     "\U0021"

So if we type "\U0021" we get a !. Further, paste0("U", format(as.hexmode(33), width= 4, flag="0")) returns "U0021", which is quite close to what I need, but adding \ results in an error:

paste0("\U", format(as.hexmode(33), width= 4, flag="0"))
Error: '\U' used without hex digits in character string starting ""\U"

I am stuck. And I am afraid that even if I figure out how to transform numbers to characters using as.hexmode(), there is still the problem that the table does not give decimals for all Unicode characters (see table, the decimals end at 591).

Any idea how to generate a vector with all the Unicode characters listed in the linked table?

(The question started with a real-world problem, but now I am mostly just eager to know how to do this.)

CodePudding user response:

There may be easier ways to do this, but here goes. The Unicode package contains everything you need.

First we can get a list of Unicode scripts and their ranges:

library(Unicode)  

uranges <- u_scripts()

Check what we've got:

head(uranges, 3)

$Adlam
[1] U+1E900..U+1E943 U+1E944..U+1E94A U+1E94B          U+1E950..U+1E959 U+1E95E..U+1E95F

$Ahom
 [1] U+11700..U+1171A U+1171D..U+1171F U+11720..U+11721 U+11722..U+11725 U+11726          U+11727..U+1172B U+11730..U+11739 U+1173A..U+1173B U+1173C..U+1173E U+1173F         
[11] U+11740..U+11746

$Anatolian_Hieroglyphs
[1] U+14400..U+14646

Next we can convert the ranges into their sequences.

expand_uranges <- lapply(uranges, as.u_char_seq)
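To make the expansion step concrete, here is a minimal base R sketch of what turning one range into its sequence of code points amounts to. The helper expand_range() is hypothetical and written only for illustration; it is not part of the Unicode package and is not how as.u_char_seq is implemented.

```r
# Hypothetical helper (not from the Unicode package): expand a range written as
# "U+XXXX..U+YYYY", or a single code point "U+XXXX", into integer code points.
expand_range <- function(x) {
  # Split on the literal ".." separator, then drop the "U+" prefix
  parts <- strsplit(x, "..", fixed = TRUE)[[1]]
  hex <- sub("^U\\+", "", parts)
  # Interpret the remaining digits as hexadecimal
  bounds <- strtoi(hex, base = 16L)
  # A single code point has one bound, so first and last coincide
  seq(bounds[1], bounds[length(bounds)])
}

katakana <- expand_range("U+30A1..U+30FA")
length(katakana)          # number of code points in the range
expand_range("U+1E94B")   # a single code point stays a length-one vector
```

The result is a plain integer vector, which is the same kind of value that intToUtf8() accepts.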

To get a single vector of all characters, we can unlist the result (although in practice the list, grouped by script, is easier to work with):

all_unicode_chars <- unlist(expand_uranges)

# The Wikipedia page linked states there are 144,697 characters 
length(all_unicode_chars)
[1] 144762

So this seems to be all of them, and the page needs updating. They are stored as integers, so to print them (assuming the glyph is supported) we can use intToUtf8(). For example, to print the Japanese katakana:

intToUtf8(expand_uranges$Katakana[[1]])

[1] "ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ"
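
As for the attempt in the question: paste0("\U", ...) fails because R resolves the \U escape while it parses the string literal, before paste0() ever runs, so a bare "\U" is not a valid string. A rough base R sketch that avoids escapes entirely is to go through the integer code point, using the decimal range 33 to 591 mentioned in the question. The variable name latin_chars is made up for this example.

```r
# Decimal code point to character: 33 is "!"
intToUtf8(33)

# Hex digits (as in "U+0021") to character, by way of an integer
intToUtf8(strtoi("0021", base = 16L))

# One string per code point for the decimal range 33..591 from the question.
# multiple = TRUE returns a character vector instead of one long string.
# Note that this range also contains non-printing control characters (127 to 159).
latin_chars <- intToUtf8(33:591, multiple = TRUE)
length(latin_chars)
head(latin_chars, 5)

# Round trip: character back to its code point
utf8ToInt("!")
```

The same call should work on the integer vectors produced with the Unicode package above, for example on one script at a time.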