If we type in letters we get all lowercase letters from the English alphabet. However, there are many more possible characters, like ä or é, and there are symbols like $ or (, too. I found this table of Unicode characters, which is exactly what I need. Of course, I do not want to copy and paste hundreds of possible Unicode characters into one vector.
What I've tried so far: The table gives the decimals for (some of) the Unicode characters. For example, see the following small table:
Glyph  Decimal  Unicode  Usage in R
!      33       U+0021   "\U0021"
So if I type "\U0021" we get a !. Further,
paste0("U", format(as.hexmode(33), width = 4, flag = "0"))
returns "U0021", which is quite close to what I need, but adding \ results in an error:
paste0("\U", format(as.hexmode(33), width= 4, flag="0"))
Error: '\U' used without hex digits in character string starting ""\U"
I am stuck. And I am afraid that even if I figure out how to transform numbers to characters using as.hexmode(),
there is still the problem that the table does not give decimals for all Unicode characters (see table; the decimals end at 591).
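(For the record: the error occurs because \U escapes are resolved by the parser, not at run time, so they cannot be assembled with paste0(). A runtime alternative in base R is intToUtf8(), sketched here:)
intToUtf8(33)   # "!", the same character as "\U0021"
intToUtf8(591)  # the last character that has a decimal in the table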
Any idea how to generate a vector with all the Unicode characters listed in the linked table?
(The question started with a real world problem but now I am mostly simply eager to know how to do this.)
CodePudding user response:
There may be easier ways to do this, but here goes. The Unicode package contains everything you need.
First we can get a list of unicode scripts and the block ranges:
library(Unicode)
uranges <- u_scripts()
Check what we've got:
head(uranges, 3)
$Adlam
[1] U+1E900..U+1E943 U+1E944..U+1E94A U+1E94B U+1E950..U+1E959 U+1E95E..U+1E95F
$Ahom
[1] U+11700..U+1171A U+1171D..U+1171F U+11720..U+11721 U+11722..U+11725 U+11726 U+11727..U+1172B U+11730..U+11739 U+1173A..U+1173B U+1173C..U+1173E U+1173F
[11] U+11740..U+11746
$Anatolian_Hieroglyphs
[1] U+14400..U+14646
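Before expanding, a quick structural check can't hurt (this assumes u_scripts() returns a plain named list, as the printout above suggests):
# how many scripts are there, and what are they called?
length(uranges)
head(names(uranges))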
Next we can convert the ranges into their sequences.
expand_uranges <- lapply(uranges, as.u_char_seq)
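Each script now maps to a list of code-point vectors, one per contiguous range. A quick way to confirm the shape (Adlam is just the first script from above):
# each element should be a vector of integer code points
str(expand_uranges$Adlam, max.level = 1)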
To get a single vector of all characters we can unlist it. This won't be easy to work with, so really it would be better to keep them as a list:
all_unicode_chars <- unlist(expand_uranges)
# The Wikipedia page linked states there are 144,697 characters
length(all_unicode_chars)
[1] 144762
So that seems to be all of them, and the page needs updating. They are stored as integers, so to print them (assuming the glyph is supported) we can do the following, for example printing the Japanese katakana:
intToUtf8(expand_uranges$Katakana[[1]])
[1] "ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ"