Parse unicode digits in Go

Time:05-31

Other answers mention using unicode.IsDigit() to check if a given rune is a digit or not, but how do I figure out which digit it is then?

Atoi and ParseInt from strconv won't parse it.

IsDigit checks a table with all of these codepoints in it, but I can't figure out anything from that. Many of the number ranges start with their 0 digit at a codepoint ending in 0, but not all of them, so I can't just use char & 0xF.
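
To illustrate the masking problem, here is a minimal sketch. The helper name lowNibble is made up for this example. ARABIC-INDIC DIGIT ZERO sits at U+0660, so the mask happens to work there, but DEVANAGARI DIGIT ZERO sits at U+0966, so it does not.

```go
package main

import "fmt"

// lowNibble is a hypothetical helper that applies the naive `char & 0xF` trick.
func lowNibble(r rune) int {
	return int(r & 0xF)
}

func main() {
	// U+0660 ARABIC-INDIC DIGIT ZERO: the codepoint ends in 0, so the mask gives 0.
	fmt.Println(lowNibble('\u0660'))
	// U+0966 DEVANAGARI DIGIT ZERO: the codepoint ends in 6, so the mask gives 6.
	fmt.Println(lowNibble('\u0966'))
}
```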

My only other thought is whether there's a way to access either the Unicode name of a rune or its properties. Every numeric Unicode character (even fractions) seems to have a plain ASCII number associated with it behind the scenes as a property, but I can't find a way to access either that information or the name (all Unicode digits have names ending in "DIGIT ZERO", for example) anywhere. Do I need to look, or build, outside the standard library for this one?

Answer:

You can use the runenames package to identify a digit from its name.

This isn't a standard library package, but it is part of golang.org/x/:

These packages are part of the Go Project but outside the main Go tree. They are developed under looser compatibility requirements than the Go core. Install them with "go get".

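package main

import (
    "fmt"
    "strings"
    "unicode"

    "golang.org/x/text/unicode/runenames"
)

// whatDigit returns the value (0-9) of a Unicode decimal digit,
// or -1 if the rune is not a decimal digit.
func whatDigit(digit rune) int {
    // Reject non-digits first: names such as "CIRCLED DIGIT ONE"
    // also contain "DIGIT ONE" but are not decimal digits.
    if !unicode.IsDigit(digit) {
        return -1
    }
    name := runenames.Name(digit)
    switch {
    case strings.Contains(name, "DIGIT ZERO"):
        return 0
    case strings.Contains(name, "DIGIT ONE"):
        return 1
    case strings.Contains(name, "DIGIT TWO"):
        return 2
    case strings.Contains(name, "DIGIT THREE"):
        return 3
    case strings.Contains(name, "DIGIT FOUR"):
        return 4
    case strings.Contains(name, "DIGIT FIVE"):
        return 5
    case strings.Contains(name, "DIGIT SIX"):
        return 6
    case strings.Contains(name, "DIGIT SEVEN"):
        return 7
    case strings.Contains(name, "DIGIT EIGHT"):
        return 8
    case strings.Contains(name, "DIGIT NINE"):
        return 9
    default:
        return -1
    }
}

func main() {
    // U+0663 is ARABIC-INDIC DIGIT THREE.
    fmt.Println(whatDigit('\u0663'))
}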
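
If you would rather stay inside the standard library, one possible alternative is to use the same range table that IsDigit consults. This is only a sketch, and it is a different technique from the name lookup above. It assumes that decimal digits (category Nd) are always encoded as contiguous runs of ten in ascending order, so that every stride-1 range in unicode.Digit covers whole runs of 0 through 9. The helper name digitValue is made up for this example.

```go
package main

import (
	"fmt"
	"unicode"
)

// digitValue is a hypothetical helper, not a standard library function.
// Assumption: each stride-1 range in unicode.Digit starts at a zero digit
// and consists of whole runs of ten ascending digits.
func digitValue(r rune) int {
	for _, rng := range unicode.Digit.R16 {
		lo, hi := rune(rng.Lo), rune(rng.Hi)
		if rng.Stride == 1 && lo <= r && r <= hi {
			return int(r-lo) % 10
		}
	}
	for _, rng := range unicode.Digit.R32 {
		lo, hi := rune(rng.Lo), rune(rng.Hi)
		if rng.Stride == 1 && lo <= r && r <= hi {
			return int(r-lo) % 10
		}
	}
	return -1 // not a decimal digit
}

func main() {
	// ASCII 7, ARABIC-INDIC DIGIT THREE, DEVANAGARI DIGIT NINE,
	// MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO, and a non-digit.
	for _, r := range []rune{'7', '\u0663', '\u096F', '\U0001D7D8', 'x'} {
		fmt.Printf("%U -> %d\n", r, digitValue(r))
	}
}
```

The modulo handles ranges that hold several runs back to back, such as the mathematical digits starting at U+1D7CE.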

The runenames package documentation points to https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt, which has further information for each character, including the digit's value in plain ASCII. The package itself only exposes the name, though. From a look through that file, the digit names seem to follow the pattern used in the whatDigit function.
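
If you did want to read the value from that file yourself, a rough sketch might look like the following. It assumes the usual semicolon-separated layout of UnicodeData.txt, where the first field is the codepoint in hex and the seventh field is the decimal digit value (empty for anything that is not a decimal digit). The helper name decimalValue is made up for this example, and the two sample lines are copied in by hand rather than downloaded.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// decimalValue is a hypothetical helper that parses one line of
// UnicodeData.txt and returns the codepoint with its decimal digit value.
// The value is -1 when the decimal field is empty.
func decimalValue(line string) (rune, int, error) {
	fields := strings.Split(line, ";")
	if len(fields) < 7 {
		return 0, 0, fmt.Errorf("expected at least 7 fields, got %d", len(fields))
	}
	cp, err := strconv.ParseUint(fields[0], 16, 32)
	if err != nil {
		return 0, 0, err
	}
	if fields[6] == "" {
		return rune(cp), -1, nil
	}
	v, err := strconv.Atoi(fields[6])
	if err != nil {
		return 0, 0, err
	}
	return rune(cp), v, nil
}

func main() {
	lines := []string{
		"0663;ARABIC-INDIC DIGIT THREE;Nd;0;AN;;3;3;3;N;;;;;",
		"00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;",
	}
	for _, line := range lines {
		r, v, err := decimalValue(line)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%U -> %d\n", r, v)
	}
}
```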
